Paperid:1
Authors:Yongkang Wang, Xuan Liu, Feng Huang, Zhankun Xiong, Wen Zhang
College of Informatics, Huazhong Agricultural University, Wuhan 430070, China, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan 430070,China
Title: A Multi-Modal Contrastive Diffusion Model for Therapeutic Peptide Generation
Abstract:
Therapeutic peptides represent a unique class of pharmaceutical agents crucial for the treatment of human diseases. Recently, deep generative models have exhibited remarkable potential for generating therapeutic peptides, but they utilize either sequence or structure information alone, which hinders generation performance. In this study, we propose a Multi-Modal Contrastive Diffusion model (MMCD), fusing both sequence and structure modalities in a diffusion framework to co-generate novel peptide sequences and structures. Specifically, MMCD constructs sequence-modal and structure-modal diffusion models, respectively, and devises a multi-modal contrastive learning strategy with inter-contrastive and intra-contrastive objectives at each diffusion timestep, aiming to capture the consistency between the two modalities and boost model performance. The inter-contrastive objective aligns sequences and structures of peptides by maximizing the agreement of their embeddings, while the intra-contrastive objective differentiates therapeutic and non-therapeutic peptides by maximizing the disagreement of their sequence/structure embeddings simultaneously. Extensive experiments demonstrate that MMCD performs better than other state-of-the-art deep generative methods in generating therapeutic peptides across various metrics, including antimicrobial/anticancer score, diversity, and peptide-docking.
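The inter- and intra-contrastive objectives follow the standard contrastive-learning recipe of maximizing agreement between matched embeddings and disagreement between mismatched ones. Below is a minimal, hypothetical PyTorch sketch of such objectives; the function names, temperature, and batching scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def inter_contrastive_loss(seq_emb, struct_emb, temperature=0.1):
    """InfoNCE-style loss aligning sequence and structure embeddings of the
    same peptide (positives) against the other peptides in the batch."""
    seq_emb = F.normalize(seq_emb, dim=-1)        # (B, D)
    struct_emb = F.normalize(struct_emb, dim=-1)  # (B, D)
    logits = seq_emb @ struct_emb.t() / temperature
    targets = torch.arange(seq_emb.size(0), device=seq_emb.device)
    # symmetric cross-entropy: sequence->structure and structure->sequence
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def intra_contrastive_loss(pos_emb, neg_emb, temperature=0.1):
    """Pushes therapeutic (pos) and non-therapeutic (neg) embeddings apart by
    penalizing high cosine similarity between the two groups."""
    pos_emb = F.normalize(pos_emb, dim=-1)
    neg_emb = F.normalize(neg_emb, dim=-1)
    sim = pos_emb @ neg_emb.t() / temperature     # (B_pos, B_neg)
    return torch.logsumexp(sim, dim=-1).mean()
```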



Paperid:2
Authors:Chen Bai, Jianwang Zhai, Yuzhe Ma, Bei Yu, Martin D. F. Wong
The Chinese University of Hong Kong, Beijing University of Posts and Telecommunications, The Hong Kong University of Science and Technology (Guangzhou), The Chinese University of Hong Kong, Hong Kong Baptist University
Abstract:
Microarchitecture determines the implementation of a microprocessor. Designing a microarchitecture to achieve a better performance, power, and area (PPA) trade-off has become increasingly difficult. Previous data-driven methodologies hold inappropriate assumptions and lack tight coupling with expert knowledge. This paper proposes a novel reinforcement learning-based (RL) solution that addresses these limitations. By integrating a microarchitecture scaling graph, PPA preference space embedding, and a proposed lightweight environment into RL, experiments using commercial electronic design automation (EDA) tools show that our method achieves an average PPA trade-off improvement of 16.03% over previous state-of-the-art approaches with 4.07× higher efficiency. The solution qualities outperform human implementations by up to 2.03× in the PPA trade-off.



Paperid:3
Authors:Shreyas Bhat Brahmavar, Ashwin Srinivasan, Tirtharaj Dash, Sowmya Ramaswamy Krishnan, Lovekesh Vig, Arijit Roy, Raviprasad Aduri
Department of Electrical and Electronics Engineering, BITS Pilani, Goa Campus, India Department of Biological Sciences, BITS Pilani, Goa Campus, India, Department of Computer Science, BITS Pilani, Goa Campus, India, Department of Pediatrics, University of California, San Diego, USA, TCS Innovation Labs (Life Sciences Division), India, TCS Research, India, TCS Innovation Labs (Life Sciences Division), India, Department of Biological Sciences, BITS Pilani, Goa Campus, India
Abstract:
Large Language Models (LLMs) can be used as repositories of biological and chemical information to generate pharmacological lead compounds. However, for LLMs to focus on specific drug targets typically requires experimentation with progressively more refined prompts. Results thus become dependent not just on what is known about the target, but also on what is known about prompt-engineering. In this paper, we separate the prompt into domain-constraints that can be written in a standard logical form and a simple text-based query. We investigate whether LLMs can be guided, not by refining prompts manually, but by refining the logical component automatically, keeping the query unchanged. We describe an iterative procedure, LMLF ("Language Model with Logical Feedback"), in which the constraints are progressively refined using a logical notion of generalisation. On any iteration, newly generated instances are verified against the constraints, providing "logical feedback" for the next iteration's refinement of the constraints. We evaluate LMLF using two well-known targets (inhibition of the Janus Kinase 2 and the Dopamine Receptor D2) and two different LLMs (GPT-3 and PaLM). We show that LMLF, starting with the same logical constraints and query text, can be used to guide both LLMs to generate potential leads. We find: (a) binding affinities of LMLF-generated molecules are skewed towards higher binding affinities than those from existing baselines; (b) LMLF results in generating molecules that are skewed towards higher binding affinities than without logical feedback; (c) assessment by a computational chemist suggests that LMLF-generated compounds may be novel inhibitors. These findings suggest that LLMs with logical feedback may provide a mechanism for generating new leads without requiring the domain specialist to acquire sophisticated skills in prompt-engineering.
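The iterative refinement described above can be pictured as a simple generate-verify-refine loop in which only the logical constraints change between iterations. The sketch below is a schematic illustration under assumptions: llm_generate, satisfies_constraints, and refine_constraints are hypothetical placeholders for the LLM call, the logical check, and the constraint-generalisation step, and this is not the authors' code.

```python
def lmlf(query, constraints, llm_generate, satisfies_constraints,
         refine_constraints, n_iterations=5):
    """Schematic LMLF loop: the text query stays fixed while the logical
    constraints are refined from feedback on generated instances."""
    accepted = []
    for _ in range(n_iterations):
        # generate candidates from the fixed query plus the current constraints
        candidates = llm_generate(query=query, constraints=constraints)
        # verify each candidate against the logical constraints
        feedback = [(c, satisfies_constraints(c, constraints)) for c in candidates]
        accepted.extend(c for c, ok in feedback if ok)
        # use the pass/fail feedback to generalise/refine the constraints
        constraints = refine_constraints(constraints, feedback)
    return accepted, constraints
```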



Paperid:4
Authors:Ying-Ying Chang, Wei-Yao Wang, Wen-Chih Peng
National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University
Abstract:
In the dynamic and rapidly evolving world of social media, detecting anomalous users has become a crucial task to address malicious activities such as misinformation and cyberbullying. As anomalous users increasingly mimic normal users and evade detection, existing methods that focus only on bot detection are ineffective at capturing subtle distinctions between users. To address these challenges, we propose SeGA, a preference-aware self-contrastive learning framework for anomalous user detection, which leverages heterogeneous entities and their relations in the Twittersphere to detect anomalous users with different malicious strategies. SeGA utilizes the knowledge of large language models to summarize user preferences via posts. In addition, integrating user preferences with prompts as pseudo-labels for preference-aware self-contrastive learning enables the model to learn multifaceted aspects for describing the behaviors of users. Extensive experiments on the proposed TwBNT benchmark demonstrate that SeGA significantly outperforms the state-of-the-art methods (+3.5% ∼ 27.6%) and empirically validate the effectiveness of the model design and pre-training strategies. Our code and data are publicly available at https://github.com/ying0409/SeGA.



Paperid:5
Authors:Zhihao Chang, Linzhu Yu, Yanchao Xu, Wentao Hu
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang Police college
Abstract:
Biological sequence nearest neighbor search plays a fundamental role in bioinformatics. To alleviate the pain of quadratic complexity for conventional distance computation, neural distance embeddings, which project sequences into geometric space, have been recognized as a promising paradigm. To maintain the distance order between sequences, these models all deploy triplet loss and use intuitive methods to select a subset of triplets for training from a vast selection space. However, we observed that such training often enables models to distinguish only a fraction of distance orders, leaving others unrecognized. Moreover, naively selecting more triplets for training under the state-of-the-art network not only adds costs but also hampers model performance. In this paper, we introduce Bio-kNN: a kNN search framework for biological sequences. It includes a systematic triplet selection method and a multi-head network, enhancing the discernment of all distance orders without increasing training expenses. Initially, we propose a clustering-based approach to partition all triplets into several clusters with similar properties, and then select triplets from these clusters using an innovative strategy. Meanwhile, we noticed that simultaneously training different types of triplets in the same network cannot achieve the expected performance, thus we propose a multi-head network to tackle this. Our network employs a convolutional neural network (CNN) to extract local features shared by all clusters, and then learns a multi-layer perceptron (MLP) head for each cluster separately. Besides, we treat the CNN as a special head, thereby integrating crucial local features which are neglected in previous models into our model for similarity recognition. Extensive experiments show that our Bio-kNN significantly outperforms the state-of-the-art methods on two large-scale datasets without increasing the training cost.
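The distance-order objective that such neural distance embeddings optimize is typically a triplet margin loss: for an anchor sequence, the embedding of the sequence with the smaller true distance should end up closer than that of the farther one. A minimal PyTorch sketch of this standard loss is shown below; it illustrates the general objective only, not Bio-kNN's specific triplet selection or multi-head network.

```python
import torch
import torch.nn.functional as F

def distance_order_triplet_loss(anchor, closer, farther, margin=1.0):
    """For each anchor embedding, the sequence with the smaller true edit
    distance ('closer') should embed nearer than the 'farther' sequence."""
    d_pos = F.pairwise_distance(anchor, closer)    # (B,)
    d_neg = F.pairwise_distance(anchor, farther)   # (B,)
    return F.relu(d_pos - d_neg + margin).mean()
```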



Paperid:6
Authors:Haoyang Chen, Peiyan Sun, Qiyuan Song, Wanyuan Wang, Weiwei Wu, Wencan Zhang, Guanyu Gao, Yan Lyu
Southeast University, Southeast University, Southeast University, Southeast University, Southeast University, National University of Singapore, Nanjing University of Science and Technology, Southeast University
Abstract:
Ride-hailing platforms have been facing the challenge of balancing demand and supply. Existing vehicle reposition techniques often treat drivers as homogeneous agents and relocate them deterministically, assuming compliance with the reposition. In this paper, we consider a more realistic and driver-centric scenario where drivers have unique cruising preferences and can decide on their own whether to take the recommendation or not. We propose i-Rebalance, a personalized vehicle reposition technique with deep reinforcement learning (DRL). i-Rebalance estimates drivers' decisions on accepting reposition recommendations through an on-field user study involving 99 real drivers. To optimize supply-demand balance and enhance preference satisfaction simultaneously, i-Rebalance has a sequential reposition strategy with dual DRL agents: a Grid Agent to determine the reposition order of idle vehicles, and a Vehicle Agent to provide personalized recommendations to each vehicle in the pre-defined order. This sequential learning strategy facilitates more effective policy training within a smaller action space compared to traditional joint-action methods. Evaluation on real-world trajectory data shows that i-Rebalance improves the driver acceptance rate by 38.07% and total driver income by 9.97%.



Paperid:7
Authors:Le Cheng, Peican Zhu, Keke Tang, Chao Gao, Zhen Wang
Northwestern Polytechnical University, Northwestern Polytechnical University, Guangzhou University, Northwestern Polytechnical University, Northwestern Polytechnical University
Abstract:
Source detection in graphs has demonstrated robust efficacy in the domain of rumor source identification. Although recent solutions have enhanced performance by leveraging deep neural networks, they often require complete user data. In this paper, we address a more challenging task, rumor source detection with incomplete user data, and propose a novel framework, i.e., Source Detection in Graphs with Incomplete Nodes via Positional Encoding and Attentive Fusion (GIN-SD), to tackle this challenge. Specifically, our approach utilizes a positional embedding module to distinguish nodes that are incomplete and employs a self-attention mechanism to focus on nodes with greater information transmission capacity. To mitigate the prediction bias caused by the significant disparity between the numbers of source and non-source nodes, we also introduce a class-balancing mechanism. Extensive experiments validate the effectiveness of GIN-SD and its superiority to state-of-the-art methods.



Paperid:8
Authors:Yoni Choukroun, Lior Wolf
Tel Aviv University, Tel Aviv University, Israel
Abstract:
Quantum error correction codes (QECC) are a key component for realizing the potential of quantum computing. QECC, like its classical counterpart (ECC), enables the reduction of error rates by distributing quantum logical information across redundant physical qubits, such that errors can be detected and corrected. In this work, we efficiently train novel end-to-end deep quantum error decoders. We resolve the quantum measurement collapse by augmenting syndrome decoding to predict an initial estimate of the system noise, which is then refined iteratively through a deep neural network. The logical error rates calculated over finite fields are directly optimized via a differentiable objective, enabling efficient decoding under the constraints imposed by the code. Finally, our architecture is extended to support faulty syndrome measurement, by efficient decoding of repeated syndrome sampling. The proposed method demonstrates the power of neural decoders for QECC by achieving state-of-the-art accuracy, outperforming, for small-distance topological codes, the existing end-to-end neural and classical decoders, which are often computationally prohibitive.



Paperid:9
Authors:Chaoqun Cui, Caiyan Jia
School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University
Abstract:
Rumor detection on social media has become increasingly important. Most existing graph-based models presume rumor propagation trees (RPTs) have deep structures and learn sequential stance features along branches. However, through statistical analysis on real-world datasets, we find RPTs exhibit wide structures, with most nodes being shallow 1-level replies. To focus learning on intensive substructures, we propose the Rumor Adaptive Graph Contrastive Learning (RAGCL) method with adaptive view augmentation guided by node centralities. We summarize three principles for RPT augmentation: 1) exempt root nodes, 2) retain deep reply nodes, 3) preserve lower-level nodes in deep sections. We employ node dropping, attribute masking and edge dropping with probabilities derived from centrality-based importance scores to generate views. A graph contrastive objective then learns robust rumor representations. Extensive experiments on four benchmark datasets demonstrate RAGCL outperforms state-of-the-art methods. Our work reveals the wide-structure nature of RPTs and contributes an effective graph contrastive learning approach tailored for rumor detection through principled adaptive augmentation. The proposed principles and augmentation techniques can potentially benefit other applications involving tree-structured graphs.
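Centrality-guided view augmentation of this kind usually maps a node-importance score to a per-node dropping probability, so that important nodes are rarely removed. The following PyTorch sketch is a generic, hypothetical version of that mapping, in the style of adaptive graph augmentation; it does not reproduce RAGCL's exact rules for root nodes and deep replies.

```python
import torch

def node_drop_probabilities(centrality, p_mean=0.2, p_max=0.5):
    """Maps node centralities to dropping probabilities: high-centrality
    (important) nodes get low drop probability, low-centrality nodes high."""
    score = torch.log(centrality + 1e-8)
    weight = (score.max() - score) / (score.max() - score.mean() + 1e-8)
    return torch.clamp(weight * p_mean, max=p_max)

def drop_nodes(edge_index, num_nodes, drop_prob):
    """Removes edges incident to dropped nodes to form one augmented view."""
    keep = torch.rand(num_nodes) >= drop_prob
    mask = keep[edge_index[0]] & keep[edge_index[1]]
    return edge_index[:, mask]
```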



Paperid:10
Authors:Longchao Da, Minquan Gao, Hao Mei, Hua Wei
Arizona State University, Johns Hopkins University, Arizona State University, Arizona State University
Abstract:
Numerous solutions have been proposed for Traffic Signal Control (TSC) tasks, aiming to provide efficient transportation and alleviate traffic congestion. Recently, promising results have been attained by Reinforcement Learning (RL) methods through trial and error in simulators, bringing confidence in solving cities' congestion problems. However, performance gaps still exist when simulator-trained policies are deployed to the real world. This issue is mainly introduced by the difference in system dynamics between the training simulators and the real-world environments. In this work, we leverage the knowledge of Large Language Models (LLMs) to understand and profile the system dynamics through a prompt-based grounded action transformation to bridge the performance gap. Specifically, this paper exploits the pre-trained LLM's inference ability to understand how traffic dynamics change with weather conditions, traffic states, and road types. Being aware of the changes, the policies' actions are taken and grounded based on realistic dynamics, thus helping the agent learn a more realistic policy. We conduct experiments on four different scenarios to show the effectiveness of the proposed PromptGAT in mitigating the performance gap of reinforcement learning from simulation to reality (sim-to-real).



Paperid:11
Authors:Na Fan, Zeyue Tian, Amartansh Dubey, Samruddhi Deshmukh, Ross Murch, Qifeng Chen
The Hong Kong University of Science and Technology, The Hong Kong University of Science and Technology, The Hong Kong University of Science and Technology, The Hong Kong University of Science and Technology, The Hong Kong University of Science and Technology, The Hong Kong University of Science and Technology
Abstract:
Device-free localization (DFL) using easily obtained Wi-Fi received signal strength (RSS) has wide real-world applications, as it does not require people to carry trackable devices. However, accurate multitarget DFL remains challenging due to the unknown number of targets, multipath interference (MPI), especially between nearby targets, and limited real-world data. In this study, we are the first to propose a transformer-based learning method with Wi-Fi RSS as input, together with an attentional prior fusion module, to simultaneously locate an unknown number of people at random positions. To overcome the multitarget data collection challenges, we contribute a large-scale cross-domain real-simulation-augmentation training dataset with one and two real-world nearby non-person objects at limited positions and up to five simulated and augmented randomly distributed targets. Experimental results demonstrate our method's improved accuracy, generalization ability, and robustness with fewer Wi-Fi nodes than previous methods.



Paperid:12
Authors:Haisong Gong, Weizhi Xu, Shu Wu, Qiang Liu, Liang Wang
Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, ByteDance Inc., Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences
Abstract:
Fact checking aims to predict claim veracity by reasoning over multiple evidence pieces. It usually involves evidence retrieval and veracity reasoning. In this paper, we focus on the latter, reasoning over unstructured text and structured table information. Previous works have primarily relied on fine-tuning pre-trained language models or training homogeneous-graph-based models. Despite their effectiveness, we argue that they fail to explore the rich semantic information underlying the evidence with different structures. To address this, we propose a novel word-level Heterogeneous-graph-based model for Fact Checking over unstructured and structured information, namely HeterFC. Our approach leverages a heterogeneous evidence graph, with words as nodes and thoughtfully designed edges representing different evidence properties. We perform information propagation via a relational graph neural network, facilitating interactions between claims and evidence. An attention-based method is utilized to integrate information, combined with a language model for generating predictions. We introduce a multitask loss function to account for potential inaccuracies in evidence retrieval. Comprehensive experiments on the large fact checking dataset FEVEROUS demonstrate the effectiveness of HeterFC. Code will be released at: https://github.com/Deno-V/HeterFC.



Paperid:13
Authors:Haisong Gong, Qiang Liu, Shu Wu, Liang Wang
Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences
Abstract:
Text-guided molecule generation is a task where molecules are generated to match specific textual descriptions. Recently, most existing SMILES-based molecule generation methods rely on an autoregressive architecture. In this work, we propose the Text-Guided Molecule Generation with Diffusion Language Model (TGM-DLM), a novel approach that leverages diffusion models to address the limitations of autoregressive methods. TGM-DLM updates token embeddings within the SMILES string collectively and iteratively, using a two-phase diffusion generation process. The first phase optimizes embeddings from random noise, guided by the text description, while the second phase corrects invalid SMILES strings to form valid molecular representations. We demonstrate that TGM-DLM outperforms MolT5-Base, an autoregressive model, without the need for additional data resources. Our findings underscore the remarkable effectiveness of TGM-DLM in generating coherent and precise molecules with specific properties, opening new avenues in drug discovery and related scientific domains. Code will be released at: https://github.com/Deno-V/tgm-dlm.



Paperid:14
Authors:Jiazhi Guan, Yi Zhao, Zhuoer Xu, Changhua Meng, Ke Xu, Youjian Zhao
DCST, BNRist, Tsinghua University, Beijing Institute of Technology, Ant Group, Ant Group, DCST, BNRist, Tsinghua University Zhongguancun Laboratory, DCST, BNRist, Tsinghua University Zhongguancun Laboratory
Abstract:
The non-consensual exploitation of facial manipulation has emerged as a pressing societal concern. In tandem with the identification of such fake content, recent research endeavors have advocated countering manipulation techniques through proactive interventions, specifically the incorporation of adversarial noise to impede the manipulation in advance. Nevertheless, with insufficient consideration of robustness, we show that current methods falter in providing protection after simple perturbations, e.g., blur. In addition, traditional optimization-based methods face limitations in scalability as they struggle to accommodate the substantial expansion of data volume, a consequence of the time-intensive iterative pipeline. To solve these challenges, we propose a learning-based model, Adversarial Robust Safeguard (ARS), to generate desirable protection noise in a single forward process, concurrently exhibiting heightened resistance against prevalent perturbations. Specifically, our method involves a two-way protection design, characterized by a basic protection component responsible for generating efficacious noise features, coupled with robust protection for further enhancement. In robust protection, we first fuse image features with spatially duplicated noise embedding, thereby accounting for inherent information redundancy. Subsequently, a combination comprising a differentiable perturbation module and an adversarial network is devised to simulate potential information degradation during the training process. To evaluate it, we conduct experiments on four manipulation methods and compare with recent works comprehensively. The results show that our method achieves good visual effects with pronounced robustness against varied perturbations at different levels.



Paperid:15
Authors:Dongyue Guo, Zheng Zhang, Zhen Yan, Jianwei Zhang, Yi Lin
College of Computer Science, Sichuan University, Chengdu 610000, China, College of Computer Science, Sichuan University, Chengdu 610000, China, College of Computer Science, Sichuan University, Chengdu 610000, China, College of Computer Science, Sichuan University, Chengdu 610000, China, College of Computer Science, Sichuan University, Chengdu 610000, China
Abstract:
Flight Trajectory Prediction (FTP) is an essential task in Air Traffic Control (ATC), which can assist air traffic controllers in managing airspace more safely and efficiently. Existing approaches generally perform multi-horizon FTP tasks in an autoregressive manner, thereby suffering from error accumulation and low efficiency. In this paper, a novel framework, called FlightBERT++, is proposed to i) forecast multi-horizon flight trajectories directly in a non-autoregressive way, and ii) address the limitation of the binary encoding (BE) representation in FlightBERT. Specifically, FlightBERT++ is implemented by a generalized encoder-decoder architecture, in which the encoder learns the temporal-spatial patterns from historical observations and the decoder predicts the flight status for the future horizons. Compared with the conventional architecture, an innovative horizon-aware contexts generator is dedicatedly designed to consider the prior horizon information, which further enables non-autoregressive multi-horizon prediction. Moreover, a differential prompted decoder is proposed to enhance the capability of the differential predictions by leveraging the stationarity of the differential sequence. Experimental results on a real-world dataset demonstrate that FlightBERT++ outperforms the competitive baselines in both FTP performance and computational efficiency.



Paperid:16
Authors:Hongcheng Guo, Jian Yang, Jiaheng Liu, Jiaqi Bai, Boyang Wang, Zhoujun Li, Tieqiao Zheng, Bo Zhang, Junran Peng, Qi Tian
State Key Lab of Software Development Environment, Beihang University, Beijing, China, State Key Lab of Software Development Environment, Beihang University, Beijing, China, State Key Lab of Software Development Environment, Beihang University, Beijing, China, State Key Lab of Software Development Environment, Beihang University, Beijing, China, State Key Lab of Software Development Environment, Beihang University, Beijing, China, State Key Lab of Software Development Environment, Beihang University, Beijing, China, Cloudwise Research, Beijing, China, Cloudwise Research, Beijing, China, State Key Lab of Software Development Environment, Beihang University, Beijing, China, Huawei, Beijing, China
Abstract:
Log anomaly detection is a key component in the field of artificial intelligence for IT operations (AIOps). Considering log data from variant domains, retraining the whole network for unknown domains is inefficient in real industrial scenarios. However, previous deep models merely focused on extracting the semantics of log sequences in the same domain, leading to poor generalization on multi-domain logs. To alleviate this issue, we propose a unified Transformer-based framework for Log anomaly detection (LogFormer) to improve the generalization ability across different domains, where we establish a two-stage process including the pre-training and adapter-based tuning stages. Specifically, our model is first pre-trained on the source domain to obtain shared semantic knowledge of log data. Then, we transfer such knowledge to the target domain via shared parameters. Besides, the Log-Attention module is proposed to supplement the information ignored by log-parsing. The proposed method is evaluated on three public datasets and one real-world dataset. Experimental results on multiple benchmarks demonstrate the effectiveness of our LogFormer with fewer trainable parameters and lower training costs.



Paperid:17
Authors:Zhi Jin, Sheng Xu, Xiang Zhang, Tianze Ling, Nanqing Dong, Wanli Ouyang, Zhiqiang Gao, Cheng Chang, Siqi Sun
Shanghai Artificial Intelligence Laboratory Department of Computer Science, Soochow University, Shanghai Artificial Intelligence Laboratory Research Institute of Intelligent Complex Systems, Fudan University, University of British Columbia Shanghai Artificial Intelligence Laboratory, National Center for Protein Sciences (Beijing), Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, National Center for Protein Sciences (Beijing), Shanghai Artificial Intelligence Laboratory Research Institute of Intelligent Complex Systems, Fudan University
Abstract:
De novo peptide sequencing from mass spectrometry (MS) data is a critical task in proteomics research. Traditional de novo algorithms have encountered a bottleneck in accuracy due to the inherent complexity of proteomics data. While deep learning-based methods have shown progress, they reduce the problem to a translation task, potentially overlooking critical nuances between spectra and peptides. In our research, we present ContraNovo, a pioneering algorithm that leverages contrastive learning to extract the relationship between spectra and peptides and incorporates the mass information into peptide decoding, aiming to address these intricacies more efficiently. Through rigorous evaluations on two benchmark datasets, ContraNovo consistently outshines contemporary state-of-the-art solutions, underscoring its promising potential in enhancing de novo peptide sequencing.



Paperid:18
Authors:Seungjun Lee, TaeiL Oh
Alsemy, South Korea, Alsemy, South Korea
Abstract:
Solving partial differential equations (PDEs) by learning the solution operators has emerged as an attractive alternative to traditional numerical methods. However, implementing such architectures presents two main challenges: flexibility in handling irregular and arbitrary input and output formats, and scalability to large discretizations. Most existing architectures are limited to their intended structure or are infeasible to scale to large inputs and outputs. To address these issues, we introduce an attention-based model called the inducing point operator transformer (IPOT). Inspired by inducing points methods, IPOT is designed to handle any input function and output query while capturing global interactions in a computationally efficient way. By detaching the input/output discretizations from the processor with a smaller latent bottleneck, IPOT offers flexibility in processing arbitrary discretizations and scales linearly with the size of inputs/outputs. Our experimental results demonstrate that IPOT achieves strong performance with manageable computational complexity on an extensive range of PDE benchmarks and real-world weather forecasting scenarios, compared to state-of-the-art methods. Our code is publicly available at https://github.com/7tl7qns7ch/IPOT.
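The inducing-point idea can be illustrated with a Perceiver-style latent bottleneck: the input discretization is cross-attended into a fixed set of learned inducing points, and output queries cross-attend back into that latent set, so cost grows linearly in the number of input and output points. The PyTorch sketch below is a minimal, hypothetical illustration of this pattern; it omits IPOT's processor layers and embedding details.

```python
import torch
import torch.nn as nn

class InducingPointBottleneck(nn.Module):
    """Encodes arbitrary-length inputs into a fixed number of learned
    inducing points via cross-attention, then decodes at arbitrary queries."""
    def __init__(self, dim=128, num_inducing=64, num_heads=4):
        super().__init__()
        self.inducing = nn.Parameter(torch.randn(num_inducing, dim))
        self.encode = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.decode = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs, queries):
        # inputs:  (B, N_in, dim)  arbitrary discretization of the input function
        # queries: (B, N_out, dim) embedded output query coordinates
        B = inputs.size(0)
        latent = self.inducing.unsqueeze(0).expand(B, -1, -1)
        latent, _ = self.encode(latent, inputs, inputs)   # inputs -> latents
        out, _ = self.decode(queries, latent, latent)     # latents -> queries
        return out
```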



Paperid:19
Authors:Tong Li, Zhaoyang Liu, Yanyan Shen, Xue Wang, Haokun Chen, Sen Huang
Shanghai Jiao Tong University, Alibaba Group, Shanghai Jiao Tong University, Alibaba Group, Alibaba Group, Alibaba Group
Abstract:
Stock price forecasting has remained an extremely challenging problem for many decades due to the high volatility of the stock market. Recent efforts have been devoted to modeling complex stock correlations toward joint stock price forecasting. Existing works share a common neural architecture that learns temporal patterns from individual stock series and then mixes up temporal representations to establish stock correlations. However, they only consider time-aligned stock correlations stemming from all the input stock features, which suffer from two limitations. First, stock correlations often occur momentarily and in a cross-time manner. Second, feature effectiveness is dynamic with market variation, which affects both the stock sequential patterns and their correlations. To address these limitations, this paper introduces MASTER, a MArket-guided Stock TransformER, which models the momentary and cross-time stock correlation and leverages market information for automatic feature selection. MASTER elegantly tackles the complex stock correlation by alternately engaging in intra-stock and inter-stock information aggregation. Experiments show the superiority of MASTER compared with previous works and visualize the captured realistic stock correlation to provide valuable insights.



Paperid:20
Authors:Yanhong Li, Jack Xu, David Anastasiu
Santa Clara University, Santa Clara, CA, USA, Santa Clara Valley Water District, San Jose, CA, USA, Santa Clara University, Santa Clara, CA, USA
Abstract:
In the hydrology field, time series forecasting is crucial for efficient water resource management, improving flood and drought control and increasing the safety and quality of life for the general population. However, predicting long-term streamflow is a complex task due to the presence of extreme events. It requires the capture of long-range dependencies and the modeling of rare but important extreme values. Existing approaches often struggle to tackle these dual challenges simultaneously. In this paper, we specifically delve into these issues and propose the Distance-weighted Auto-regularized Neural network (DAN), a novel extreme-adaptive model for long-range forecasting of streamflow enhanced by polar representation learning. DAN utilizes a distance-weighted multi-loss mechanism and stackable blocks to dynamically refine indicator sequences from exogenous data, while also being able to handle univariate time series by employing Gaussian Mixture probability modeling to improve robustness to severe events. We also introduce Kruskal-Wallis sampling and gate control vectors to handle imbalanced extreme data. On four real-life hydrologic streamflow datasets, we demonstrate that DAN significantly outperforms both state-of-the-art hydrologic time series prediction methods and general methods designed for long-term time series prediction.



Paperid:21
Authors:Yijun Li, Cheuk Hang Leung, Xiangqian Sun, Chaoqun Wang, Yiyan Huang, Xing Yan, Qi Wu, Dongdong Wang, Zhixiang Huang
City University of Hong Kong, City University of Hong Kong, Xi’an Jiaotong Liverpool University, City University of Hong Kong, City University of Hong Kong, Renmin University of China, City University of Hong Kong, JD Digits, JD Digits
Abstract:
Consumer credit services offered by electronic commerce platforms provide customers with convenient loan access during shopping and have the potential to stimulate sales. To understand the causal impact of credit lines on spending, previous studies have employed causal estimators (e.g., direct regression (DR), inverse propensity weighting (IPW), and double machine learning (DML)) to estimate the treatment effect. However, these estimators do not treat the spending of each individual as a distribution that can capture the range and pattern of amounts spent across different orders. By disregarding the outcome as a distribution, valuable insights embedded within the outcome distribution might be overlooked. This paper thus develops distribution-valued estimators which extend the existing real-valued DR, IPW, and DML estimators within Rubin's causal framework. We establish their consistency and apply them to a real dataset from a large electronic commerce platform. Our findings reveal that credit lines generally have a positive impact on spending across all quantiles, but consumers would allocate more to luxuries (higher quantiles) than necessities (lower quantiles) as credit lines increase.
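As a point of reference for the estimators being extended, the classical real-valued IPW estimator weights observed outcomes by the inverse of the estimated propensity of receiving treatment. The sketch below is a standard, scalar-outcome IPW implementation using scikit-learn for the propensity model; the distribution-valued extension proposed in the paper would replace the scalar outcome with a per-individual spending distribution, and the variable names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y):
    """Real-valued inverse-propensity-weighting estimate of the average
    treatment effect: X covariates, t binary treatment, y scalar outcome."""
    propensity = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    propensity = np.clip(propensity, 1e-3, 1 - 1e-3)   # avoid extreme weights
    treated = np.mean(t * y / propensity)
    control = np.mean((1 - t) * y / (1 - propensity))
    return treated - control
```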



Paperid:22
Authors:Zhengyi Li, Menglu Li, Lida Zhu, Wen Zhang
College of Informatics, Huazhong Agricultural University, Wuhan 430070, China, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan 430070,China
Abstract:
Protein post-translational modification (PTM) site prediction is a fundamental task in bioinformatics. Several computational methods have been developed to predict PTM sites. However, existing methods ignore the structure information and merely utilize protein sequences. Furthermore, a more fine-grained structure representation learning method is urgently needed, as PTM is a biological event that occurs at the atom granularity. In this paper, we propose a PTM site prediction method by Coupling of Multi-Granularity structure and Multi-Scale sequence representation, PTM-CMGMS for brevity. Specifically, multi-granularity structure-aware representation learning is designed to learn neighborhood structure representations at the amino acid, atom, and whole-protein granularity from AlphaFold-predicted structures, followed by contrastive learning to optimize the structure representations. Additionally, multi-scale sequence representation learning is used to extract context sequence information, and a motif generated by aligning all context sequences of PTM sites assists the prediction. Extensive experiments on three datasets show that PTM-CMGMS outperforms the state-of-the-art methods. Source code can be found at https://github.com/LZY-HZAU/PTM-CMGMS.



Paperid:23
Authors:Minghui Liao, Guojia Wan, Bo Du
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, China Institute of Artificial Intelligence, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, China Institute of Artificial Intelligence, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China, National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, China Institute of Artificial Intelligence, Wuhan University, China Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China
Abstract:
Determining the types of neurons within a nervous system plays a significant role in the analysis of brain connectomics and the investigation of neurological diseases. However, the efficiency of utilizing anatomical, physiological, or molecular characteristics of neurons is relatively low and costly. With the advancements in electron microscopy imaging and analysis techniques for brain tissue, we are able to obtain whole-brain connectomes consisting of high-resolution neuronal morphology and connectivity information. However, few models are built based on such data for automated neuron classification. In this paper, we propose NeuNet, a framework that combines morphological information of neurons obtained from the skeleton and topological information between neurons obtained from the neural circuit. Specifically, NeuNet consists of three components, namely the Skeleton Encoder, the Connectome Encoder, and the Readout Layer. The Skeleton Encoder integrates the local information of neurons in a bottom-up manner, applying a one-dimensional convolution to the neural skeleton's point data; the Connectome Encoder uses a graph neural network to capture the topological information of the neural circuit; finally, the Readout Layer fuses the above two kinds of information and outputs classification results. We reprocess and release two new datasets for the neuron classification task from volume electron microscopy (VEM) images of human brain cortex and Drosophila brain. Experiments on these two datasets demonstrate the effectiveness of our model with accuracies of 0.9169 and 0.9363, respectively. Code and data are available at: https://github.com/WHUminghui/NeuNet.



Paperid:24
Authors:Cheng-Ming Lin, Ching Chang, Wei-Yao Wang, Kuang-Da Wang, Wen-Chih Peng
National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University
Abstract:
In recent years, microservices have gained widespread adoption in IT operations due to their scalability, maintainability, and flexibility. However, it becomes challenging for site reliability engineers (SREs) to pinpoint the root cause of system malfunctions due to the complex relationships among microservices. Previous research employed structure learning methods (e.g., the PC-algorithm) to establish causal relationships and derive root causes from causal graphs. Nevertheless, they ignored the temporal order of time series data and failed to leverage the rich information inherent in the temporal relationships. For instance, a sudden spike in CPU utilization can lead to an increase in latency for other microservices. However, in this scenario, the anomaly in CPU utilization occurs before the latency increases, rather than simultaneously. As a result, the PC-algorithm fails to capture such characteristics. To address these challenges, we propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning. RUN enhances the backbone encoder by integrating contextual information from time series and leverages a time series forecasting model to conduct neural Granger causal discovery. In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes. Extensive experiments conducted on synthetic and real-world microservice-based datasets demonstrate that RUN noticeably outperforms the state-of-the-art root cause analysis methods. Moreover, we provide an analysis scenario for the sock-shop case to showcase the practicality and efficacy of RUN in microservice-based applications. Our code is publicly available at https://github.com/zmlin1998/RUN.
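Ranking root causes with a personalization vector can be illustrated with off-the-shelf personalized PageRank: the random walk is biased toward nodes whose metrics are anomalous, and walking against the causal direction surfaces likely upstream causes. The snippet below is a generic sketch using networkx with a hypothetical toy graph; it is not RUN's implementation.

```python
import networkx as nx

def top_k_root_causes(edges, personalization, k=3):
    """Ranks candidate root causes on a discovered causal graph with
    personalized PageRank; 'personalization' biases the walk toward nodes
    whose metrics are anomalous (e.g., the alerted service)."""
    G = nx.DiGraph(edges).reverse()           # walk against the causal direction
    scores = nx.pagerank(G, alpha=0.85, personalization=personalization)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical example: a latency anomaly is observed at 'frontend'
edges = [("cpu_spike", "checkout"), ("checkout", "frontend"), ("db", "checkout")]
print(top_k_root_causes(edges,
                        {"frontend": 1.0, "checkout": 0.0, "cpu_spike": 0.0, "db": 0.0}))
```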



Paperid:25
Authors:Shengheng Liu, Xingkang Li, Zihuan Mao, Peng Liu, Yongming Huang
National Mobile Communications Research Laboratory, Southeast University, Nanjing, China Purple Mountain Laboratories, Nanjing, China, National Mobile Communications Research Laboratory, Southeast University, Nanjing, China, National Mobile Communications Research Laboratory, Southeast University, Nanjing, China Purple Mountain Laboratories, Nanjing, China, Purple Mountain Laboratories, Nanjing, China, National Mobile Communications Research Laboratory, Southeast University, Nanjing, China Purple Mountain Laboratories, Nanjing, China
Abstract:
High-accuracy positioning has become a fundamental enabler for intelligent connected devices. Nevertheless, the present wireless networks still rely on model-driven approaches to achieve positioning functionality, which are susceptible to performance degradation in practical scenarios, primarily due to hardware impairments. Integrating artificial intelligence into the positioning framework presents a promising solution to revolutionize the accuracy and robustness of location-based services. In this study, we address this challenge by reformulating the problem of angle-of-arrival (AoA) estimation into image reconstruction of spatial spectrum. To this end, we design a model-driven deep neural network (MoD-DNN), which can automatically calibrate the angular-dependent phase error. The proposed MoD-DNN approach employs an iterative optimization scheme between a convolutional neural network and a sparse conjugate gradient algorithm. Simulation and experimental results are presented to demonstrate the effectiveness of the proposed method in enhancing spectrum calibration and AoA estimation.



Paperid:26
Authors:Jesung Ryu, Seungyeon Rhyu, Hong-Gyu Yoon, Eunchong Kim, Ju Young Yang, Taehyun Kim
Pozalabs, Republic of Korea, Pozalabs, Republic of Korea, Pozalabs, Republic of Korea, Pozalabs, Republic of Korea, Duke University, United States, Pozalabs, Republic of Korea
Abstract:
One of the challenges in generating human-like music is articulating musical expressions such as dynamics, phrasing, and timbre, which are difficult for computational models to mimic. Previous efforts to tackle this problem have been insufficient due to a fundamental lack of data containing information about musical expressions. In this paper, we introduce MID-FiLD, a MIDI dataset for learning fine-level dynamics control. Notable properties of MID-FiLD are as follows: (1) All 4,422 MIDI samples are constructed by professional music writers with a strong understanding of composition and musical expression. (2) Each MIDI sample contains four different musical metadata and control change #1 (CC#1) value. We verify that our metadata is a key factor in MID-FiLD, exerting a substantial influence over produced CC#1 values. In addition, we demonstrate the applicability of MID-FiLD to deep learning models by suggesting a token-based encoding methodology and reveal the potential for generating controllable, human-like musical expressions.



Paperid:27
Authors:Rui She, Sijie Wang, Qiyu Kang, Kai Zhao, Yang Song, Wee Peng Tay, Tianyu Geng, Xingchao Jian
Nanyang Technological University, Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore
Abstract:
Point cloud registration is a crucial technique in 3D computer vision with a wide range of applications. However, this task can be challenging, particularly in large fields of view with dynamic objects, environmental noise, or other perturbations. To address this challenge, we propose a model called PosDiffNet. Our approach performs hierarchical registration based on window-level, patch-level, and point-level correspondence. We leverage a graph neural partial differential equation (PDE) based on Beltrami flow to obtain high-dimensional features and position embeddings for point clouds. We incorporate position embeddings into a Transformer module based on a neural ordinary differential equation (ODE) to efficiently represent patches within points. We employ the multi-level correspondence derived from the high feature similarity scores to facilitate alignment between point clouds. Subsequently, we use registration methods such as SVD-based algorithms to predict the transformation using corresponding point pairs. We evaluate PosDiffNet on several 3D point cloud datasets, verifying that it achieves state-of-the-art (SOTA) performance for point cloud registration in large fields of view with perturbations. The implementation code of experiments is available at https://github.com/AI-IT-AVs/PosDiffNet.



Paperid:28
Authors:Wenkang Su, Jiangqun Ni, Yiyan Sun
Sun Yat-Sen University Guangzhou University, Sun Yat-Sen University Peng Cheng Laboratory, Sun Yat-Sen University
Abstract:
The recent advances in generative image steganography have drawn increasing attention due to their potential for provable security and bulk embedding capacity. However, existing generative steganographic schemes are usually tailored to specific tasks and are hardly applicable to applications with practical constraints. To address this issue, this paper proposes a generic generative image steganography scheme called Steganography StyleGAN (StegaStyleGAN) that meets the practical objectives of security, capacity, and robustness within the same framework. In StegaStyleGAN, a novel Distribution-Preserving Secret Data Modulator (DP-SDM) is used to achieve provably secure generative image steganography by preserving the data distribution of the model inputs. Additionally, a generic and efficient Secret Data Extractor (SDE) is invented for accurate secret data extraction. By choosing whether to incorporate the Image Attack Simulator (IAS) during the training process, one can obtain two models with different parameters but the same structure (both generator and extractor) for lossless and lossy channel covert communication, namely StegaStyleGAN-Ls and StegaStyleGAN-Ly. Furthermore, by combining with GAN inversion, conditional generative steganography can be achieved as well. Experimental results demonstrate that, whether for lossless or lossy communication channels, the proposed StegaStyleGAN can significantly outperform the corresponding state-of-the-art schemes.



Paperid:29
Authors:Xiaorui Su, Pengwei Hu, Zhu-Hong You, Philip S. Yu, Lun Hu
Xinjiang Technical Institute of Physics and Chemistry, Urumqi, China University of Chinese Academy of Sciences, Beijing, China University of Illinois Chicago, Chicago, USA, Xinjiang Technical Institute of Physics and Chemistry, Urumqi, China University of Chinese Academy of Sciences, Beijing, China, Guangxi Key Lab of Human-machine Interaction and Intelligent Decision, Nanning, China, University of Illinois Chicago, Chicago, USA, Xinjiang Technical Institute of Physics and Chemistry, Urumqi, China University of Chinese Academy of Sciences, Beijing, China
Abstract:
Identifying novel drug-drug interactions (DDIs) is a crucial task in pharmacology, as the interference between pharmacological substances can pose serious medical risks. In recent years, several network-based techniques have emerged for predicting DDIs. However, they primarily focus on local structures within DDI-related networks, often overlooking the significance of indirect connections between pairwise drug nodes from a global perspective. Additionally, effectively handling heterogeneous information present in both biomedical knowledge graphs and drug molecular graphs remains a challenge for improved DDI prediction performance. To address these limitations, we propose a Transformer-based relatIon-aware Graph rEpresentation leaRning framework (TIGER) for DDI prediction. TIGER leverages the Transformer architecture to effectively exploit the structure of the heterogeneous graph, which allows direct learning of long-range dependencies and high-order structures. Furthermore, TIGER incorporates a relation-aware self-attention mechanism, capturing a diverse range of semantic relations that exist between pairs of nodes in the heterogeneous graph. In addition to these advancements, TIGER enhances predictive accuracy by modeling the DDI prediction task with a dual-channel network, where the drug molecular graph and the biomedical knowledge graph are fed into two respective channels. By incorporating embeddings obtained at the graph and node levels, TIGER can benefit from the structural properties of drugs as well as the rich contextual information provided by the biomedical knowledge graph. Extensive experiments conducted on three real-world datasets demonstrate the effectiveness of TIGER in DDI prediction. Furthermore, case studies highlight its ability to provide a deeper understanding of the underlying mechanisms of DDIs.



Paperid:30
Authors:Sally Turutov, Kira Radinsky
Technion - Israel Institute of Technology, Technion - Israel Institute of Technology
Abstract:
In drug development, molecular optimization is a crucial challenge that involves generating novel molecules given a lead molecule as input. The task requires maintaining molecular similarity to the original molecule while simultaneously optimizing multiple chemical attributes. To aid in this process, numerous generative models have been proposed. However, in practical applications, it is crucial for these models not only to generate novel molecules with the above constraints but also to generate molecules that significantly differ from any existing patented compounds. In this work, we present a multi-optimization molecular framework to address this challenge. Our framework trains a model to prioritize both enhanced properties and substantial dissimilarity from patented compounds. By jointly learning continuous representations of optimized and patentable molecules, we ensure that the generated molecules are significantly distant from any patented compounds while improving chemical properties. Through empirical evaluation, we demonstrate the superior performance of our approach compared to state-of-the-art molecular optimization methods both in chemical property optimization and patentability.



Paperid:31
Authors:Jiquan Wang, Sha Zhao, Haiteng Jiang, Shijian Li, Tao Li, Gang Pan
State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hangzhou, China College of Computer Science and Technology, Zhejiang University, Hangzhou, China, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hangzhou, China College of Computer Science and Technology, Zhejiang University, Hangzhou, China, Department of Neurobiology, Affiliated Mental Health Center & Hangzhou Seventh People's Hospital, Zhejiang University School of Medicine, Hangzhou, China MOE Frontier Science Center for Brain Science and Brain-machine Integration, Zhejiang University, Hangzhou, China State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hangzhou, China, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hangzhou, China College of Computer Science and Technology, Zhejiang University, Hangzhou, China, Department of Neurobiology, Affiliated Mental Health Center & Hangzhou Seventh People's Hospital, Zhejiang University School of Medicine, Hangzhou, China MOE Frontier Science Center for Brain Science and Brain-machine Integration, Zhejiang University, Hangzhou, China State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hangzhou, China, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, Hangzhou, China College of Computer Science and Technology, Zhejiang University, Hangzhou, China MOE Frontier Science Center for Brain Science and Brain-machine Integration, Zhejiang University, Hangzhou, China
Abstract:
Automatic sleep staging is essential for sleep assessment and disorder diagnosis. Most existing methods depend on one specific dataset, with training and testing data drawn from the same dataset, and are limited in their ability to generalize to other unseen datasets. In this paper, we introduce domain generalization into automatic sleep staging and propose the task of generalizable sleep staging, which aims to improve the model's generalization ability to unseen datasets. Inspired by existing domain generalization methods, we adopt the feature alignment idea and propose a framework called SleepDG to solve it. Considering that both local salient features and sequential features are important for sleep staging, we propose a Multi-level Feature Alignment combining epoch-level and sequence-level feature alignment to learn domain-invariant feature representations. Specifically, we design an Epoch-level Feature Alignment to align the feature distribution of each single sleep epoch among different domains, and a Sequence-level Feature Alignment to minimize the discrepancy of sequential features among different domains. SleepDG is validated on five public datasets, achieving state-of-the-art performance.
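The abstract does not specify the exact alignment objective, but a common way to align feature distributions across domains is to match their second-order statistics, as in CORAL. The sketch below is a generic, assumed illustration of such an epoch-level alignment loss in PyTorch, averaged over domain pairs; SleepDG's actual objective may differ.

```python
import torch

def coral_loss(source_feats, target_feats):
    """CORAL-style alignment: matches the covariances of feature
    distributions from two domains (features shaped (N, D))."""
    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return (x.t() @ x) / (x.size(0) - 1)
    d = source_feats.size(1)
    return ((covariance(source_feats) - covariance(target_feats)) ** 2).sum() / (4 * d * d)

def multi_domain_alignment(epoch_feats_per_domain):
    """Averages pairwise alignment losses over all domain pairs; the same
    pattern could be applied to sequence-level features."""
    domains = list(epoch_feats_per_domain)
    losses = [coral_loss(epoch_feats_per_domain[a], epoch_feats_per_domain[b])
              for i, a in enumerate(domains) for b in domains[i + 1:]]
    return torch.stack(losses).mean() if losses else torch.tensor(0.0)
```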



Paperid:32
Authors:Tong Wang, Yuan Yao, Feng Xu, Miao Xu, Shengwei An, Ting Wang
Nanjing University, Nanjing University, Nanjing University, University of Queensland, Purdue University, Stony Brook University
Abstract:
Backdoor attacks have been shown to be a serious security threat against deep learning models, and various defenses have been proposed to detect whether a model is backdoored or not. However, as indicated by a recent black-box attack, existing defenses can be easily bypassed by implanting the backdoor in the frequency domain. To this end, we propose a new defense, DTInspector, against black-box backdoor attacks, based on a new observation related to the prediction confidence of learning models. That is, to achieve a high attack success rate with a small amount of poisoned data, backdoor attacks usually render a model exhibiting statistically higher prediction confidences on the poisoned samples. We provide both theoretical and empirical evidence for the generality of this observation. DTInspector then carefully examines the prediction confidences of data samples, and decides on the existence of a backdoor using the shortcut nature of backdoor triggers. Extensive evaluations on six backdoor attacks, four datasets, and three advanced attacking types demonstrate the effectiveness of the proposed defense.



Paperid:33
Authors:Yingheng Wang, Shufeng Kong, John M. Gregoire, Carla P. Gomes
Department of Computer Science, Cornell University, USA, Department of Computer Science, Cornell University, USA School of Software Engineering, Sun Yat-sen University, China, Liquid Sunlight Alliance, California Institute of Technology, USA, Department of Computer Science, Cornell University, USA
Abstract:
Machine learning techniques, especially in the realm of materials design, hold immense promise for predicting the properties of crystal materials and aiding in the discovery of novel crystals with desirable traits. However, crystals possess unique geometric constraints—namely, E(3) invariance for the primitive cell and periodic invariance—which need to be accurately reflected in crystal representations. Though past research has explored various construction techniques to preserve periodic invariance in crystal representations, their robustness remains inadequate. Furthermore, effectively capturing angular information within 3D crystal structures continues to pose a significant challenge for graph-based approaches. This study introduces novel solutions to these challenges. We first present a graph construction method that robustly encodes periodic invariance and a strategy to capture angular information in neural networks without compromising efficiency. We further introduce CrystalFormer, a pioneering graph transformer architecture that emphasizes angle preservation and enhances long-range information. Through comprehensive evaluation, we verify our model's superior performance on 5 crystal prediction tasks, reaffirming the efficiency of our proposed methods.



Paperid:34
Authors:Yu Wang, Xiaoye Wang, Zaiwang Gu, Weide Liu, Wee Siong Ng, Weimin Huang, Jun Cheng
Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore, Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore Department of Mathematics, Harbin Institute of Technology, Weihai, China, Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore, Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore, Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore, Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore, Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore
Abstract:
Keypoint-based approaches have shown promise for retinal image registration, superimposing two or more images from different views based on keypoint detection and description. However, existing approaches suffer from ineffective keypoint detector and descriptor training. Meanwhile, the non-linear mapping from the 3D retinal structure to 2D images is often neglected. In this paper, we propose a novel learning-based junction detection approach for retinal image registration, which enhances both keypoint detector and descriptor training. To improve keypoint detection, it uses multi-task vessel detection to regularize the model training, which helps to learn more representative features and reduce the risk of over-fitting. To achieve effective training of the keypoint descriptors, a new constrained negative sampling approach is proposed to compute the descriptor loss. Moreover, we also consider the non-linearity between retinal images from different views during matching. Experimental results on the FIRE dataset show that our method achieves a mean area under the curve of 0.850, which is 12.6% higher than the 0.755 achieved by the state-of-the-art method. All the code is available at https://github.com/samjcheng/SuperJunction.



Paperid:35
Authors:Zilin Wang, Haolin Zhuang, Lu Li, Yinmin Zhang, Junjie Zhong, Jun Chen, Yu Yang, Boshi Tang, Zhiyong Wu
Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University, The University of Sydney, Waseda University, Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University
Abstract:
This paper presents an Exploratory 3D Dance generation framework, E3D2, designed to address the exploration capability deficiency in existing music-conditioned 3D dance generation models. Current models often generate monotonous and simplistic dance sequences that misalign with human preferences because they lack exploration capabilities. The E3D2 framework involves a reward model trained from automatically ranked dance demonstrations, which then guides the reinforcement learning process. This approach encourages the agent to explore and generate high-quality and diverse dance movement sequences. The soundness of the reward model is both theoretically and experimentally validated. Empirical experiments demonstrate the effectiveness of E3D2 on the AIST++ dataset.



Paperid:36
Authors:Lirong Wu, Yufei Huang, Cheng Tan, Zhangyang Gao, Bozhen Hu, Haitao Lin, Zicheng Liu, Stan Z. Li
Westlake University Zhejiang University, Westlake University Zhejiang University, Westlake University Zhejiang University, Westlake University Zhejiang University, Westlake University Zhejiang University, Westlake University Zhejiang University, Westlake University Zhejiang University, Westlake University
Abstract:
Compound-Protein Interaction (CPI) prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery. Existing deep learning-based methods utilize only the single modality of protein sequences or structures and lack co-modeling of the joint distribution of the two modalities, which may lead to significant performance drops in complex real-world scenarios due to various factors, e.g., modality missing and domain shifting. More importantly, these methods only model protein sequences and structures at a single fixed scale, neglecting more fine-grained multi-scale information, such as that embedded in key protein fragments. In this paper, we propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction (PSC-CPI), which captures the dependencies between protein sequences and structures through both intra-modality and cross-modality contrasting. We further apply length-variable protein augmentation to allow contrasting to be performed at different scales, from the amino acid level to the sequence level. Finally, in order to more fairly evaluate model generalizability, we split the test data into four settings based on whether compounds and proteins have been observed during the training stage. Extensive experiments have shown that PSC-CPI generalizes well in all four settings, particularly in the more challenging "Unseen-Both" setting, where neither compounds nor proteins have been observed during training. Furthermore, even when encountering a situation of modality missing, i.e., inference with only single-modality protein data, PSC-CPI still exhibits comparable or even better performance than previous approaches.
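A minimal sketch of the cross-modality contrasting idea, assuming an InfoNCE-style objective; the temperature value and the exact loss form are assumptions rather than the paper's specification.

import torch
import torch.nn.functional as F

def cross_modality_contrast(seq_emb, struct_emb, tau=0.1):
    # Pull together the sequence and structure embeddings of the same protein
    # and push apart mismatched pairs; tau and the InfoNCE form are assumed.
    seq = F.normalize(seq_emb, dim=-1)
    struct = F.normalize(struct_emb, dim=-1)
    logits = seq @ struct.t() / tau            # (N, N) similarity matrix
    targets = torch.arange(seq.size(0), device=seq.device)
    return F.cross_entropy(logits, targets)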



Paperid:37
Authors:Tailin Wu, Willie Neiswanger, Hongtao Zheng, Stefano Ermon, Jure Leskovec
Westlake University, University of Southern California, Westlake University, Stanford University, Stanford University
Abstract:
Deep learning-based surrogate models have demonstrated remarkable advantages over classical solvers in terms of speed, often achieving speedups of 10 to 1000 times over traditional partial differential equation (PDE) solvers. However, a significant challenge hindering their widespread adoption in both scientific and industrial domains is the lack of understanding about their prediction uncertainties, particularly in scenarios that involve critical decision making. To address this limitation, we propose a method that integrates efficient and precise uncertainty quantification into a deep learning-based surrogate model. Our method, termed Latent Evolution of PDEs with Uncertainty Quantification (LE-PDE-UQ), endows deep learning-based surrogate models with robust and efficient uncertainty quantification capabilities for both forward and inverse problems. LE-PDE-UQ leverages latent vectors within a latent space to evolve both the system's state and its corresponding uncertainty estimation. The latent vectors are decoded to provide predictions for the system's state as well as estimates of its uncertainty. In extensive experiments, we demonstrate the accurate uncertainty quantification performance of our approach, surpassing that of strong baselines including deep ensembles, Bayesian neural network layers, and dropout. Our method excels at propagating uncertainty over extended auto-regressive rollouts, making it suitable for scenarios involving long-term predictions. Our code is available at: https://github.com/AI4Science-WestlakeU/le-pde-uq.
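A toy sketch of the latent-evolution idea described above: a latent vector is advanced one step in time and decoded into both a state prediction and an uncertainty estimate. All module names, layer sizes, and the log-variance parameterization are hypothetical.

import torch
import torch.nn as nn

class LatentEvolverUQ(nn.Module):
    # Hypothetical minimal model: evolve a latent vector and decode it into a
    # predicted state plus a per-dimension log-variance as the uncertainty.
    def __init__(self, latent_dim=64, state_dim=128):
        super().__init__()
        self.evolve = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.GELU(),
                                    nn.Linear(latent_dim, latent_dim))
        self.decode_state = nn.Linear(latent_dim, state_dim)
        self.decode_logvar = nn.Linear(latent_dim, state_dim)

    def forward(self, z):
        z_next = self.evolve(z)
        return z_next, self.decode_state(z_next), self.decode_logvar(z_next)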



Paperid:38
Authors:Zhousan Xie, Shikui Tu, Lei Xu
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University Guangdong Institute of Intelligence Science and Technology
Abstract:
Prediction of drug-target interactions (DTIs) is a crucial step in drug discovery, and deep learning methods have shown great promise on various DTI datasets. However, existing approaches still face several challenges, including limited labeled data, hidden bias issues, and a lack of generalization ability to out-of-domain data. These challenges hinder the model's capacity to learn truly informative interaction features, leading to shortcut learning and inferior predictive performance on novel drug-target pairs. To address these issues, we propose MlanDTI, a semi-supervised domain adaptive multilevel attention network (Mlan) for DTI prediction. We utilize two pre-trained BERT models to acquire bidirectional representations enriched with information from unlabeled data. Then, we introduce a multilevel attention mechanism, enabling the model to learn domain-invariant DTIs at different hierarchical levels. Moreover, we present a simple yet effective semi-supervised pseudo-labeling method to further enhance our model's predictive ability in cross-domain scenarios. Experiments on four datasets show that MlanDTI achieves state-of-the-art performance over other methods under intra-domain settings and outperforms all other approaches under cross-domain settings. The source code is available at https://github.com/CMACH508/MlanDTI.
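The semi-supervised pseudo-labeling step might look like the following confidence-threshold sketch; the threshold and the selection rule are assumptions used only for illustration.

import torch
import torch.nn.functional as F

def make_pseudo_labels(logits, threshold=0.9):
    # Keep only unlabeled drug-target pairs whose predicted class probability
    # exceeds a confidence threshold; 0.9 is an assumed hyperparameter.
    probs = F.softmax(logits, dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= threshold
    return labels[keep], keep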



Paperid:39
Authors:Can Xu, Haosen Wang, Weigang Wang, Pengfei Zheng, Hongyang Chen
Zhejiang Gongshang University Zhejiang Lab, Southeast University Zhejiang Lab, Zhejiang Gongshang University, Zhejiang Lab, Zhejiang Lab
Abstract:
Denoising diffusion models have shown great potential in multiple research areas. Existing diffusion-based generative methods for de novo 3D molecule generation face two major challenges. Since the majority of heavy atoms in molecules allow connections to multiple atoms through single bonds, solely using pair-wise distances to model molecule geometries is insufficient. Therefore, the first challenge is to propose an effective neural network as the denoising kernel that is capable of capturing complex multi-body interatomic relationships and learning high-quality features. Due to the discrete nature of graphs, mainstream diffusion-based methods for molecules heavily rely on predefined rules and generate edges in an indirect manner. The second challenge is to accommodate molecule generation to diffusion and accurately predict the existence of bonds. In our research, we view the iterative updating of molecule conformations in the diffusion process as consistent with molecular dynamics and introduce a novel molecule generation method named Geometric-Facilitated Molecular Diffusion (GFMDiff). For the first challenge, we introduce a Dual-track Transformer Network (DTN) to fully excavate global spatial relationships and learn high-quality representations, which contribute to accurate predictions of features and geometries. As for the second challenge, we design a Geometric-facilitated Loss (GFLoss) that intervenes in the formation of bonds during training, instead of directly embedding edges into the latent space. Comprehensive experiments on current benchmarks demonstrate the superiority of GFMDiff.



Paperid:40
Authors:Shu Yin, Peican Zhu, Lianwei Wu, Chao Gao, Zhen Wang
Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University
Abstract:
With the rise of social media, the spread of fake news has become a significant concern, potentially misleading public perceptions and impacting social stability. Deep learning methods such as CNNs, RNNs, and Transformer-based models like BERT have enhanced fake news detection; however, they primarily focus on content and do not consider the social context during news propagation. Graph-based techniques have incorporated the social context but are limited by the need for large labeled datasets. To address these challenges, this paper introduces GAMC, an unsupervised fake news detection technique using a Graph Autoencoder with Masking and Contrastive learning. By leveraging both the context and content of news propagation as self-supervised signals, our method reduces the dependency on labeled datasets. Specifically, GAMC begins by applying data augmentation to the original news propagation graphs. These augmented graphs are then encoded with a graph encoder and reconstructed via a graph decoder. Finally, a composite loss function that encompasses both reconstruction error and contrastive loss is designed. First, it ensures the model can effectively capture the latent features by minimizing the discrepancy between reconstructed and original graph representations. Second, it aligns the representations of augmented graphs that originate from the same source. Experiments on the real-world dataset validate the effectiveness of our method.
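A hedged sketch of the composite objective, combining a reconstruction error with a contrastive term between two augmented views; the MSE and cosine-similarity choices and the weighting are assumptions, not the paper's exact formulation.

import torch.nn.functional as F

def gamc_style_loss(recon_graph, orig_graph, z_view1, z_view2, alpha=1.0):
    # Reconstruction error between decoded and original graph representations,
    # plus a term aligning the embeddings of two augmented views of the same
    # propagation graph; alpha is an assumed weighting.
    recon_term = F.mse_loss(recon_graph, orig_graph)
    contrast_term = 1.0 - F.cosine_similarity(z_view1, z_view2, dim=-1).mean()
    return recon_term + alpha * contrast_term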



Paperid:41
Authors:Jixiang Yu, Nanjun Chen, Ming Gao, Xiangtao Li, Ka-Chun Wong
Department of Computer Science, City University of Hong Kong, Department of Computer Science, City University of Hong Kong, School of Management Science and Engineering, Key Laboratory of Big Data Management Optimization and Decision of Liaoning Province, Dongbei University of Finance and Economics Center for Post-doctoral Studies of Computer Science, Northeastern University, School of Artificial Intelligence, Jilin University, Department of Computer Science, City University of Hong Kong Shenzhen Research Institute, City University of Hong Kong, Shenzhen, China Hong Kong Institute for Data Science, City University of Hong Kong, Hong Kong SAR
Abstract:
Cell type identification plays a vital role in single-cell RNA sequencing (scRNA-seq) data analysis. Although many deep embedded methods for clustering scRNA-seq data have been proposed, they still fail to elucidate the intrinsic properties of cells and genes. Here, we present a novel end-to-end deep graph clustering model for single-cell transcriptomics data based on unsupervised Gene-Cell Collective representation learning and Optimal Transport (scGCOT), which integrates both cell and gene correlations. Specifically, scGCOT learns the latent embeddings of cells and genes simultaneously and reconstructs the cell graph, the gene graph, and the gene expression count matrix. A zero-inflated negative binomial (ZINB) model is estimated via the reconstructed count matrix to capture the essential properties of scRNA-seq data. By leveraging the optimal transport-based joint representation alignment, scGCOT learns the clustering process and the latent representations through a mutually supervised self-optimization strategy. Extensive experiments with 14 competing methods on 15 real scRNA-seq datasets demonstrate the competitive edges of scGCOT.
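For reference, the zero-inflated negative binomial model commonly used for scRNA-seq counts takes the following standard form; the paper's exact parameterization may differ:

P(x \mid \pi, \mu, \theta) = \pi\,\delta_{0}(x) + (1-\pi)\,\frac{\Gamma(x+\theta)}{x!\,\Gamma(\theta)}\left(\frac{\theta}{\theta+\mu}\right)^{\theta}\left(\frac{\mu}{\theta+\mu}\right)^{x},

where \pi is the dropout (zero-inflation) probability, \mu the mean, and \theta the dispersion of the negative binomial component.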



Paperid:42
Authors:Shuai Yu
Donghua University, Shanghai, China
Abstract:
Singing melody extraction is an important task in the field of music information retrieval (MIR). The development of data-driven models for this task has achieved great success. However, existing models have two major limitations: first, most existing singing melody extraction models formulate this task as a pixel-level prediction task, and the lack of labeled data limits further improvement; second, the generalization of existing models is prone to being disturbed by music genre. To address the issues mentioned above, in this paper we propose a multi-task contrastive learning framework for semi-supervised singing melody extraction, termed MCSSME. Specifically, to deal with the data scarcity limitation, we propose a self-consistency regularization (SCR) method to train the model on unlabeled data. Transformations are applied to the raw signal of polyphonic music, and the network improves its representation capability by recognizing these transformations. We further propose a novel multi-task learning (MTL) approach to jointly learn singing melody extraction and classification of the transformed data. To deal with the generalization limitation, we also propose contrastive embedding learning, which strengthens intra-class compactness and inter-class separability. To improve generalization across music genres, we also propose a domain classification method that learns task-dependent features by mapping data from different music genres to a shared subspace. MCSSME is evaluated on a set of well-known public melody extraction datasets with promising performance. The experimental results demonstrate the effectiveness of the MCSSME framework for singing melody extraction from polyphonic music in scenarios with very limited labeled data.



Paperid:43
Authors:Yemin Yu, Luotian Yuan, Ying Wei, Hanyu Gao, Fei Wu, Zhihua Wang, Xinhai Ye
City University of Hong Kong Shanghai Institute for Advanced Study of Zhejiang University, China, Zhejiang University, China, Nanyang Technological University, Hong Kong University of Science and Technology, Shanghai Institute for Advanced Study of Zhejiang University, China Zhejiang University, China, Shanghai Institute for Advanced Study of Zhejiang University, China, Zhejiang University
Abstract:
Machine learning-assisted retrosynthesis prediction models have been gaining widespread adoption, though their performance often degrades significantly when deployed in real-world applications involving out-of-distribution (OOD) molecules or reactions. Despite steady progress on standard benchmarks, our understanding of existing retrosynthesis prediction models under the premise of distribution shifts remains stagnant. To this end, we first formally sort out two types of distribution shifts in retrosynthesis prediction and construct two groups of benchmark datasets. Next, through comprehensive experiments, we systematically compare state-of-the-art retrosynthesis prediction models on the two groups of benchmarks, revealing the limitations of previous in-distribution evaluation and re-examining the advantages of each model. More remarkably, we are motivated by the above empirical insights to propose two model-agnostic techniques that can improve the OOD generalization of arbitrary off-the-shelf retrosynthesis prediction algorithms. Our preliminary experiments show their high potential with an average performance improvement of 4.6%, and the established benchmarks serve as a foothold for further retrosynthesis prediction research towards OOD generalization.



Paperid:44
Authors:Xi Zeng, Xiaotian Hao, Hongyao Tang, Zhentao Tang, Shaoqing Jiao, Dazhi Lu, Jiajie Peng
School of Computer Science, Northwestern Polytechnical University, College of Intelligence and Computing, Tianjin University, College of Intelligence and Computing, Tianjin University, Noah’s Ark Lab, Huawei, School of Computer Science, Northwestern Polytechnical University, School of Computer Science, Northwestern Polytechnical University, School of Computer Science, Northwestern Polytechnical University Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology School of Computer Science, Research and Development Institute of Northwestern Polytechnical University in Shenzhen
Abstract:
Designing novel biological sequences with desired properties is a significant challenge in biological science because of the extremely large search space. The traditional design process usually involves multiple rounds of costly wet-lab evaluations. To reduce the need for expensive wet-lab experiments, machine learning methods are used to aid in designing biological sequences. However, the limited availability of biological sequences with known properties hinders the training of machine learning models, significantly restricting their applicability and performance. To fill this gap, we present ERLBioSeq, an Evolutionary Reinforcement Learning algorithm for BIOlogical SEQuence design. ERLBioSeq leverages the capability of reinforcement learning to learn without prior knowledge and the potential of evolutionary algorithms to enhance the exploration of reinforcement learning in the large search space of biological sequences. Additionally, to enhance the efficiency of biological sequence design, we developed a predictor for sequence screening in the biological sequence design process, which incorporates both local and global sequence information. We evaluated the proposed method on three main types of biological sequence design tasks, including the design of DNA, RNA, and proteins. The results demonstrate that the proposed method achieves significant improvement compared to the existing state-of-the-art methods.



Paperid:45
Authors:Xianghua Zeng, Hao Peng, Angsheng Li
Beihang University, Beijing, China, Beihang University, Beijing, China, Beihang University Zhongguancun Laboratory, Beijing, China
Abstract:
The importance of effective detection is underscored by the fact that socialbots imitate human behavior to propagate misinformation, leading to an ongoing competition between socialbots and detectors. Despite the rapid advancement of reactive detectors, the exploration of adversarial socialbot modeling remains incomplete, significantly hindering the development of proactive detectors. To address this issue, we propose a mathematical Structural Information principles-based Adversarial Socialbots Modeling framework, namely SIASM, to enable more accurate and effective modeling of adversarial behaviors. First, a heterogeneous graph is presented to integrate various users and rich activities in the original social network, and its dynamic uncertainty is measured as structural entropy. By minimizing the high-dimensional structural entropy, a hierarchical community structure of the social network is generated and referred to as the optimal encoding tree. Second, a novel method is designed to quantify influence by utilizing the assigned structural entropy, which helps reduce the computational cost of SIASM by filtering out uninfluential users. In addition, a new conditional structural entropy is defined between the socialbot and other users to guide follower selection for network influence maximization. Extensive and comparative experiments on both homogeneous and heterogeneous social networks demonstrate that, compared with state-of-the-art baselines, the proposed SIASM framework yields substantial performance improvements in terms of network influence (up to 16.32%) and sustainable stealthiness (up to 16.29%) when evaluated against a robust detector with 90% accuracy.
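For context, the structural entropy of a graph G under an encoding tree T is commonly written as follows (standard structural-information notation; the paper's exact variant may differ):

H^{T}(G) = -\sum_{\alpha \in T,\ \alpha \neq \lambda} \frac{g_{\alpha}}{\mathrm{vol}(G)} \log_{2} \frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^{-})},

where \lambda is the root of T, g_{\alpha} counts the edges with exactly one endpoint in the vertex set of \alpha, \mathrm{vol}(\cdot) sums vertex degrees, and \alpha^{-} is the parent of \alpha.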



Paperid:46
Authors:Hongbo Zhang, Guang Wang, Xu Wang, Zhengyang Zhou, Chen Zhang, Zheng Dong, Yang Wang
University of Science and Technology of China, Florida State University, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, Wayne State University, University of Science and Technology of China
Abstract:
One of the most important tasks in ride-hailing is order dispatching, i.e., assigning unserved orders to available drivers. Recent order dispatching has achieved significant improvement due to advances in reinforcement learning, which has been proven to effectively address sequential decision-making problems like order dispatching. However, most existing reinforcement learning methods require agents to learn the optimal policy by interacting with environments online, which is challenging or impractical for real-world deployment due to high costs or safety concerns. For example, due to the spatiotemporally unbalanced supply and demand, online reinforcement learning-based order dispatching may significantly impact the revenue of the ride-hailing platform and the passenger experience during the policy learning period. Hence, in this work, we develop an offline deep reinforcement learning framework called NondBREM for large-scale order dispatching, which learns a policy from only the accumulated logged data to avoid costly and unsafe interactions with the environment. In NondBREM, a Nondeterministic Batch-Constrained Q-learning (NondBCQ) module is developed to reduce the algorithm's extrapolation error, and a Random Ensemble Mixture (REM) module that integrates multiple value networks with multi-head networks is utilized to improve model generalization and robustness. Extensive experiments on large-scale real-world ride-hailing datasets show the superiority of our design.



Paperid:47
Authors:Jialu Zhang, Xiaoying Yang, Wentao He, Jianfeng Ren, Qian Zhang, Yitian Zhao, Ruibin Bai, Xiangjian He, Jiang Liu
The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China Cixi Institute of Biomedical Engineering, Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China, The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China, The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China Nottingham Ningbo China Beacons of Excellence Research and Innovation Institute, University of Nottingham Ningbo China, The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China Nottingham Ningbo China Beacons of Excellence Research and Innovation Institute, University of Nottingham Ningbo China, Cixi Institute of Biomedical Engineering, Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China Nottingham Ningbo China Beacons of Excellence Research and Innovation Institute, University of Nottingham Ningbo China, The Digital Port Technologies Lab, School of Computer Science, University of Nottingham Ningbo China Nottingham Ningbo China Beacons of Excellence Research and Innovation Institute, University of Nottingham Ningbo China, Cixi Institute of Biomedical Engineering, Ningbo Institute of Industrial Technology, Chinese Academy of Sciences Department of Computer Science and Engineering, Southern University of Science and Technology
Abstract:
Object detection in aerial imagery presents a significant challenge due to large scale variations among objects. This paper proposes an evolutionary reinforcement learning agent, integrated within a coarse-to-fine object detection framework, to optimize the scale for more effective detection of objects in such images. Specifically, a set of patches potentially containing objects are first generated. A set of rewards measuring the localization accuracy, the accuracy of predicted labels, and the scale consistency among nearby patches are designed in the agent to guide the scale optimization. The proposed scale-consistency reward ensures similar scales for neighboring objects of the same category. Furthermore, a spatial-semantic attention mechanism is designed to exploit the spatial semantic relations between patches. The agent employs the proximal policy optimization strategy in conjunction with the evolutionary strategy, effectively utilizing both the current patch status and historical experience embedded in the agent. The proposed model is compared with state-of-the-art methods on two benchmark datasets for object detection on drone imagery. It significantly outperforms all the compared methods. Code is available at https://github.com/UNNC-CV/EvOD/.



Paperid:48
Authors:Rui-Xiao Zhang, Tianchi Huang
The University of Hong Kong, Sony Group Corporation
Abstract:
Learning-based adaptive bitrate (ABR) algorithms have revolutionized video streaming solutions. With the growing demand for data privacy and the rapid development of mobile devices, federated learning (FL) has emerged as a popular training method for neural ABR algorithms in both academia and industry. However, we have discovered that FL-based ABR models are vulnerable to model-poisoning attacks as local updates remain unseen during global aggregation. In response, we propose MAFL (Malicious ABR model based on Federated Learning) to prove that backdooring the learning-based ABR model via FL is practical. Instead of attacking the global policy, MAFL only targets a single "target client". Moreover, the unique challenges brought by deep reinforcement learning (DRL) make the attack even more challenging. To address these challenges, MAFL is designed with a two-stage attacking mechanism. Using two representative attack cases with real-world traces, we show that MAFL significantly degrades the model performance on the target client (i.e., increasing rebuffering penalty by 2x and 5x) with a minimal negative impact on benign clients.



Paperid:49
Authors:Jian Zhu, Congcong Liu, Xue Jiang, Changping Peng, Zhangang Lin, Jingping Shao
JD.com, JD.com, JD.com, JD.com, JD.com, JD.com
Abstract:
Deep neural networks (DNNs) have achieved significant advancements in click-through rate (CTR) prediction by demonstrating strong generalization on training data. However, in real-world scenarios, the assumption of independent and identically distributed (i.i.d.) conditions, which is fundamental to this problem, is often violated due to temporal distribution shifts. This violation can lead to suboptimal model performance when optimizing empirical risk without access to future data, resulting in overfitting on the training data and convergence to a single sharp minimum. To address this challenge, we propose a novel model updating framework called Slow and Fast Trajectory Learning (SFTL) network. SFTL aims to mitigate the discrepancy between past and future domains while quickly adapting to recent changes in small temporal drifts. This mechanism entails two interactions among three complementary learners: (i) the Working Learner, which updates model parameters using modern optimizers (e.g., Adam, Adagrad) and serves as the primary learner in the recommendation system, (ii) the Slow Learner, which is updated in each temporal domain by directly assigning the model weights of the working learner, and (iii) the Fast Learner, which is updated in each iteration by assigning exponentially moving average weights of the working learner. Additionally, we propose a novel rank-based trajectory loss to facilitate interaction between the working learner and trajectory learner, aiming to adapt to temporal drift and enhance performance in the current domain compared to the past. We provide theoretical understanding and conduct extensive experiments on real-world CTR prediction datasets to validate the effectiveness and efficiency of SFTL in terms of both convergence speed and model performance. The results demonstrate the superiority of SFTL over existing approaches.
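The fast learner described above is an exponential moving average of the working learner's weights; a minimal sketch is given below, where the decay rate is an assumed value rather than the paper's setting (the slow learner, by contrast, simply copies the working learner's weights once per temporal domain).

import torch

@torch.no_grad()
def update_fast_learner(fast_model, working_model, decay=0.999):
    # Exponential-moving-average update of the fast learner from the working
    # learner, applied at each iteration; decay=0.999 is an assumed value.
    for f, w in zip(fast_model.parameters(), working_model.parameters()):
        f.mul_(decay).add_(w, alpha=1.0 - decay)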



Paperid:50
Authors:Yuqi Zhu, Jia Li, Ge Li, YunFei Zhao, Jia Li, Zhi Jin, Hong Mei
Academy of Military Sciences, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China Advanced Institute of Big Data, Beijing, China
Abstract:
Recently, Large Language Models (LLMs) have shown impressive abilities in code generation. However, existing LLMs' decoding strategies are designed for Natural Language (NL) generation, overlooking the differences between NL and programming languages (PL). Due to this oversight, a better decoding strategy for code generation remains an open question. In this paper, we conduct the first systematic study to explore a decoding strategy specialized for code generation. With an analysis of the loss distributions of code tokens, we find that code tokens can be divided into two categories: challenging tokens that are difficult to predict and confident tokens that can be easily inferred. Among them, the challenging tokens mainly appear at the beginning of a code block. Inspired by the above findings, we propose a simple yet effective method: Adaptive Temperature (AdapT) sampling, which dynamically adjusts the temperature coefficient when decoding different tokens. We apply a larger temperature when sampling challenging tokens, allowing LLMs to explore diverse choices, and a smaller temperature for confident tokens, avoiding the influence of tail randomness noise. We apply AdapT sampling to LLMs of different sizes and conduct evaluations on two popular datasets. Results show that AdapT sampling significantly outperforms state-of-the-art decoding strategies.
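A minimal sketch of the adaptive-temperature idea: use a higher temperature for tokens flagged as challenging (e.g., at the start of a code block) and a lower one for confident tokens. The flagging rule and both temperature values are assumptions, not the paper's settings.

import torch
import torch.nn.functional as F

def adapt_sample(logits, is_challenging, t_high=1.2, t_low=0.6):
    # Sample one token with a per-position temperature: higher for challenging
    # tokens (more exploration), lower for confident tokens (less tail noise).
    temperature = t_high if is_challenging else t_low
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)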



Paperid:51
Authors:Paul M. Bodily, Dan Ventura
Idaho State University, Brigham Young University
Abstract:
We address the problem of building and evaluating a computational system whose primary objective is creativity. We illustrate seven characteristics for computational creativity in the context of a system that autonomously composes Western lyrical music. We conduct an external evaluation of the system in which respondents rated the system with regard to each characteristic as well as with regard to overall creativity. Average scores for overall creativity exceeded the ratings for any single characteristic, suggesting that creativity may be an emergent property and that unique research opportunities exist for building CC systems whose design attempts to comprehend all known characteristics of creativity.



Paperid:52
Authors:Matteo Bortoletto, Lei Shi, Andreas Bulling
University of Stuttgart, University of Stuttgart, University of Stuttgart
Abstract:
We propose the Intuitive Reasoning Network (IRENE), a novel neural model for intuitive psychological reasoning about agents' goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task context. When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks, with up to 48.9% improvement. In contrast to existing methods, IRENE is able to bind preferences to specific agents, to better distinguish between rational and irrational agents, and to better understand the role of blocking obstacles. We also investigate, for the first time, the influence of the training tasks on test performance. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks.



Paperid:53
Authors:Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, Min Zhang
Soochow University, Soochow University, Soochow University, Wuhan University, Harbin Institute of Technology, Shenzhen
Abstract:
Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions. Recently, Contrastive Language-Image Pre-training (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably well on various cross-modal downstream tasks due to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, is also seeing a rise in CLIP-based research. In order to explore the potential of the visual-language pre-training model for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contributes a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss functions. The model, with the aforementioned designs and practical training tricks, can attain satisfactory performance without any sophisticated modules. We also conduct probing experiments of TBPS-CLIP in model generalization and model compression, demonstrating the effectiveness of TBPS-CLIP from various aspects. This work is expected to provide empirical insights and highlight future CLIP-based TBPS research.



Paperid:54
Authors:Hongyi Chen, Jingtao Ding, Yong Li, Yue Wang, Xiao-Ping Zhang
Tsinghua University Peng Cheng Laboratory, Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University
Abstract:
Crowd simulation holds crucial applications in various domains, such as urban planning, architectural design, and traffic arrangement. In recent years, physics-informed machine learning methods have achieved state-of-the-art performance in crowd simulation but fail to model the heterogeneity and multi-modality of human movement comprehensively. In this paper, we propose a social physics-informed diffusion model named SPDiff to mitigate the above gap. SPDiff takes both the interactive and historical information of crowds in the current timeframe to reverse the diffusion process, thereby generating the distribution of pedestrian movement in the subsequent timeframe. Inspired by the well-known social physics model, i.e., Social Force, regarding crowd dynamics, we design a crowd interaction encoder to guide the denoising process and further enhance this module with the equivariant properties of crowd interactions. To mitigate error accumulation in long-term simulations, we propose a multi-frame rollout training algorithm for diffusion modeling. Experiments conducted on two real-world datasets demonstrate the superior performance of SPDiff in terms of both macroscopic and microscopic evaluation metrics. Code and appendix are available at https://github.com/tsinghua-fib-lab/SPDiff.



Paperid:55
Authors:Yingjie Chen, Jiarui Zhang, Tao Wang, Yun Liang
Peking University, Peking University, Peking University, Peking University
Abstract:
With the increasing need for facial behavior analysis, semi-supervised AU intensity estimation using only keyframe annotations has emerged as a practical and effective solution to relieve the burden of annotation. However, the lack of annotations makes the spurious correlation problem caused by AU co-occurrences and subject variation much more prominent, leading to non-robust intensity estimation that is entangled among AUs and biased among subjects. We observe that trend information inherent in keyframe annotations can act as extra supervision, and that raising awareness of AU-specific facial appearance changing trends during training is key to learning invariant AU-specific features. To this end, we propose Trend-Aware Supervision (TAS), which pursues three kinds of trend awareness, including intra-trend ranking awareness, intra-trend speed awareness, and inter-trend subject awareness. TAS alleviates the spurious correlation problem by raising trend awareness during training to learn AU-specific features that represent the corresponding facial appearance changes, achieving invariance in intensity estimation. Experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of each kind of awareness. Moreover, under trend-aware supervision, performance can be improved without extra computational or storage costs during inference.



Paperid:56
Authors:Jianhao Ding, Zhaofei Yu, Tiejun Huang, Jian K. Liu
Peking University, Peking University, Peking University, University of Birmingham
Abstract:
Spiking neural networks (SNNs) exploit neural spikes to provide solutions for low-power intelligent applications on neuromorphic hardware. Although SNNs have high computational efficiency due to spiking communication, they still lack resistance to adversarial attacks and noise perturbations. In the brain, neuronal responses generally possess stochasticity induced by ion channels and synapses, while the role of stochasticity in computing tasks is poorly understood. Inspired by this, we elaborate a stochastic gating spiking neural model for layer-by-layer spike communication, introducing stochasticity to SNNs. Through theoretical analysis, our gating model can be viewed as a regularizer that prevents error amplification under attacks. Meanwhile, our work can explain the robustness of Poisson coding. Experimental results prove that our method can be used alone or with existing robust enhancement algorithms to improve SNN robustness and reduce SNN energy consumption. We hope our work will shed new light on the role of stochasticity in the computation of SNNs. Our code is available at https://github.com/DingJianhao/StoG-meets-SNN/.
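One way to picture the stochastic gating of layer-by-layer spike communication is a Bernoulli gate on each spike; the fixed gating probability and its placement here are assumptions used only for illustration, since the actual model derives its stochasticity from a learned gating mechanism.

import torch

def stochastic_gate(spikes, keep_prob=0.8):
    # Randomly gate binary spikes before passing them to the next layer;
    # keep_prob is an assumed constant, not the paper's formulation.
    mask = torch.bernoulli(torch.full_like(spikes, keep_prob))
    return spikes * mask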



Paperid:57
Authors:Hen Emuna, Nadav Borenstein, Xin Qian, Hyeonsu Kang, Joel Chan, Aniket Kittur, Dafna Shahaf
The Hebrew University of Jerusalem, University of Copenhagen, University of Maryland, Carnegie Mellon University, University of Maryland, Carnegie Mellon University, The Hebrew University of Jerusalem
Abstract:
Biologically Inspired Design (BID), or Biomimicry, is a problem-solving methodology that applies analogies from nature to solve engineering challenges. For example, Speedo engineers designed swimsuits based on shark skin. Finding relevant biological solutions for real-world problems poses significant challenges, both due to the limited biological knowledge engineers and designers typically possess and to the limited BID resources. Existing BID datasets are hand-curated and small, and scaling them up requires costly human annotations. In this paper, we introduce BARcode (Biological Analogy Retriever), a search engine for automatically mining bio-inspirations from the web at scale. Using advances in natural language understanding and data programming, BARcode identifies potential inspirations for engineering challenges. Our experiments demonstrate that BARcode can retrieve inspirations that are valuable to engineers and designers tackling real-world problems, as well as recover famous historical BID examples. We release data and code; we view BARcode as a step towards addressing the challenges that have historically hindered the practical application of BID to engineering innovation.



Paperid:58
Authors:Xiang He, Dongcheng Zhao, Yang Li, Guobin Shen, Qingqun Kong, Yi Zeng
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China, Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Future Technology, University of Chinese Academy of Sciences, Beijing, China, Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China School of Future Technology, University of Chinese Academy of Sciences, Beijing, China, Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China School of Future Technology, University of Chinese Academy of Sciences, Beijing, China Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China
Abstract:
Spiking neural networks (SNNs) are rich in spatiotemporal dynamics and are suitable for processing event-based neuromorphic data. However, event-based datasets are usually less annotated than static datasets. This small data scale makes SNNs prone to overfitting and limits their performance. In order to improve the generalization ability of SNNs on event-based datasets, we use static images to assist SNN training on event data. In this paper, we first discuss the domain mismatch problem encountered when directly transferring networks trained on static datasets to event data. We argue that the inconsistency of feature distributions becomes a major factor hindering the effective transfer of knowledge from static images to event data. To address this problem, we propose solutions in terms of two aspects: feature distribution and training strategy. Firstly, we propose a knowledge transfer loss, which consists of domain alignment loss and spatio-temporal regularization. The domain alignment loss learns domain-invariant spatial features by reducing the marginal distribution distance between the static image and the event data. Spatio-temporal regularization provides dynamically learnable coefficients for domain alignment loss by using the output features of the event data at each time step as a regularization term. In addition, we propose a sliding training strategy, which gradually replaces static image inputs probabilistically with event data, resulting in a smoother and more stable training for the network. We validate our method on neuromorphic datasets, including N-Caltech101, CEP-DVS, and N-Omniglot. The experimental results show that our proposed method achieves better performance on all datasets compared to the current state-of-the-art methods. Code is available at https://github.com/Brain-Cog-Lab/Transfer-for-DVS.
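The domain alignment term can be illustrated with a kernel MMD between static-image features and event-data features; the RBF kernel and bandwidth below are assumed choices, not necessarily those used in the paper.

import torch

def rbf_mmd(x, y, sigma=1.0):
    # Squared MMD between two feature batches with an RBF kernel, used here
    # only to illustrate shrinking the marginal distribution distance between
    # static-image features x and event-data features y.
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()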



Paperid:59
Authors:Zhejing Hu, Yan Liu, Gong Chen, Xiao Ma, Shenghua Zhong, Qianwen Luo
The Hong Kong Polytechnic University, The Hong Kong Polytechnic University, The Hong Kong Polytechnic University, The Hong Kong Polytechnic University, Shenzhen University, The Hong Kong Polytechnic University
Abstract:
Call-and-response is a musical technique that enriches the creativity of music, crafting coherent musical ideas that mirror the back-and-forth nature of human dialogue with distinct musical characteristics. Although this technique is integral to numerous musical compositions, it remains largely uncharted in automatic music composition. To enhance the creativity of machine-composed music, we first introduce the Call-Response Dataset (CRD), containing 19,155 annotated musical pairs, and craft comprehensive objective evaluation metrics for musical assessment. Then, we design a knowledge-enhanced learning-based method to bridge the gap between human and machine creativity. Specifically, we train the composition module using the call-response pairs, supplementing it with musical knowledge in terms of rhythm, melody, and harmony. Our experimental results underscore that our proposed model adeptly produces a wide variety of creative responses for various musical calls.



Paperid:60
Authors:Kunal Jha, Tuan Anh Le, Chuanyang Jin, Yen-Ling Kuo, Joshua B. Tenenbaum, Tianmin Shu
Dartmouth College, Google Research, New York University, University of Virginia, Massachusetts Institute of Technology, Massachusetts Institute of Technology Johns Hopkins University
Abstract:
Multi-agent interactions, such as communication, teaching, and bluffing, often rely on higher-order social inference, i.e., understanding how others infer oneself. Such intricate reasoning can be effectively modeled through nested multi-agent reasoning. Nonetheless, the computational complexity escalates exponentially with each level of reasoning, posing a significant challenge. However, humans effortlessly perform complex social inferences as part of their daily lives. To bridge the gap between human-like inference capabilities and computational limitations, we propose a novel approach: leveraging neural networks to amortize high-order social inference, thereby expediting nested multi-agent reasoning. We evaluate our method in two challenging multi-agent interaction domains. The experimental results demonstrate that our method is computationally efficient while exhibiting minimal degradation in accuracy.



Paperid:61
Authors:Shu Li, Ruimin Hu, Suhui Li, Liang Liao
Xidian University, Xidian University, Xidian University, Nanyang Technological University
Abstract:
Spatiotemporal social behavior analysis is a technique that studies the social behavior patterns of objects and estimates their risks based on their trajectories. In social public scenarios such as train stations, hidden following behavior has become one of the most challenging issues due to its probability, greater than 25%, of evolving into violent events. In recent years, research on hidden following detection (HFD) has focused on differences in time series between hidden followers and normal pedestrians under two temporal characteristics: gaze and spatial distance. However, the time-domain representation of time series is irreversible and usually causes the loss of critical information. In this paper, we study in depth the expressive efficiency of time- and frequency-domain features of time series. By exploring how features can be recovered back to the source time series, we establish a fidelity estimation method for feature expression and a selection model for frequency-domain features based on the signal-to-distortion ratio (SDR). Experimental results demonstrate that feature fidelity and HFD performance are positively correlated, and that frequency-domain features achieve significantly better fidelity and HFD performance than time-domain features. On both real and simulated datasets, the accuracy of the proposed method is increased by 3%, and the gaze-only module is improved by 10%. This work explores new approaches to optimal feature selection based on fidelity, new patterns for efficient feature expression of hidden following behavior, and the mechanism of multimodal collaborative identification.
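For reference, a common form of the signal-to-distortion ratio used to score how faithfully a reconstruction \hat{s} recovered from the features matches the source series s is (the paper's exact definition may differ):

\mathrm{SDR}(s, \hat{s}) = 10 \log_{10} \frac{\lVert s \rVert^{2}}{\lVert s - \hat{s} \rVert^{2}}.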



Paperid:62
Authors:Sifei Li, Yuxin Zhang, Fan Tang, Chongyang Ma, Weiming Dong, Changsheng Xu
MAIS, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, MAIS, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Kuaishou Technology, MAIS, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, MAIS, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences
Abstract:
With the development of diffusion models, text-guided image style transfer has demonstrated highly controllable and high-quality results. However, the utilization of text for diverse music style transfer poses significant challenges, primarily due to the limited availability of matched audio-text datasets. Music, being an abstract and complex art form, exhibits variations and intricacies even within the same genre, thereby making accurate textual descriptions challenging. This paper presents a music style transfer approach that effectively captures musical attributes using minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we utilize a bias-reduced stylization technique to get stable results. Experimental results demonstrate that our method can transfer the style of specific instruments, as well as incorporate natural sounds to compose melodies. Samples and code are available at https://lsfhuihuiff.github.io/MusicTI/.



Paperid:63
Authors:Han Lu, Xiahai Zhuang, Qiang Luo
Fudan University, Fudan University, Fudan University
Abstract:
The human brain can effortlessly and reliably perceive emotions, whereas existing facial emotion recognition (FER) methods suffer from drawbacks such as complex model structures, high storage requirements, and poor interpretability. Inspired by the role of emotion concepts in visual perception coding within the human brain, we propose a dual-pathway framework emulating the neural computation of emotion recognition. Specifically, the two pathways are designed to model the representation of emotion concepts in the brain and the visual perception process, respectively. For the former, we adopt a disentangled approach to extract emotion concepts from complex facial geometric attributes; for the latter, we employ an emotional confidence evaluation strategy to determine which concept is optimal for regularizing the perceptual coding. The proposed concept-regularized coding strategy endows the framework with flexibility and interpretability, as well as good performance on several benchmark FER datasets.



Paperid:64
Authors:Bingjun Luo, Zewen Wang, Jinpeng Wang, Junjie Zhu, Xibin Zhao, Yue Gao
Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University
Abstract:
Illumination variation has been a long-term challenge in real-world facial expression recognition (FER). Under uncontrolled or non-visible light conditions, near-infrared (NIR) imaging can provide a simple and alternative solution to obtain high-quality images and supplement the geometric and texture details that are missing in the visible (VIS) domain. Due to the lack of large-scale NIR facial expression datasets, directly extending VIS FER methods to the NIR spectrum may be ineffective. Additionally, previous heterogeneous image synthesis methods are restricted by low controllability without prior task knowledge. To tackle these issues, we present the first approach, called NIR-FER Stochastic Differential Equations (NFER-SDE), which transforms facial expression appearance between heterogeneous modalities to address the overfitting problem on small-scale NIR data. NFER-SDE takes the whole VIS source image as input and, together with domain-specific knowledge, guides the preservation of modality-invariant information in the high-frequency content of the image. Extensive experiments and ablation studies show that NFER-SDE significantly improves the performance of NIR FER and achieves state-of-the-art results on the only two available NIR FER datasets, Oulu-CASIA and Large-HFE.



Paperid:65
Authors:Gehua Ma, He Wang, Jingyuan Zhao, Rui Yan, Huajin Tang
College of Computer Science and Technology, Zhejiang University The State Key Lab of Brain-Machine Intelligence, Zhejiang University, College of Computer Science and Technology, Zhejiang University The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Group Data, The Great Eastern Life Assurance Company Limited Department of Statistics & Data Science, National University of Singapore, College of Computer Science and Technology, Zhejiang University of Technology, College of Computer Science and Technology, Zhejiang University The State Key Lab of Brain-Machine Intelligence, Zhejiang University MOE Frontier Science Center for Brain Science and Brain-Machine Integration, Zhejiang University
Abstract:
Existing approaches usually perform spatiotemporal representation in the spatial and temporal dimensions, respectively, which isolates the spatial and temporal natures of the target and leads to suboptimal embeddings. Neuroscience research has shown that the mammalian brain's entorhinal-hippocampal system provides efficient graph representations for general knowledge. Moreover, entorhinal grid cells present concise spatial representations, while hippocampal place cells represent perception conjunctions effectively. Thus, the entorhinal-hippocampal system provides a novel angle for spatiotemporal representation, which inspires us to propose the SpatioTemporal aware Embedding framework (STE) and apply it to POIs (STEP). STEP considers two types of POI-specific representations: sequential representation and spatiotemporal conjunctive representation, learned using sparse unlabeled data based on the proposed graph-building policies. Notably, STEP jointly represents the spatiotemporal natures of POIs using both observations and contextual information from integrated spatiotemporal dimensions by constructing a spatiotemporal context graph. Furthermore, we introduce a successive POI recommendation method using STEP, which achieves state-of-the-art performance on two benchmarks. In addition, we demonstrate the excellent performance of the STE representation approach in other spatiotemporal representation-centered tasks through a case study of the traffic flow prediction problem. Therefore, this work provides a novel solution to spatiotemporal representation and paves a new way for spatiotemporal modeling-related tasks.



Paperid:66
Authors:Yuanyuan Mao, Xin Lin, Qin Ni, Liang He
Shanghai Key Laboratory of Multidimensional Information Processing, ECNU, Shanghai, China Department of Computer Science and Technology, East China Normal University, Shanghai Key Laboratory of Multidimensional Information Processing, ECNU, Shanghai, China Department of Computer Science and Technology, East China Normal University, Key Laboratory of Multilingual Education with AI, Shanghai International Studies University, Shanghai Key Laboratory of Multidimensional Information Processing, ECNU, Shanghai, China Department of Computer Science and Technology, East China Normal University
Abstract:
As a foundational component of cognitive intelligence, theory of mind (ToM) can make AI more closely resemble human thought processes, thereby enhancing their interaction and collaboration with humans. In particular, it can significantly improve a model's comprehension of videos in complex scenes. However, current video question answering (VideoQA) datasets focus on studying causal reasoning within events, with few of them genuinely incorporating human ToM. Consequently, there is a lack of development in ToM reasoning tasks within the area of VideoQA. This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM. BDIQA is inspired by the cognitive development of children's ToM and addresses the current deficiencies in machine ToM within datasets and tasks. Specifically, it offers tasks at two difficulty levels, assessing Belief, Desire and Intention (BDI) reasoning in both simple and complex scenarios. We conduct evaluations on several mainstream methods of VideoQA and diagnose their capabilities with zero-shot, few-shot and supervised learning. We find that the performance of pre-trained models on cognitive reasoning tasks remains unsatisfactory. To counter this challenge, we undertake thorough analysis and experimentation, ultimately presenting two guidelines, derived from ablation analysis, to enhance cognitive reasoning.



Paperid:67
Authors:Junseok Park, Yoonsung Kim, Hee bin Yoo, Min Whoo Lee, Kibeom Kim, Won-Seok Choi, Minsu Lee, Byoung-Tak Zhang
Seoul National University, Seoul National University, Seoul National University, Seoul National University, Seoul National University, Seoul National University, Seoul National University AI Institute of Seoul National University (AIIS), Seoul National University AI Institute of Seoul National University (AIIS)
Abstract:
Toddlers evolve from free exploration with sparse feedback to exploiting prior experiences for goal-directed learning with denser rewards. Drawing inspiration from this Toddler-Inspired Reward Transition, we set out to explore the implications of varying reward transitions when incorporated into Reinforcement Learning (RL) tasks. Central to our inquiry is the transition from sparse to potential-based dense rewards, which share optimal strategies regardless of reward changes. Through various experiments, including those in egocentric navigation and robotic arm manipulation tasks, we found that proper reward transitions significantly influence sample efficiency and success rates. Of particular note is the efficacy of the toddler-inspired Sparse-to-Dense (S2D) transition. Beyond these performance metrics, using the Cross-Density Visualizer technique, we observed that transitions, especially the S2D, smooth the policy loss landscape, promoting wide minima that enhance generalization in RL models.
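The sparse-to-dense switch described above builds on potential-based reward shaping, which is known to leave optimal policies unchanged. A minimal sketch follows, assuming a hypothetical potential function and switch_step schedule; it is illustrative only and not the authors' implementation.

import numpy as np

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    # Potential-based shaping (Ng et al., 1999): adding
    # F = gamma * phi(s') - phi(s) preserves the set of optimal policies.
    return r + gamma * phi(s_next) - phi(s)

def s2d_reward(r_sparse, s, s_next, step, phi, switch_step=50_000, gamma=0.99):
    # Hypothetical S2D schedule: keep the sparse reward early on, then switch
    # to the potential-shaped (dense) reward after `switch_step` interactions.
    if step < switch_step:
        return r_sparse
    return shaped_reward(r_sparse, s, s_next, phi, gamma)

# Example potential: negative distance to an assumed goal position.
goal = np.array([1.0, 1.0])
phi = lambda s: -np.linalg.norm(np.asarray(s) - goal)
print(s2d_reward(0.0, [0.0, 0.0], [0.5, 0.5], step=60_000, phi=phi))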



Paperid:68
Authors:Xuerui Qiu, Rui-Jie Zhu, Yuhong Chou, Zhaorui Wang, Liang-Jian Deng, Guoqi Li
Institute of Automation, Chinese Academy of Sciences School of Future Technology, University of Chinese Academy of Sciences University of Electronic Science and Technology of China, University of California, Santa Cruz, Xi’an Jiaotong University, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, Institute of Automation, Chinese Academy of Sciences Peng Cheng Laboratory
Abstract:
Spiking neural networks (SNNs) are emerging as an energy-efficient alternative to traditional artificial neural networks (ANNs) due to their unique spike-based, event-driven nature. Coding is crucial in SNNs as it converts external input stimuli into spatio-temporal feature sequences. However, most existing deep SNNs rely on direct coding that generates powerless spike representations and lacks the temporal dynamics inherent in human vision. Hence, we introduce Gated Attention Coding (GAC), a plug-and-play module that leverages the multi-dimensional gated attention unit to efficiently encode inputs into powerful representations before feeding them into the SNN architecture. GAC functions as a preprocessing layer that does not disrupt the spike-driven nature of the SNN, making it amenable to efficient neuromorphic hardware implementation with minimal modifications. Through an observer model theoretical analysis, we demonstrate that GAC's attention mechanism improves temporal dynamics and coding efficiency. Experiments on the CIFAR10/100 and ImageNet datasets demonstrate that GAC achieves state-of-the-art accuracy with remarkable efficiency. Notably, we improve top-1 accuracy by 3.10% on CIFAR100 with only 6 time steps and by 1.07% on ImageNet while reducing energy usage to 66.9% of previous works. To the best of our knowledge, this is the first exploration of an attention-based dynamic coding scheme in deep SNNs, with exceptional effectiveness and efficiency on large-scale datasets. Code is available at https://github.com/bollossom/GAC.
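To picture what gating the input code before the spiking layers could look like, the sketch below repeats a static image over T timesteps and modulates it with learned channel and temporal sigmoid gates. It is a generic stand-in under assumed dimensions and gate design, not the paper's GAC module.

import torch
import torch.nn as nn

class GatedInputCoding(nn.Module):
    # Illustrative encoder placed before an SNN: the image is repeated over
    # T timesteps and scaled by channel and temporal attention gates, giving
    # a time-varying analog code instead of plain direct coding.
    def __init__(self, channels, timesteps):
        super().__init__()
        self.timesteps = timesteps
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.temporal_gate = nn.Parameter(torch.zeros(timesteps, 1, 1, 1, 1))

    def forward(self, x):                                    # x: (B, C, H, W)
        t, b = self.timesteps, x.shape[0]
        x = x.unsqueeze(0).repeat(t, 1, 1, 1, 1)             # (T, B, C, H, W)
        g_c = self.channel_gate(x.flatten(0, 1)).view(t, b, -1, 1, 1)
        g_t = torch.sigmoid(self.temporal_gate)              # (T, 1, 1, 1, 1)
        return x * g_c * g_t                                 # time-varying code

coder = GatedInputCoding(channels=3, timesteps=4)
print(coder(torch.randn(2, 3, 32, 32)).shape)                # (4, 2, 3, 32, 32)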



Paperid:69
Authors:Jiangrong Shen, Wenyao Ni, Qi Xu, Huajin Tang
Zhejiang University, Zhejiang University, Dalian University of Technology, Zhejiang University
Abstract:
The next generation of machine intelligence requires the capability of continual learning, to acquire new knowledge without forgetting the old while conserving limited computing resources. Spiking neural networks (SNNs), compared to artificial neural networks (ANNs), have more characteristics that align with biological neurons, which may make them a helpful potential gating function for knowledge maintenance in neural networks. Inspired by the selective sparse activation principle of context gating in biological systems, we present a novel SNN model with selective activation to achieve continual learning. The trace-based K-Winner-Take-All (K-WTA) and variable-threshold components are designed to form sparse selective activation in the spatial and temporal dimensions of spiking neurons, which promotes subpopulations of neurons to activate for specific tasks. As a result, continual learning can be maintained by routing different tasks via different populations of neurons in the network. The experiments are conducted on the MNIST and CIFAR10 datasets under the class-incremental setting. The results show that the proposed SNN model achieves performance competitive with, and even surpassing, other regularization-based methods deployed with traditional ANNs.
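A minimal sketch of a K-Winner-Take-All spiking step with an activity trace is shown below: only the k most eligible neurons are allowed to fire, which routes different tasks to different sub-populations. The scoring rule, threshold, and trace decay are assumptions, not the paper's exact trace-based K-WTA formulation.

import torch

def k_wta_spikes(membrane, trace, k, threshold=1.0, trace_decay=0.9):
    # Rank neurons by membrane potential plus accumulated trace, keep the
    # top-k, and let only those fire if they also exceed the threshold.
    score = membrane + trace
    topk = torch.topk(score, k, dim=-1).indices
    mask = torch.zeros_like(membrane).scatter_(-1, topk, 1.0)
    spikes = ((membrane >= threshold) & (mask > 0)).float()
    trace = trace_decay * trace + spikes          # per-neuron activity trace
    return spikes, trace

v = torch.rand(1, 100) * 2.0                      # toy membrane potentials
spk, tr = k_wta_spikes(v, torch.zeros(1, 100), k=10)
print(spk.sum().item())                           # at most 10 neurons spike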



Paperid:70
Authors:Shanshan Wang, Zhen Zeng, Xun Yang, Ke Xu, Xingyi Zhang
Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, HeFei, China Institutes of Physical Science and Information Technology, Anhui University, HeFei, China, Institutes of Physical Science and Information Technology, Anhui University, HeFei, China, University of Science and Technology of China, HeFei, China, Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, HeFei, China, Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, HeFei, China
Abstract:
Cognitive Diagnosis Modeling aims to infer students' proficiency levels on knowledge concepts from their response logs. Existing methods typically model students' response processes as the interaction between students and exercises or concepts, based on handcrafted or deeply learned interaction functions. Despite their promising achievements, they fail to consider the relationship between students' cognitive states and affective states in learning, e.g., the feelings of frustration, boredom, or confusion with the learning content, which is insufficient for comprehensive cognitive diagnosis in intelligent education. To fill this research gap, we propose a novel Affect-aware Cognitive Diagnosis (ACD) model which can effectively diagnose the knowledge proficiency levels of students by taking affective factors into consideration. Specifically, we first design a student affect perception module under the assumption that the affective state is jointly influenced by the student's affect trait and the difficulty of the exercise. Then, the inferred affective distribution is further used to estimate the student's subjective factors, i.e., guessing and slipping. Finally, we integrate the estimated guessing and slipping parameters with the basic neural cognitive diagnosis framework based on the DINA model, which facilitates the modeling of complex exercising interactions in a more accurate and interpretable fashion. Besides, we also extend our affect perception module to an unsupervised learning setting based on contrastive learning, thus significantly improving the compatibility of our ACD. To the best of our knowledge, we are the first to unify cognition modeling and affect modeling in the same framework for student cognitive diagnosis. Extensive experiments on real-world datasets clearly demonstrate the effectiveness of our ACD. Our code is available at https://github.com/zeng-zhen/ACD.
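The DINA-style combination of mastery with guessing and slipping referred to above can be written in one line. In the sketch below, eta stands in for the base cognitive-diagnosis output, and the guess/slip values are toy numbers rather than estimates from the affect perception module.

import torch

def affect_aware_response_prob(eta, guess, slip):
    # DINA-style response model: P(correct) = guess * (1 - eta) + (1 - slip) * eta,
    # where eta is the probability that the student masters the required concepts.
    return guess * (1.0 - eta) + (1.0 - slip) * eta

eta = torch.tensor([0.2, 0.8])        # toy mastery estimates
guess = torch.tensor([0.25, 0.25])    # assumed guessing probabilities
slip = torch.tensor([0.10, 0.10])     # assumed slipping probabilities
print(affect_aware_response_prob(eta, guess, slip))   # tensor([0.3800, 0.7700])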



Paperid:71
Authors:Yiming Wang, Bin Zhang, Yujiao Tang
Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
Electroencephalography (EEG) has proven to be effective in emotion analysis. However, current methods struggle with individual variations, complicating the generalization of models trained on data from source subjects to unseen target subjects. To tackle this issue, we propose the Denoising Mixed Mutual Reconstruction (DMMR) model, employing a two-stage approach of pre-training followed by fine-tuning. During the pre-training phase, DMMR leverages self-supervised learning through a multi-decoder autoencoder, which encodes and reconstructs features of one subject, aiming to generate features resembling those from other subjects within the same category, thereby encouraging the encoder to learn subject-invariant features. We introduce a hidden-layer mixed data augmentation approach to mitigate the limitations posed by the scarcity of source data, thereby extending the method to a two-stage process. To bolster stability against noise, we incorporate a noise injection method, named “Time Steps Shuffling”, into the input data. During the fine-tuning phase, an emotion classifier is integrated to extract emotion-related features. Experimental accuracy on the SEED and SEED-IV datasets reached 88.27% (±5.62) and 72.70% (±8.01), respectively, demonstrating state-of-the-art and comparable performance and showcasing the superiority of DMMR. The proposed data augmentation and noise injection methods were observed to complementarily enhance accuracy and stability, thus alleviating the aforementioned issues.
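As a toy illustration of the Time Steps Shuffling noise injection, the snippet below permutes the time axis of an input feature sequence; whether DMMR shuffles whole steps or segments, and at which layer, is an assumption here.

import numpy as np

def time_steps_shuffling(x, rng=None):
    # Randomly permute the time axis of an EEG feature sequence so the model
    # must rely on cues that are robust to the order of time steps.
    # x: (time_steps, features)
    rng = np.random.default_rng() if rng is None else rng
    return x[rng.permutation(x.shape[0])]

eeg = np.arange(12, dtype=float).reshape(6, 2)   # toy 6-step, 2-feature sample
print(time_steps_shuffling(eeg, np.random.default_rng(0)))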



Paperid:72
Authors:Jiyuan Zhang, Shiyan Chen, Yajing Zheng, Zhaofei Yu, Tiejun Huang
Peking University, Peking University, Peking University, Peking University, Peking University
Abstract:
The de-occlusion problem, involving extracting clear background images by removing foreground occlusions, holds significant practical importance but poses considerable challenges. Most current research predominantly focuses on generating discrete images from calibrated camera arrays, but this approach often struggles with dense occlusions and fast motions due to limited perspectives and motion blur. To overcome these limitations, an effective solution requires the integration of multi-view visual information. The spike camera, as an innovative neuromorphic sensor, shows promise with its ultra-high temporal resolution and dynamic range. In this study, we propose a novel approach that utilizes a single spike camera for continuous multi-view imaging to address occlusion removal. By rapidly moving the spike camera, we capture a dense stream of spikes from occluded scenes. Our model, SpkOccNet, processes these spikes by integrating multi-view spatial-temporal information via a long-short-window feature extractor (LSW) and employs a novel cross-view mutual attention-based module (CVA) for effective fusion and refinement. Additionally, to facilitate research in occlusion removal, we introduce the S-OCC dataset, which consists of real-world spike-based data. Experimental results demonstrate the efficiency and generalization capabilities of our model in effectively removing dense occlusions across diverse scenes. Public project page: https://github.com/Leozhangjiyuan/SpikeDeOcclusion.



Paperid:73
Authors:Yuhang Zhang, Yue Yao, Xuannan Liu, Lixiong Qin, Wenjing Wang, Weihong Deng
Beijing University of Posts and Telecommunications, The Australian National University, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Facial expression recognition (FER) models are typically trained on datasets with a fixed number of seven basic classes. However, recent research works (Cowen et al. 2021; Bryant et al. 2022; Kollias 2023) point out that there are far more expressions than the basic ones. Thus, when these models are deployed in the real world, they may encounter unknown classes, such as compound expressions that cannot be classified into existing basic classes. To address this issue, we propose the open-set FER task for the first time. Though there are many existing open-set recognition methods, we argue that they do not work well for open-set FER because FER data are all human faces with very small inter-class distances, which makes the open-set samples very similar to closed-set samples. In this paper, we are the first to transform the disadvantage of small inter-class distance into an advantage by proposing a new way for open-set FER. Specifically, we find that small inter-class distance allows for sparsely distributed pseudo labels of open-set samples, which can be viewed as symmetric noisy labels. Based on this novel observation, we convert open-set FER to a noisy label detection problem. We further propose a novel method that incorporates attention map consistency and cycle training to detect the open-set samples. Extensive experiments on various FER datasets demonstrate that our method clearly outperforms state-of-the-art open-set recognition methods by large margins. Code is available at https://github.com/zyh-uaiaaaa.



Paperid:74
Authors:Feiyu Zhu, Reid Simmons
Carnegie Mellon University, Carnegie Mellon University
Abstract:
Large language models contain noisy general knowledge of the world, yet are hard to train or fine-tune. In contrast, cognitive architectures have excellent interpretability and are flexible to update but require a lot of manual work to instantiate. In this work, we combine the best of both worlds: bootstrapping a cognition-based model with the noisy knowledge encoded in large language models. Through an embodied agent doing kitchen tasks, we show that our proposed framework yields better efficiency than an agent based entirely on large language models. Our experiments also indicate that the cognitive agent bootstrapped using this framework can generalize to novel environments and be scaled to complex tasks.



Paperid:75
Authors:Yangfu Zhu, Yue Xia, Meiling Li, Tingting Zhang, Bin Wu
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Personality detection is a fundamental task in user psychology research. One of the biggest challenges in personality detection lies in the limited quantity of labeled data, which is collected by completing personality questionnaires and is therefore very time-consuming and labor-intensive. Most existing works are devoted to learning rich representations of posts based on labeled data. However, they still suffer from the inherent weakness of label scarcity, which potentially restricts the capability of the model to deal with unseen data. In this paper, we construct a heterogeneous personality graph for each labeled and unlabeled user and develop a novel psycholinguistically augmented graph neural network to detect personality in a semi-supervised manner, namely Semi-PerGCN. Specifically, our model first explores a supervised Personality Graph Neural Network (PGNN) to refine labeled user representations on the heterogeneous graph. For the remaining massive unlabeled users, we utilize the empirical psychological knowledge of the Linguistic Inquiry and Word Count (LIWC) lexicon for multi-view graph augmentation and apply unsupervised graph consistency constraints on the parameter-shared PGNN. During the learning process on the finite labeled users, noise-invariant learning on a large number of unlabeled users is combined to enhance generalization ability. Extensive experiments on three real-world datasets, YouTube, PAN2015, and MyPersonality, demonstrate the effectiveness of our Semi-PerGCN in personality detection, especially in scenarios with limited labeled users.



Paperid:76
Authors:Namhyuk Ahn, Junsoo Lee, Chunggi Lee, Kunhee Kim, Daesik Kim, Seung-Hun Nam, Kibeom Hong
NAVER WEBTOON AI, NAVER WEBTOON AI, NAVER WEBTOON AI Harvard University, KAIST, NAVER WEBTOON AI, NAVER WEBTOON AI, SwatchOn
Abstract:
Recent progress in large-scale text-to-image models has yielded remarkable accomplishments, finding various applications in the art domain. However, expressing the unique characteristics of an artwork (e.g., brushwork, color tone, or composition) with text prompts alone may encounter limitations due to the inherent constraints of verbal description. To this end, we introduce DreamStyler, a novel framework designed for artistic image synthesis, proficient in both text-to-image synthesis and style transfer. DreamStyler optimizes a multi-stage textual embedding with a context-aware text prompt, resulting in prominent image quality. In addition, with content and style guidance, DreamStyler exhibits flexibility to accommodate a range of style references. Experimental results demonstrate its superior performance across multiple scenarios, suggesting its promising potential in artistic product creation. Project page: https://nmhkahn.github.io/dreamstyler/



Paperid:77
Authors:Seungjun An, Seonghoon Park, Gyeongnyeon Kim, Jeongyeol Baek, Byeongwon Lee, Seungryong Kim
Korea University, Korea University, Korea University, SK Telecom, SK Telecom, Korea University
Abstract:
With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various strategies to address this challenge, they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-based VOD methods have shown promising results, their reliance on multiple inputs and additional network complexity to incorporate temporal information limits their practical applicability. In this paper, we propose a novel approach to single-image object detection, called Context Enhanced TRansformer (CETR), by incorporating temporal context into DETR using a newly designed memory module. To efficiently store temporal information, we construct a class-wise memory that collects contextual information across data. Additionally, we present a classification-based sampling technique to selectively utilize the relevant memory for the current image. During testing, we introduce a test-time memory adaptation method that updates individual memory functions by considering the test distribution. Experiments on the CityCam and ImageNet VID datasets demonstrate the efficiency of the framework on various video systems. The project page and code will be made available at: https://ku-cvlab.github.io/CETR.



Paperid:78
Authors:Xiaoqi An, Lin Zhao, Chen Gong, Nannan Wang, Di Wang, Jian Yang
PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology State Key Laboratory of Integrated Services Networks, Xidian University, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology State Key Laboratory of Integrated Services Networks, Xidian University, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, State Key Laboratory of Integrated Services Networks, Xidian University, State Key Laboratory of Integrated Services Networks, Xidian University, PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology
Abstract:
High-resolution representation is essential for achieving good performance in human pose estimation models. To obtain such features, existing works utilize high-resolution input images or fine-grained image tokens. However, this dense high-resolution representation brings a significant computational burden. In this paper, we address the following question: "Only sparse human keypoint locations are detected for human pose estimation, so is it really necessary to describe the whole image in a dense, high-resolution manner?" Based on dynamic transformer models, we propose a framework that only uses Sparse High-resolution Representations for human Pose estimation (SHaRPose). In detail, SHaRPose consists of two stages. At the coarse stage, the relations between image regions and keypoints are dynamically mined while a coarse estimation is generated. Then, a quality predictor is applied to decide whether the coarse estimation results should be refined. At the fine stage, SHaRPose builds sparse high-resolution representations only on the regions related to the keypoints and provides refined high-precision human pose estimations. Extensive experiments demonstrate the outstanding performance of the proposed method. Specifically, compared to the state-of-the-art method ViTPose, our model SHaRPose-Base achieves 77.4 AP (+0.5 AP) on the COCO validation set and 76.7 AP (+0.5 AP) on the COCO test-dev set, and infers 1.4× faster than ViTPose-Base. Code is available at https://github.com/AnxQ/sharpose.



Paperid:79
Authors:Anastasia Antsiferova, Khaled Abud, Aleksandr Gushchin, Ekaterina Shumitskaya, Sergey Lavrushkin, Dmitriy Vatolin
MSU Institute for Artificial Intelligence ISP RAS Research Center for Trusted Artificial Intelligence, Lomonosov Moscow State University, MSU Institute for Artificial Intelligence ISP RAS Research Center for Trusted Artificial Intelligence Lomonosov Moscow State University, ISP RAS Research Center for Trusted Artificial Intelligence Lomonosov Moscow State University, MSU Institute for Artificial Intelligence ISP RAS Research Center for Trusted Artificial Intelligence, MSU Institute for Artificial Intelligence ISP RAS Research Center for Trusted Artificial Intelligence Lomonosov Moscow State University
Abstract:
Nowadays, neural-network-based image- and video-quality metrics perform better than traditional methods. However, they have also become more vulnerable to adversarial attacks that increase metrics' scores without improving visual quality. The existing benchmarks of quality metrics compare their performance in terms of correlation with subjective quality and calculation time. Nonetheless, the adversarial robustness of image-quality metrics is also an area worth researching. This paper analyzes modern metrics' robustness to different adversarial attacks. We adapted adversarial attacks from computer vision tasks and compared attacks' efficiency against 15 no-reference image- and video-quality metrics. Some metrics showed high resistance to adversarial attacks, which makes their usage in benchmarks safer than that of vulnerable metrics. The benchmark accepts submissions of new metrics from researchers who want to make their metrics more robust to attacks or to find such metrics for their needs. The latest results can be found online: https://videoprocessing.ai/benchmarks/metrics-robustness.html.
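A generic single-step probe of the kind adapted from computer vision, an FGSM-style gradient-sign step that pushes a differentiable no-reference metric's score up with no regard for perceptual quality, is sketched below. The toy metric and the epsilon budget are placeholders, not the attacks or metrics evaluated in the benchmark.

import torch

def fgsm_metric_attack(metric, image, eps=2.0 / 255):
    # One signed-gradient ascent step on the metric score within an L-inf budget.
    # `metric` is assumed to be any differentiable callable mapping an image
    # batch to per-image scores.
    image = image.clone().requires_grad_(True)
    metric(image).sum().backward()
    return (image + eps * image.grad.sign()).clamp(0.0, 1.0).detach()

metric = lambda x: x.mean(dim=(1, 2, 3))   # toy "quality metric": mean brightness
x = torch.rand(1, 3, 64, 64)
x_adv = fgsm_metric_attack(metric, x)
print(metric(x).item(), metric(x_adv).item())   # the score increases slightly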



Paperid:80
Authors:Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha
AWS AI Labs, AWS AI Labs, AWS AI Labs, AWS AI Labs, School of Computing at University of Utah, AWS AI Labs
Abstract:
We propose DocFormerv2, a multimodal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents (beyond mere OCR predictions), e.g., extracting information from a form, VQA for documents, and other tasks. VDU is challenging as it needs a model to make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer which takes vision, language and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically, i.e., two novel document tasks on the encoder and one on the auto-regressive decoder. The unsupervised tasks have been carefully designed to ensure that the pre-training encourages local-feature alignment between multiple modalities. When evaluated on nine challenging datasets, DocFormerv2 shows state-of-the-art performance over strong baselines on all of them, e.g., TabFact (+4.3%), InfoVQA (+1.4%), and FUNSD (+1.0%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene text, DocFormerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLI and Flamingo) on these tasks. Extensive ablations show that due to its novel pre-training tasks, DocFormerv2 understands multiple modalities better than prior art in VDU.



Paperid:81
Authors:Zhongjie Ba, Qingyu Liu, Zhenguang Liu, Shuang Wu, Feng Lin, Li Lu, Kui Ren
State Key Lab. of Blockchain and Data Security, Zhejiang University, Hangzhou, China ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, China, State Key Lab. of Blockchain and Data Security, Zhejiang University, Hangzhou, China ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, China, State Key Lab. of Blockchain and Data Security, Zhejiang University, Hangzhou, China ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, China, Black Sesame Technologies, Singapore, State Key Lab. of Blockchain and Data Security, Zhejiang University, Hangzhou, China ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, China, State Key Lab. of Blockchain and Data Security, Zhejiang University, Hangzhou, China ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, China, State Key Lab. of Blockchain and Data Security, Zhejiang University, Hangzhou, China ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, China
Abstract:
Deepfake technology has given rise to a spectrum of novel and compelling applications. Unfortunately, the widespread proliferation of high-fidelity fake videos has led to pervasive confusion and deception, shattering our faith that seeing is believing. One aspect that has been overlooked so far is that current deepfake detection approaches may easily fall into the trap of overfitting, focusing only on forgery clues within one or a few local regions. Moreover, existing works heavily rely on neural networks to extract forgery features, lacking theoretical constraints guaranteeing that sufficient forgery clues are extracted and superfluous features are eliminated. These deficiencies culminate in unsatisfactory accuracy and limited generalizability in real-life scenarios. In this paper, we try to tackle these challenges through three designs: (1) we present a novel framework to capture broader forgery clues by extracting multiple non-overlapping local representations and fusing them into a global semantic-rich feature; (2) based on the information bottleneck theory, we derive a Local Information Loss to guarantee the orthogonality of local representations while preserving comprehensive task-relevant information; (3) further, to fuse the local representations and remove task-irrelevant information, we arrive at a Global Information Loss through a theoretical analysis of mutual information. Empirically, our method achieves state-of-the-art performance on five benchmark datasets. Our code is available at https://github.com/QingyuLiu/Exposing-the-Deception, hoping to inspire researchers.



Paperid:82
Authors:Shuanghao Bai, Min Zhang, Wanqi Zhou, Siteng Huang, Zhirong Luan, Donglin Wang, Badong Chen
Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China, Westlake University Institute of Advanced Technology, Westlake Institute for Advanced Study, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China RIKEN AIP, Westlake University Institute of Advanced Technology, Westlake Institute for Advanced Study, School of Electrical Engineering, Xi’an University of Technology, Xi'an, China, Westlake University Institute of Advanced Technology, Westlake Institute for Advanced Study, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China
Abstract:
Recently, despite the unprecedented success of large pre-trained visual-language models (VLMs) on a wide range of downstream tasks, the real-world unsupervised domain adaptation (UDA) problem is still not well explored. Therefore, in this paper, we first experimentally demonstrate that unsupervised-trained VLMs can significantly reduce the distribution discrepancy between source and target domains, thereby improving the performance of UDA. However, a major challenge for directly deploying such models on downstream UDA tasks is prompt engineering, which requires aligning the domain knowledge of source and target domains, since the performance of UDA depends heavily on a good domain-invariant representation. We further propose a Prompt-based Distribution Alignment (PDA) method to incorporate the domain knowledge into prompt learning. Specifically, PDA employs a two-branch prompt-tuning paradigm, namely a base branch and an alignment branch. The base branch focuses on integrating class-related representation into prompts, ensuring discrimination among different classes. To further minimize domain discrepancy, for the alignment branch, we construct feature banks for both the source and target domains and propose image-guided feature tuning (IFT) to make the input attend to feature banks, which effectively integrates self-enhanced and cross-domain features into the model. In this way, these two branches can be mutually promoted to enhance the adaptation of VLMs for UDA. We conduct extensive experiments on three benchmarks to demonstrate that our proposed PDA achieves state-of-the-art performance. The code is available at https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment.



Paperid:83
Authors:Peijun Bao, Yong Xia, Wenhan Yang, Boon Poh Ng, Meng Hwa Er, Alex C. Kot
Nanyang Technological University, Northwestern Polytechnical University, Peng Cheng Laboratory, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University
Abstract:
This paper for the first time leverages multi-modal videos for weakly-supervised temporal video grounding. As labeling the video moment is labor-intensive and subjective, weakly-supervised approaches have gained increasing attention in recent years. However, these approaches could inherently compromise performance due to inadequate supervision. Therefore, to tackle this challenge, we for the first time pay attention to exploiting complementary information extracted from multi-modal videos (e.g., RGB frames, optical flows), where richer supervision is naturally introduced in the weakly-supervised context. Our motivation is that by integrating different modalities of the videos, the model learns from synergistic supervision and can thereby attain superior generalization capability. However, addressing multiple modalities would also inevitably introduce additional computational overhead, and might become inapplicable if a particular modality is inaccessible. To solve this issue, we adopt a novel route: building a multi-modal distillation algorithm to capitalize on the multi-modal knowledge as supervision for model training, while still being able to work with only the single-modal input during inference. As such, we can utilize the benefits brought by the supplementary nature of multiple modalities, without compromising the applicability in practical scenarios. Specifically, we first propose a cross-modal mutual learning framework and train a sophisticated teacher model to learn collaboratively from the multi-modal videos. Then we identify two sorts of knowledge from the teacher model, i.e., temporal boundaries and the semantic activation map. We then devise a local-global distillation algorithm to transfer this knowledge to a student model with single-modal input at both local and global levels. Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with/without multi-modal inputs.



Paperid:84
Authors:Peijun Bao, Zihao Shao, Wenhan Yang, Boon Poh Ng, Meng Hwa Er, Alex C. Kot
Nanyang Technological University, Peking University, Peng Cheng Laboratory, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University
Abstract:
Natural language video localization plays a pivotal role in video understanding, and leveraging weakly-labeled data is considered a promising approach to circumvent the labor-intensive process of manual annotation. However, this approach encounters two significant challenges: 1) limited input distribution, namely that the limited writing styles of the language queries, annotated by human annotators, hinder the model's generalization to real-world scenarios with diverse vocabularies and sentence structures; 2) incomplete ground truth, whose supervision guidance is insufficient. To overcome these challenges, we propose an omnipotent distillation algorithm with large language models (LLMs). The distribution of the input sample is enriched to obtain diverse multi-view versions, and a consistency constraint then regularizes their results for distillation. Specifically, we first train our teacher model with the proposed intra-model agreement, where multiple sub-models are supervised by each other. Then, we leverage the LLM to paraphrase the language query and distill the teacher model into a lightweight student model by enforcing consistency between the localization results of the paraphrased sentence and the original one. In addition, to assess the generalization of the model across different dimensions of language variation, we create extensive datasets by building upon existing datasets. Our experiments demonstrate substantial performance improvements across diverse kinds of language queries.
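A minimal sketch of the paraphrase-consistency distillation idea follows: the student localizes both the original query and an LLM paraphrase, and both predictions are pulled toward the teacher's span and toward each other. The interfaces (a student returning a normalized (start, end) span, L1 losses, a precomputed paraphrase) are assumptions for illustration only.

import torch
import torch.nn.functional as F

def paraphrase_consistency_loss(student, video_feats, q_orig, q_para, teacher_span):
    # Distill the teacher's span into the student while enforcing that the
    # original query and its paraphrase yield consistent localizations.
    p_orig = student(video_feats, q_orig)
    p_para = student(video_feats, q_para)
    distill = F.l1_loss(p_orig, teacher_span) + F.l1_loss(p_para, teacher_span)
    consist = F.l1_loss(p_orig, p_para)
    return distill + consist

# Toy student that ignores its inputs, only to make the sketch runnable.
student = lambda v, q: torch.tensor([0.2, 0.6])
loss = paraphrase_consistency_loss(student, None, "a dog jumps over a log",
                                   "a puppy leaps over a fallen tree",
                                   torch.tensor([0.25, 0.55]))
print(loss.item())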



Paperid:85
Authors:Qiqi Bao, Zheng Hui, Rui Zhu, Peiran Ren, Xuansong Xie, Wenming Yang
Tsinghua University, Institute for Intelligent Computing, Alibaba Group, City, University of London, Institute for Intelligent Computing, Alibaba Group, Institute for Intelligent Computing, Alibaba Group, Tsinghua University
Abstract:
The generative diffusion prior captured from off-the-shelf denoising diffusion generative models has recently attracted significant interest. However, existing attempts to adapt diffusion models to noisy inverse problems either fail to achieve satisfactory results or require a few thousand iterations to achieve high-quality reconstructions. In this work, we propose a diffusion-based image restoration method with error contraction and error correction (DiffECC). Two strategies are introduced to contract the restoration error in the posterior sampling process. First, we combine existing CNN-based approaches with diffusion models to ensure data consistency from the beginning. Second, to amplify the error contraction effects of the noise, a restart sampling algorithm is designed. In the error correction strategy, the estimation-correction idea is applied to both the data term and the prior term. Solving them iteratively within the diffusion sampling framework leads to superior image generation results. Experimental results for image restoration tasks such as super-resolution (SR), Gaussian deblurring, and motion deblurring demonstrate that our approach can reconstruct high-quality images compared with state-of-the-art sampling-based diffusion models.



Paperid:86
Authors:Xiaoyi Bao, Jie Qin, Siyang Sun, Xingang Wang, Yun Zheng
School of Artificial Intelligence, University of Chinese Academy of Sciences Institute of Automation, Chinese Academy of Sciences Alibaba Group, School of Artificial Intelligence, University of Chinese Academy of Sciences Institute of Automation, Chinese Academy of Sciences, Alibaba Group, Institute of Automation, Chinese Academy of Sciences, Alibaba Group
Abstract:
For few-shot semantic segmentation, the primary task is to extract class-specific intrinsic information from limited labeled data. However, the semantic ambiguity and inter-class similarity faced by previous methods limit the accuracy of pixel-level foreground-background classification. To alleviate these issues, we propose the Relevant Intrinsic Feature Enhancement Network (RiFeNet). To improve the semantic consistency of foreground instances, we propose an unlabeled branch as an efficient data utilization method, which teaches the model how to extract intrinsic features robust to intra-class differences. Notably, during testing, the proposed unlabeled branch is excluded without extra unlabeled data and computation. Furthermore, we extend the inter-class variability between foreground and background by proposing a novel multi-level prototype generation and interaction module. The different-grained complementarity between global and local prototypes allows for better distinction between similar categories. The qualitative and quantitative performance of RiFeNet surpasses the state-of-the-art methods on the PASCAL-5i and COCO benchmarks.



Paperid:87
Authors:Mazal Bethany, Brandon Wherry, Nishant Vishwamitra, Peyman Najafirad
University of Texas at San Antonio Secure AI and Autonomy Lab, University of Texas at San Antonio Secure AI and Autonomy Lab, University of Texas at San Antonio, University of Texas at San Antonio Secure AI and Autonomy Lab
Abstract:
Social media platforms are being increasingly used by malicious actors to share unsafe content, such as images depicting sexual activity, cyberbullying, and self-harm. Consequently, major platforms use artificial intelligence (AI) and human moderation to obfuscate such images to make them safer. Two critical needs for obfuscating unsafe images are that an accurate rationale for obfuscating image regions must be provided, and that the sensitive regions should be obfuscated (e.g., blurred) for users' safety. This process involves addressing two key problems: (1) the reason for obfuscating unsafe images demands that the platform provide an accurate rationale grounded in unsafe image-specific attributes, and (2) the unsafe regions in the image must be minimally obfuscated while still depicting the safe regions. In this work, we address these key issues by first performing visual reasoning with a vision-language model (VLM) conditioned on pre-trained unsafe image classifiers, providing an accurate rationale grounded in unsafe image attributes. We then propose a counterfactual explanation algorithm that minimally identifies and obfuscates unsafe regions for safe viewing: it first uses an unsafe image classifier attribution matrix to guide segmentation toward a more optimal subregion segmentation, followed by an informed greedy search to determine the minimum number of subregions required to modify the classifier's output, based on attribution scores. Extensive experiments on uncurated data from social networks emphasize the efficacy of our proposed method. We make our code available at: https://github.com/SecureAIAutonomyLab/ConditionalVLM
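The informed greedy search over attribution-ranked subregions can be pictured with a few lines of code. The sketch below is only illustrative: the region format, the zeroing stand-in for blurring, and the toy unsafe score are assumptions, not the paper's algorithm or API.

import numpy as np

def minimal_obfuscation(image, regions, attributions, classify, tau=0.5):
    # Obscure sub-regions in decreasing attribution order until the
    # unsafe-classifier score drops below `tau`, keeping the set minimal.
    img = image.copy()
    for idx in np.argsort(attributions)[::-1]:
        y0, y1, x0, x1 = regions[idx]
        img[y0:y1, x0:x1] = 0.0        # zeroing stands in for blurring here
        if classify(img) < tau:
            break
    return img

# Toy example: the "unsafe" score is just the fraction of bright pixels.
img = np.zeros((8, 8)); img[0:4, 0:4] = 1.0
regions = [(0, 4, 0, 4), (4, 8, 4, 8)]
safe = minimal_obfuscation(img, regions, np.array([0.9, 0.1]),
                           classify=lambda x: x.mean() * 4.0)
print(safe.max())   # 0.0 once the bright region has been obscured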



Paperid:88
Authors:Aneesh Bhattacharya, Manas Paranjape, Uttaran Bhattacharya, Aniket Bera
Purdue University IIIT Naya Raipur, Purdue University, Adobe Research, Purdue University
Abstract:
We present DanceAnyWay, a generative learning method to synthesize beat-guided dances of 3D human characters synchronized with music. Our method learns to disentangle the dance movements at the beat frames from the dance movements at all the remaining frames by operating at two hierarchical levels. At the coarser "beat" level, it encodes the rhythm, pitch, and melody information of the input music via dedicated feature representations only at the beat frames. It leverages them to synthesize the beat poses of the target dances using a sequence-to-sequence learning framework. At the finer "repletion" level, our method encodes similar rhythm, pitch, and melody information from all the frames of the input music via dedicated feature representations. It generates the full dance sequences by combining the synthesized beat and repletion poses and enforcing plausibility through an adversarial learning framework. Our training paradigm also enforces fine-grained diversity in the synthesized dances through a randomized temporal contrastive loss, which ensures different segments of the dance sequences have different movements and avoids motion freezing or collapsing to repetitive movements. We evaluate the performance of our approach through extensive experiments on the benchmark AIST++ dataset and observe improvements of about 7%-12% in motion quality metrics and 1.5%-4% in motion diversity metrics over the current baselines, respectively. We also conducted a user study to evaluate the visual quality of our synthesized dances. We noted that, on average, the samples generated by our method were about 9-48% more preferred by the participants and had a 4-27% better five-point Likert-scale score than the best available current baseline in terms of motion quality and synchronization. Our source code and project page are available at https://github.com/aneeshbhattacharya/DanceAnyWay.



Paperid:89
Authors:Swapnil Bhosale, Sauradip Nag, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
University of Surrey, University of Surrey, University of Surrey, Imperial College London, University of Surrey
Abstract:
Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions in the elegant Transformer decoder framework. Doing so enables the model to generate accurate event boundaries even from noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with 40+% faster convergence in training. Code: https://github.com/Surrey-UPLab/DiffSED



Paperid:90
Authors:Qi Bi, Shaodi You, Theo Gevers
University of Amsterdam, University of Amsterdam, University of Amsterdam
Abstract:
Learning scene semantics that generalize well to foggy conditions is important for safety-critical applications such as autonomous driving. Existing methods need both annotated clear images and foggy images to train a curriculum domain adaptation model. Unfortunately, these methods can only generalize to the target foggy domain that has been seen in the training stage, while foggy domains vary a lot in both urban-scene styles and fog styles. In this paper, we propose to learn scene segmentation that generalizes well to foggy scenes under the domain generalization setting, which does not involve any foggy images in the training stage and can generalize to arbitrary unseen foggy scenes. We argue that an ideal segmentation model that generalizes well to foggy scenes needs to simultaneously enhance the content, de-correlate the urban-scene style, and de-correlate the fog style. As the content (e.g., scene semantics) rests more in low-frequency features while the styles of the urban scene and fog rest more in high-frequency features, we propose a novel bi-directional wavelet guidance (BWG) mechanism to realize the above three objectives in a divide-and-conquer manner. With the aid of the Haar wavelet transformation, the low-frequency component is concentrated on the content enhancement self-attention, while the high-frequency component is shifted to the style and fog self-attention for de-correlation purposes. It is integrated into existing mask-level Transformer segmentation pipelines in a learnable fashion. Large-scale experiments are conducted on four foggy-scene segmentation datasets under a variety of interesting settings. The proposed method significantly outperforms existing directly-supervised, curriculum domain adaptation, and domain generalization segmentation methods. Source code is available at https://github.com/BiQiWHU/BWG.
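The low/high-frequency split that BWG builds on can be seen with a single-level 2D Haar transform, which separates a feature map into one low-frequency band (mostly content) and three high-frequency bands (carrying more of the style and fog variation). The hand-rolled transform below is a generic illustration, not the paper's implementation.

import numpy as np

def haar_split(x):
    # Single-level 2D Haar transform of an (H, W) map with even H and W,
    # using the orthonormal normalization (divide by 2).
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0          # low frequency: content
    lh = (a - b + c - d) / 2.0          # high frequency: style/fog cues
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, (lh, hl, hh)

feat = np.random.rand(64, 64)
ll, highs = haar_split(feat)
print(ll.shape, highs[0].shape)         # (32, 32) (32, 32)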



Paperid:91
Authors:Qi Bi, Jingjun Yi, Hao Zheng, Wei Ji, Yawen Huang, Yuexiang Li, Yefeng Zheng
Jarvis Research Center, Tencent YouTu Lab School of Remote Sensing and Information Engineering, Wuhan University, Jarvis Research Center, Tencent YouTu Lab School of Remote Sensing and Information Engineering, Wuhan University, Jarvis Research Center, Tencent YouTu Lab, Department of Electrical and Computer Engineering, University of Alberta, Jarvis Research Center, Tencent YouTu Lab, Medical AI ReSearch (MARS) Group, Guangxi Medical University, Jarvis Research Center, Tencent YouTu Lab
Abstract:
Domain-generalized medical image segmentation requires models to learn from multiple source domains and generalize well to arbitrary unseen target domains. Such a task is both technically challenging and clinically practical, due to the domain shift problem (i.e., images are collected from different hospitals and scanners). Existing methods focus on either learning shape-invariant representations or reaching consensus among the source domains. An ideal generalized representation is supposed to show similar pattern responses within the same channel for cross-domain images. However, to deal with the significant distribution discrepancy, the network tends to capture similar patterns with multiple channels, while different cross-domain patterns are also allowed to rest in the same channel. To address this issue, we propose to leverage channel-wise decoupled deep features as queries. With the aid of a cross-attention mechanism, the long-range dependency between deep and shallow features can be fully mined via self-attention and then guides the learning of generalized representations. Besides, a relaxed deep whitening transformation is proposed to learn channel-wise decoupled features in a feasible way. The proposed decoupled feature query (DFQ) scheme can be seamlessly integrated into the Transformer segmentation model in an end-to-end manner. Extensive experiments show its state-of-the-art performance, notably outperforming the runner-up by 1.31% and 1.98% in the DSC metric on generalized fundus and prostate benchmarks, respectively. Source code is available at https://github.com/BiQiWHU/DFQ.



Paperid:92
Authors:Qi Bi, Shaodi You, Theo Gevers
University of Amsterdam, University of Amsterdam, University of Amsterdam
Abstract:
Domain-generalized urban-scene semantic segmentation (USSS) aims to learn generalized semantic predictions across diverse urban-scene styles. Unlike generic domain gap challenges, USSS is unique in that the semantic categories are often similar in different urban scenes, while the styles can vary significantly due to changes in urban landscapes, weather conditions, lighting, and other factors. Existing approaches typically rely on convolutional neural networks (CNNs) to learn the content of urban scenes. In this paper, we propose a Content-enhanced Mask TransFormer (CMFormer) for domain-generalized USSS. The main idea is to enhance the focus of the fundamental component, the mask attention mechanism, in Transformer segmentation models on content information. We have observed through empirical analysis that a mask representation effectively captures pixel segments, albeit with reduced robustness to style variations. Conversely, its lower-resolution counterpart exhibits greater ability to accommodate style variations, while being less proficient in representing pixel segments. To harness the synergistic attributes of these two approaches, we introduce a novel content-enhanced mask attention mechanism. It learns mask queries from both the image feature and its down-sampled counterpart, aiming to simultaneously encapsulate the content and address stylistic variations. These features are fused into a Transformer decoder and integrated into a multi-resolution content-enhanced mask attention learning scheme. Extensive experiments conducted on various domain-generalized urban-scene segmentation datasets demonstrate that the proposed CMFormer significantly outperforms existing CNN-based methods by up to 14.0% mIoU and the contemporary HGFormer by up to 1.7% mIoU. The source code is publicly available at https://github.com/BiQiWHU/CMFormer.



Paperid:93
Authors:Siyuan Bian, Jiefeng Li, Jiasheng Tang, Cewu Lu
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Department of Computer Science and Engineering, Shanghai Jiao Tong University, DAMO Academy, Alibaba group Hupan Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Abstract:
Accurate human shape recovery from a monocular RGB image is a challenging task because humans come in different shapes and sizes and wear different clothes. In this paper, we propose ShapeBoost, a new human shape recovery framework that achieves pixel-level alignment even for rare body shapes and high accuracy for people wearing different types of clothes. Unlike previous approaches that rely on the use of PCA-based shape coefficients, we adopt a new human shape parameterization that decomposes the human shape into bone lengths and the mean width of each part slice. This part-based parameterization technique achieves a balance between flexibility and validity using a semi-analytical shape reconstruction algorithm. Based on this new parameterization, a clothing-preserving data augmentation module is proposed to generate realistic images with diverse body shapes and accurate annotations. Experimental results show that our method outperforms other state-of-the-art methods in diverse body shape situations as well as in varied clothing situations.
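To make the bone-length-plus-slice-width parameterization concrete, the toy function below rebuilds a single body part as stacked circular cross-sections whose diameters are the per-slice mean widths. The real semi-analytical reconstruction in ShapeBoost is more sophisticated; this is only an illustration under simplified assumptions.

import numpy as np

def part_point_slices(bone_length, slice_widths, points_per_slice=16):
    # Stack circular slices along the bone axis, one per mean-width entry,
    # and return the resulting surface points of the part.
    heights = np.linspace(0.0, bone_length, len(slice_widths))
    theta = np.linspace(0.0, 2 * np.pi, points_per_slice, endpoint=False)
    pts = []
    for h, w in zip(heights, slice_widths):
        r = w / 2.0
        pts.append(np.stack([r * np.cos(theta), r * np.sin(theta),
                             np.full_like(theta, h)], axis=1))
    return np.concatenate(pts, axis=0)            # (n_slices * points_per_slice, 3)

points = part_point_slices(0.30, [0.10, 0.12, 0.11, 0.09])   # toy forearm, metres
print(points.shape)                               # (64, 3)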



Paperid:94
Authors:Yequan Bie, Luyang Luo, Hao Chen
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Department of Computer Science and Engineering, Hong Kong University of Science and Technology Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute
Abstract:
Black-box deep learning approaches have showcased significant potential in the realm of medical image analysis. However, the stringent trustworthiness requirements intrinsic to the medical field have catalyzed research into the utilization of Explainable Artificial Intelligence (XAI), with a particular focus on concept-based methods. Existing concept-based methods predominantly apply concept annotations from a single perspective (e.g., global level), neglecting the nuanced semantic relationships between sub-regions and concepts embedded within medical images. This leads to underutilization of the valuable medical information and may cause models to fall short in harmoniously balancing interpretability and performance when employing inherently interpretable architectures such as Concept Bottlenecks. To mitigate these shortcomings, we propose a multi-modal explainable disease diagnosis framework that meticulously aligns medical images and clinical-related concepts semantically at multiple strata, encompassing the image level, token level, and concept level. Moreover, our method allows for model intervention and offers both textual and visual explanations in terms of human-interpretable concepts. Experimental results on three skin image datasets demonstrate that our method, while preserving model interpretability, attains high performance and label efficiency for concept detection and disease diagnosis. The code is available at https://github.com/Tommy-Bie/MICA.



Paperid:95
Authors:Alexander Black, Jing Shi, Yifei Fan, Tu Bui, John Collomosse
University of Surrey, Adobe Research, Adobe Research, University of Surrey, Adobe Research
Abstract:
We present VIXEN, a technique that succinctly summarizes in text the visual differences between a pair of images in order to highlight any content manipulation present. Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model. We address the challenge of the low volume of training data and lack of manipulation variety in existing image difference captioning (IDC) datasets by training on synthetically manipulated images from the recent InstructPix2Pix dataset, generated via the prompt-to-prompt editing framework. We augment this dataset with change summaries produced via GPT-3. We show that VIXEN produces state-of-the-art, comprehensible difference captions for diverse image contents and edit types, offering a potential mitigation against misinformation disseminated via manipulated image content. Code and data are available at http://github.com/alexblck/vixen
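The pairwise linear mapping into a soft prompt can be sketched in a few lines: features of the two images are concatenated and projected to a fixed number of pseudo-token embeddings in the language model's input space. The dimensions, token count, and single linear layer below are illustrative assumptions, not VIXEN's actual configuration.

import torch
import torch.nn as nn

class PairwiseSoftPrompt(nn.Module):
    # Map a pair of image embeddings to n_tokens pseudo-token embeddings that
    # can be prepended to the input of a frozen language model.
    def __init__(self, img_dim=768, lm_dim=2048, n_tokens=8):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Linear(2 * img_dim, n_tokens * lm_dim)

    def forward(self, feat_a, feat_b):             # each: (B, img_dim)
        prompt = self.proj(torch.cat([feat_a, feat_b], dim=-1))
        return prompt.view(feat_a.shape[0], self.n_tokens, -1)

soft = PairwiseSoftPrompt()(torch.randn(2, 768), torch.randn(2, 768))
print(soft.shape)                                  # torch.Size([2, 8, 2048])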



Paperid:96
Authors:Qingwen Bu, Sungrae Park, Minsoo Khang, Yichuan Cheng
Shanghai Jiao Tong University Shanghai AI Laboratory, Upstage AI, Upstage AI, City University of Hong Kong
Abstract:
Existing techniques for text detection can be broadly classified into two primary groups: segmentation-based and regression-based methods. Segmentation models offer enhanced robustness to font variations but require intricate post-processing, leading to high computational overhead. Regression-based methods undertake instance-aware prediction but face limitations in robustness and data efficiency due to their reliance on high-level representations. In our academic pursuit, we propose SRFormer, a unified DETR-based model with amalgamated Segmentation and Regression, aiming at the synergistic harnessing of the inherent robustness in segmentation representations, along with the straightforward post-processing of instance-level regression. Our empirical analysis indicates that favorable segmentation predictions can be obtained at the initial decoder layers. In light of this, we constrain the incorporation of segmentation branches to the first few decoder layers and employ progressive regression refinement in subsequent layers, achieving performance gains while minimizing computational load from the mask. Furthermore, we propose a Mask-informed Query Enhancement module. We take the segmentation result as a natural soft-ROI to pool and extract robust pixel representations, which are then employed to enhance and diversify instance queries. Extensive experimentation across multiple benchmarks has yielded compelling findings, highlighting our method's exceptional robustness, superior training and data efficiency, as well as its state-of-the-art performance. Our code is available at https://github.com/retsuh-bqw/SRFormer-Text-Det.



Paperid:97
Authors:Pingping Cai, Deja Scott, Xiaoguang Li, Song Wang
University of South Carolina, University of South Carolina, University of South Carolina, University of South Carolina
Abstract:
Point cloud shape completion, which aims to reconstruct the missing regions of the incomplete point clouds with plausible shapes, is an ill-posed and challenging task that benefits many downstream 3D applications. Prior approaches achieve this goal by employing a two-stage completion framework, generating a coarse yet complete seed point cloud through an encoder-decoder network, followed by refinement and upsampling. However, the encoded features suffer from information loss of the missing portion, leading to an inability of the decoder to reconstruct seed points with detailed geometric clues. To tackle this issue, we propose a novel Orthogonal Dictionary Guided Shape Completion Network (ODGNet). The proposed ODGNet consists of a Seed Generation U-Net, which leverages multi-level feature extraction and concatenation to significantly enhance the representation capability of seed points, and Orthogonal Dictionaries that can learn shape priors from training samples and thus compensate for the information loss of the missing portions during inference. Our design is simple but to the point; extensive experimental results indicate that the proposed method can reconstruct point clouds with more details and outperform previous state-of-the-art counterparts. The implementation code is available at https://github.com/corecai163/ODGNet.



Paperid:98
Authors:Qing Cai, Mu Li, Dongwei Ren, Jun Lyu, Haiyong Zheng, Junyu Dong, Yee-Hong Yang
Ocean University of China, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, The Hong Kong Polytechnic University, Ocean University of China, Ocean University of China, University of Alberta
Abstract:
Omnidirectional images have attracted significant attention in recent years due to the rapid development of virtual reality technologies. Equirectangular projection (ERP), a naive form to store and transfer omnidirectional images, however, is challenging for existing two-dimensional (2D) image super-resolution (SR) methods due to its inhomogeneously distributed sampling density and distortion across latitude. In this paper, we make one of the first attempts to design a spherical pseudo-cylindrical representation, which not only allows pixels at different latitudes to adaptively adopt the best distinct sampling density but also is model-agnostic to most off-the-shelf SR methods, enhancing their performances. Specifically, we start by upsampling each latitude of the input ERP image and design a computationally tractable optimization algorithm to adaptively obtain a (sub)-optimal sampling density for each latitude of the ERP image. Addressing the distortion of ERP, we introduce a new viewport-based training loss based on the original 3D sphere format of the omnidirectional image, which inherently lacks distortion. Finally, we present a simple yet effective recursive progressive omnidirectional SR network to showcase the feasibility of our idea. The experimental results on public datasets demonstrate the effectiveness of the proposed method as well as the consistently superior performance of our method over most state-of-the-art methods both quantitatively and qualitatively.



Paperid:99
Authors:Qingyuan Cai, Xuecai Hu, Saihui Hou, Li Yao, Yongzhen Huang
Beijing Normal University, Beijing Normal University, Beijing Normal University, Beijing Normal University, School of Artificial Intelligence, Beijing Normal University
Abstract:
Recently, diffusion-based methods for monocular 3D human pose estimation have achieved state-of-the-art (SOTA) performance by directly regressing the 3D joint coordinates from the 2D pose sequence. Although some methods decompose the task into bone length and bone direction prediction based on the human anatomical skeleton to explicitly incorporate more human body prior constraints, the performance of these methods is significantly lower than that of the SOTA diffusion-based methods. This can be attributed to the tree structure of the human skeleton. Direct application of the disentangled method could amplify the accumulation of hierarchical errors, propagating through each hierarchy. Meanwhile, the hierarchical information has not been fully explored by the previous methods. To address these problems, a Disentangled Diffusion-based 3D human Pose Estimation method with Hierarchical Spatial and Temporal Denoiser is proposed, termed DDHPose. In our approach: (1) We disentangle the 3D pose and diffuse the bone length and bone direction during the forward process of the diffusion model to effectively model the human pose prior. A disentanglement loss is proposed to supervise diffusion model learning. (2) For the reverse process, we propose Hierarchical Spatial and Temporal Denoiser (HSTDenoiser) to improve the hierarchical modeling of each joint. Our HSTDenoiser comprises two components: the Hierarchical-Related Spatial Transformer (HRST) and the Hierarchical-Related Temporal Transformer (HRTT). HRST exploits joint spatial information and the influence of the parent joint on each joint for spatial modeling, while HRTT utilizes information from both the joint and its hierarchical adjacent joints to explore the hierarchical temporal correlations among joints. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets show that our method outperforms the SOTA disentangled-based, non-disentangled based, and probabilistic approaches by 10.0%, 2.0%, and 1.3%, respectively.
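As a concrete reference for the disentanglement step described above, the sketch below (our own illustration, not the authors' code; the joint count and PARENT kinematic tree are hypothetical) splits a 3D pose into bone lengths and unit bone directions and reassembles the joints from the root:

    # Hedged sketch: pose disentanglement into bone lengths/directions over an assumed skeleton.
    import numpy as np

    PARENT = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 7, 10, 11, 7, 13, 14]  # hypothetical kinematic tree

    def disentangle(joints):                      # joints: (J, 3) array of 3D joint positions
        lengths, directions = [], []
        for j, p in enumerate(PARENT):
            if p < 0:                             # root joint has no incoming bone
                lengths.append(0.0); directions.append(np.zeros(3)); continue
            bone = joints[j] - joints[p]
            norm = np.linalg.norm(bone) + 1e-8
            lengths.append(norm)
            directions.append(bone / norm)
        return np.array(lengths), np.array(directions)

    def reassemble(root, lengths, directions):    # inverse mapping, walking down the tree
        joints = np.zeros((len(PARENT), 3))
        joints[0] = root
        for j, p in enumerate(PARENT):
            if p >= 0:
                joints[j] = joints[p] + lengths[j] * directions[j]
        return joints

In the paper's setting it is this (length, direction) pair, rather than the raw joint coordinates, that is noised and denoised by the diffusion model.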



Paperid:100
Authors:Xiuding Cai, Yaoyao Zhu, Dong Miao, Linjie Fu, Yu Yao
Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, China University of Chinese Academy of Sciences, Beijing, China, Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, China University of Chinese Academy of Sciences, Beijing, China, Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, China University of Chinese Academy of Sciences, Beijing, China, Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, China University of Chinese Academy of Sciences, Beijing, China, Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, China University of Chinese Academy of Sciences, Beijing, China
Abstract:
In an unpaired setting, lacking sufficient content constraints for image-to-image translation (I2I) tasks, GAN-based approaches are usually prone to model collapse. Current solutions can be divided into two categories: reconstruction-based and Siamese network-based. The former requires that the transformed or transforming image can be perfectly converted back to the original image, which is sometimes too strict and limits the generative performance. The latter involves feeding the original and generated images into a feature extractor and then matching their outputs. This is not efficient enough, and a universal feature extractor is not easily available. In this paper, we propose EnCo, a simple but efficient way to maintain the content by constraining the representational similarity in the latent space of patch-level features from the same stage of the encoder and decoder of the generator. For the similarity function, we use a simple MSE loss instead of contrastive loss, which is currently widely used in I2I tasks. Benefiting from this design, EnCo training is extremely efficient, and the features from the encoder have a more positive effect on decoding, leading to more satisfying generations. In addition, we rethink the role played by discriminators in sampling patches and propose a discriminative attention-guided (DAG) patch sampling strategy to replace random sampling. DAG is parameter-free and only requires negligible computational overhead, while significantly improving the performance of the model. Extensive experiments on multiple datasets demonstrate the effectiveness and advantages of EnCo, and we achieve multiple state-of-the-art results compared to previous methods.
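To make the content constraint concrete, here is a minimal sketch (an assumption-laden illustration, not the released EnCo code) of an MSE loss between patch-level features drawn from matching stages of the generator's encoder and decoder; random patch sampling stands in for the paper's discriminative attention-guided (DAG) sampling:

    # Hedged sketch: patch-level encoder/decoder feature matching with a plain MSE loss.
    import torch
    import torch.nn.functional as F

    def enco_content_loss(enc_feat, dec_feat, num_patches=256):
        # enc_feat, dec_feat: (B, C, H, W) features from the same generator stage
        b, c, h, w = enc_feat.shape
        enc = enc_feat.flatten(2).permute(0, 2, 1)   # (B, H*W, C)
        dec = dec_feat.flatten(2).permute(0, 2, 1)
        idx = torch.randint(0, h * w, (num_patches,), device=enc_feat.device)
        return F.mse_loss(enc[:, idx], dec[:, idx])  # simple MSE, no negative pairs needed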



Paperid:101
Authors:Yanlu Cai, Weizhong Zhang, Yuan Wu, Cheng Jin
Fudan University, Shanghai, China, Fudan University, Shanghai, China Innovation Center of Calligraphy and Painting Creation Technology, MCT, China, Fudan University, Shanghai, China, Fudan University, Shanghai, China Innovation Center of Calligraphy and Painting Creation Technology, MCT, China
Abstract:
Depth uncertainty is a core challenge in 3D human pose estimation, especially when the camera parameters are unknown. Previous methods try to reduce the impact of depth uncertainty by multi-view and/or multi-frame feature fusion to utilize more spatial and temporal information. However, they generally lead to marginal improvements and their performance still cannot match the camera-parameter-required methods. The reason is that their handcrafted fusion schemes cannot fuse the features flexibly, e.g., the multi-view and/or multi-frame features are fused separately. Moreover, the diverse and complicated fusion schemes make the principle for developing effective fusion schemes unclear and also raise an open problem: whether simpler and more elegant fusion schemes exist. To address these issues, this paper proposes an extremely concise unified feature fusion transformer (FusionFormer) with minimized handcrafted design for 3D pose estimation. FusionFormer fuses both the multi-view and multi-frame features in a unified fusion scheme, in which all the features are accessible to each other and thus can be fused flexibly. Experimental results on several mainstream datasets demonstrate that FusionFormer achieves state-of-the-art performance. To the best of our knowledge, this is the first camera-parameter-free method to outperform the existing camera-parameter-required methods, revealing the tremendous potential of camera-parameter-free models. These impressive experimental results together with our concise feature fusion scheme resolve the above open problem. Another appealing feature of FusionFormer we observe is that, benefiting from its effective fusion scheme, we can achieve impressive performance with a smaller model size and fewer FLOPs.



Paperid:102
Authors:Yufei Cai, Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hu Han, Wangmeng Zuo
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Harbin Institute of Technology, Harbin Institute of Technology, Tomorrow Advancing Life, Tomorrow Advancing Life, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Harbin Institute of Technology Pengcheng Lab
Abstract:
Customized text-to-image generation, which aims to learn user-specified concepts with a few images, has drawn significant attention recently. However, existing methods usually suffer from overfitting issues and entangle the subject-unrelated information (e.g., background and pose) with the learned concept, limiting the potential to compose the concept into new scenes. To address these issues, we propose DETEX, a novel approach that learns the disentangled concept embedding for flexible customized text-to-image generation. Unlike conventional methods that learn a single concept embedding from the given images, our DETEX represents each image using multiple word embeddings during training, i.e., a learnable image-shared subject embedding and several image-specific subject-unrelated embeddings. To decouple irrelevant attributes (i.e., background and pose) from the subject embedding, we further present several attribute mappers that encode each image as several image-specific subject-unrelated embeddings. To encourage these unrelated embeddings to capture the irrelevant information, we incorporate them with corresponding attribute words and propose a joint training strategy to facilitate the disentanglement. During inference, we only use the subject embedding for image generation, while selectively using image-specific embeddings to retain image-specified attributes. Extensive experiments demonstrate that the subject embedding obtained by our method can faithfully represent the target concept, while showing superior editability compared to the state-of-the-art methods. Our code will be available at https://github.com/PrototypeNx/DETEX.



Paperid:103
Authors:Zikui Cai, Zhongpai Gao, Benjamin Planche, Meng Zheng, Terrence Chen, M. Salman Asif, Ziyan Wu
University of California, Riverside, CA, United Imaging Intelligence, Burlington, MA, United Imaging Intelligence, Burlington, MA, Rensselaer Polytechnic Institute, Troy, NY, United Imaging Intelligence, Burlington, MA, University of California, Riverside, CA, United Imaging Intelligence, Burlington, MA
Abstract:
With the rise of cameras and smart sensors, humanity generates an exponential amount of data. This valuable information, including underrepresented cases like AI in medical settings, can fuel new deep-learning tools. However, data scientists must prioritize ensuring privacy for individuals in these untapped datasets, especially for images or videos with faces, which are prime targets for identification methods. Proposed solutions to de-identify such images often compromise non-identifying facial attributes relevant to downstream tasks. In this paper, we introduce Disguise, a novel algorithm that seamlessly de-identifies facial images while ensuring the usability of the modified data. Unlike previous approaches, our solution is firmly grounded in the domains of differential privacy and ensemble-learning research. Our method involves extracting and substituting depicted identities with synthetic ones, generated using variational mechanisms to maximize obfuscation and non-invertibility. Additionally, we leverage supervision from a mixture-of-experts to disentangle and preserve other utility attributes. We extensively evaluate our method using multiple datasets, demonstrating a higher de-identification rate and superior consistency compared to prior approaches in various downstream tasks.



Paperid:104
Authors:Bing Cao, Junliang Guo, Pengfei Zhu, Qinghua Hu
Tianjin University, Tianjin University, Tianjin University, Tianjin University
Abstract:
Due to the rapid development of computer vision, single-modal (RGB) object tracking has made significant progress in recent years. Considering the limitations of a single imaging sensor, multi-modal images (RGB, infrared, etc.) are introduced to compensate for this deficiency for all-weather object tracking in complex environments. However, as acquiring sufficient multi-modal tracking data is hard while the dominant modality changes with the open environment, most existing techniques fail to extract multi-modal complementary information dynamically, yielding unsatisfactory tracking performance. To handle this problem, we propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter, cross-prompting multiple modalities mutually. Our model consists of a universal bi-directional adapter and multiple modality-specific transformer encoder branches with shared parameters. The encoders extract features of each modality separately by using a frozen, pre-trained foundation model. We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another, performing visual feature prompt fusion in an adaptive manner. By adding only a few (0.32M) trainable parameters, our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods. Our code is available: https://github.com/SparkTempest/BAT.
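The "light feature adapter" can be pictured as a small bottleneck module attached to a frozen backbone; the sketch below is our own guess at such a module (dimensions and placement are assumptions, not the released BAT code):

    # Hedged sketch: a light bottleneck adapter used bi-directionally between modality branches.
    import torch.nn as nn

    class LightAdapter(nn.Module):
        def __init__(self, dim=768, bottleneck=16):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)   # very few trainable parameters
            self.act = nn.GELU()
            self.up = nn.Linear(bottleneck, dim)

        def forward(self, tokens):
            return self.up(self.act(self.down(tokens)))

    # Inside each frozen encoder block, each branch could be prompted with the other's
    # adapted features (illustrative usage, not the paper's exact wiring):
    #   rgb_tokens = rgb_tokens + adapter_ir_to_rgb(ir_tokens)
    #   ir_tokens  = ir_tokens  + adapter_rgb_to_ir(rgb_tokens)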



Paperid:105
Authors:Qinglong Cao, Zhengqin Xu, Yuntian Chen, Chao Ma, Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Abstract:
Large pretrained vision-language models, such as CLIP, have shown remarkable generalization capabilities across various tasks when appropriate text prompts are provided. However, adapting these models to specific domains, like remote sensing images (RSIs), medical images, etc., remains unexplored and challenging. Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms, leading to suboptimal performance due to the misinterpretation of specific images in natural image patterns. To tackle this dilemma, we propose Domain-Controlled Prompt Learning for specific domains. Specifically, the large-scale specific domain foundation model (LSDM) is first introduced to provide essential specific domain knowledge. Using lightweight neural networks, we transfer this knowledge into domain biases, which control both the visual and language branches to obtain domain-adaptive prompts through direct incorporation. Simultaneously, to overcome the existing overfitting challenge, we propose a novel noise-adding strategy, without extra trainable parameters, to help the model escape the suboptimal solution in a global domain oscillation manner. Experimental results show our method achieves state-of-the-art performance on specific domain image recognition datasets. Our code is available at https://github.com/caoql98/DCPL.



Paperid:106
Authors:Yuxin Cao, Ziyu Zhao, Xi Xiao, Derui Wang, Minhui Xue, Jin Lu
Shenzhen International Graduate School, Tsinghua University, China, Fan Gongxiu Honors College, Beijing University of Technology, China, Shenzhen International Graduate School, Tsinghua University, China, CSIRO's Data61, Australia, CSIRO's Data61, Australia, Ping An Technology (Shenzhen) Co., Ltd., China
Abstract:
Video recognition systems are vulnerable to adversarial examples. Recent studies show that style transfer-based and patch-based unrestricted perturbations can effectively improve attack efficiency. These attacks, however, face two main challenges: 1) Adding large stylized perturbations to all pixels reduces the naturalness of the video and such perturbations can be easily detected. 2) Patch-based video attacks are not extensible to targeted attacks due to the limited search space of reinforcement learning that has been widely used in video attacks recently. In this paper, we focus on the video black-box setting and propose a novel attack framework named LogoStyleFool by adding a stylized logo to the clean video. We separate the attack into three stages: style reference selection, reinforcement-learning-based logo style transfer, and perturbation optimization. We solve the first challenge by scaling down the perturbation range to a regional logo, while the second challenge is addressed by complementing an optimization stage after reinforcement learning. Experimental results substantiate the overall superiority of LogoStyleFool over three state-of-the-art patch-based attacks in terms of attack performance and semantic preservation. Meanwhile, LogoStyleFool still maintains its performance against two existing patch-based defense methods. We believe that our research is beneficial in increasing the attention of the security community to such subregional style transfer attacks.



Paperid:107
Authors:Junghun Cha, Ali Haider, Seoyun Yang, Hoeyeong Jin, Subin Yang, A. F. M. Shahab Uddin, Jaehyoung Kim, Soo Ye Kim, Sung-Ho Bae
Kyung Hee University, Kyung Hee University, Kyung Hee University, Kyung Hee University, Kyung Hee University, Jashore University of Science and Technology, Kyung Hee University, Adobe Research, Kyung Hee University
Abstract:
A significant volume of analog information, i.e., documents and images, has been digitized in the form of scanned copies for storing, sharing, and/or analyzing in the digital world. However, the quality of such contents is severely degraded by various distortions caused by printing, storing, and scanning processes in the physical world. Although restoring high-quality content from scanned copies has become an indispensable task for many products, it has not been systematically explored, and to the best of our knowledge, no public datasets are available. In this paper, we define this problem as Descanning and introduce a new high-quality and large-scale dataset named DESCAN-18K. It contains 18K pairs of original and scanned images collected in the wild containing multiple complex degradations. In order to eliminate such complex degradations, we propose a new image restoration model called DescanDiffusion consisting of a color encoder that corrects the global color degradation and a conditional denoising diffusion probabilistic model (DDPM) that removes local degradations. To further improve the generalization ability of DescanDiffusion, we also design a synthetic data generation scheme by reproducing prominent degradations in scanned images. We demonstrate that our DescanDiffusion outperforms other baselines including commercial restoration products, objectively and subjectively, via comprehensive experiments and analyses.



Paperid:108
Authors:Kennard Yanting Chan, Fayao Liu, Guosheng Lin, Chuan Sheng Foo, Weisi Lin
Nanyang Technological University, Singapore Institute for Infocomm Research, A*STAR, Singapore, Institute for Infocomm Research, A*STAR, Singapore, Nanyang Technological University, Singapore, Institute for Infocomm Research, A*STAR, Singapore Centre for Frontier AI Research, A*STAR, Singapore, Nanyang Technological University, Singapore
Abstract:
Pixel-aligned implicit models, such as PIFu, PIFuHD, and ICON, are used for single-view clothed human reconstruction. These models need to be trained using a sampling training scheme. Existing sampling training schemes either fail to capture thin surfaces (e.g. ears, fingers) or cause noisy artefacts in reconstructed meshes. To address these problems, we introduce Fine Structured-Aware Sampling (FSS), a new sampling training scheme to train pixel-aligned implicit models for single-view human reconstruction. FSS resolves the aforementioned problems by proactively adapting to the thickness and complexity of surfaces. In addition, unlike existing sampling training schemes, FSS shows how normals of sample points can be capitalized in the training process to improve results. Lastly, to further improve the training process, FSS proposes a mesh thickness loss signal for pixel-aligned implicit models. It becomes computationally feasible to introduce this loss once a slight reworking of the pixel-aligned implicit function framework is carried out. Our results show that our methods significantly outperform SOTA methods qualitatively and quantitatively. Our code is publicly available at https://github.com/kcyt/FSS.



Paperid:109
Authors:Gyusam Chang, Wonseok Roh, Sujin Jang, Dongwook Lee, Daehyun Ji, Gyeongrok Oh, Jinsun Park, Jinkyu Kim, Sangpil Kim
Department of Artificial Intelligence, Korea University, Department of Artificial Intelligence, Korea University, Samsung Advanced Institute of Technology (SAIT), Samsung Advanced Institute of Technology (SAIT), Samsung Advanced Institute of Technology (SAIT), Department of Artificial Intelligence, Korea University, School of Computer Science and Engineering, Pusan National University, Department of Computer Science and Engineering, Korea University, Department of Artificial Intelligence, Korea University
Abstract:
Recent LiDAR-based 3D Object Detection (3DOD) methods show promising results, but they often do not generalize well to target domains outside the source (or training) data distribution. To reduce such domain gaps and thus to make 3DOD models more generalizable, we introduce a novel unsupervised domain adaptation (UDA) method, called CMDA, which (i) leverages visual semantic cues from an image modality (i.e., camera images) as an effective semantic bridge to close the domain gap in the cross-modal Bird's Eye View (BEV) representations. Further, (ii) we also introduce a self-training-based learning strategy, wherein a model is adversarially trained to generate domain-invariant features, which disrupt the discrimination of whether a feature instance comes from a source or an unseen target domain. Overall, our CMDA framework guides the 3DOD model to generate highly informative and domain-adaptive features for novel data distributions. In our extensive experiments with large-scale benchmarks, such as nuScenes, Waymo, and KITTI, the components described above provide significant performance gains for UDA tasks, achieving state-of-the-art performance.



Paperid:110
Authors:Qing Chang, Yifei Tong
Nanjing University of Science and Technology, Nanjing University of Science and Technology
Abstract:
Lane detection is a critical task in autonomous driving, which requires accurately predicting the complex topology of lanes in various scenarios. While previous methods of lane detection have shown success, challenges still exist, especially in scenarios where lane markings are absent. In this paper, we analyze the role of global and local features in accurately detecting lanes and propose a Hybrid Global-Local Perception Network (HGLNet) to leverage them. Global and local features play distinct roles in lane detection by respectively aiding in the detection of lane instances and the localization of corresponding lanes. HGLNet extracts global semantic context by utilizing a global extraction head that aggregates information about adaptive sampling points around lanes, achieving an optimal trade-off between performance and efficiency. Moreover, we introduce a Multi-hierarchy feature aggregator (MFA) to capture feature hierarchies in both regional and local ranges, elevating the representation of local features. The proposed hybrid architecture can simultaneously focus on global and local features at different depth levels and efficiently integrate them to sense the global presence of lanes and accurately regress their locations. Experimental results demonstrate that our proposed method improves detection accuracy in various challenging scenarios, outperforming the state-of-the-art lane detection methods.



Paperid:111
Authors:Bo-Yu Chen, Wei-Chen Chiu, Yu-Lun Liu
National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University
Abstract:
In this paper, we propose an algorithm that allows joint refinement of camera pose and scene geometry represented by decomposed low-rank tensor, using only 2D images as supervision. First, we conduct a pilot study based on a 1D signal and relate our findings to 3D scenarios, where the naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. Moreover, based on the analysis of the frequency spectrum, we propose to apply convolutional Gaussian filters on 2D and 3D radiance fields for a coarse-to-fine training schedule that enables joint camera pose optimization. Leveraging the decomposition property in decomposed low-rank tensor, our method achieves an equivalent effect to brute-force 3D convolution with only incurring little computational overhead. To further improve the robustness and stability of joint optimization, we also propose techniques of smoothed 2D supervision, randomly scaled kernel parameters, and edge-guided loss mask. Extensive quantitative and qualitative evaluations demonstrate that our proposed framework achieves superior performance in novel view synthesis as well as rapid convergence for optimization. The source code is available at https://github.com/Nemo1999/Joint-TensoRF.
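One way to picture the coarse-to-fine schedule is a Gaussian blur on the 2D supervision whose width shrinks as joint optimization proceeds (the filtering of the 3D radiance field is analogous). The sketch below is a loose illustration under an assumed linear schedule and kernel sizing, not the authors' implementation:

    # Hedged sketch: a shrinking Gaussian filter applied to 2D supervision during joint optimization.
    import torchvision.transforms.functional as TF

    def blur_sigma(step, total_steps, sigma_max=4.0, sigma_min=0.1):
        t = min(step / total_steps, 1.0)
        return sigma_max * (1.0 - t) + sigma_min * t        # linear coarse-to-fine decay

    def filtered_target(image, step, total_steps):
        # image: (3, H, W) tensor holding the ground-truth view used as 2D supervision
        sigma = blur_sigma(step, total_steps)
        k = int(2 * round(3 * sigma) + 1)                   # odd kernel covering ~3 sigma
        return TF.gaussian_blur(image, kernel_size=k, sigma=sigma)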



Paperid:112
Authors:Chaofeng Chen, Shangchen Zhou, Liang Liao, Haoning Wu, Wenxiu Sun, Qiong Yan, Weisi Lin
S-Lab, Nanyang Technological University, S-Lab, Nanyang Technological University, S-Lab, Nanyang Technological University, S-Lab, Nanyang Technological University, SenseTime Research, SenseTime Research, S-Lab, Nanyang Technological University
Abstract:
Real-world image super-resolution (RWSR) is a long-standing problem as low-quality (LQ) images often have complex and unidentified degradations. Existing methods such as Generative Adversarial Networks (GANs) or continuous diffusion models present their own issues including GANs being difficult to train while continuous diffusion models requiring numerous inference steps. In this paper, we propose an Iterative Token Evaluation and Refinement (ITER) framework for RWSR, which utilizes a discrete diffusion model operating in the discrete token representation space, i.e., indexes of features extracted from a VQGAN codebook pre-trained with high-quality (HQ) images. We show that ITER is easier to train than GANs and more efficient than continuous diffusion models. Specifically, we divide RWSR into two sub-tasks, i.e., distortion removal and texture generation. Distortion removal involves simple HQ token prediction with LQ images, while texture generation uses a discrete diffusion model to iteratively refine the distortion removal output with a token refinement network. In particular, we propose to include a token evaluation network in the discrete diffusion process. It learns to evaluate which tokens are good restorations and helps to improve the iterative refinement results. Moreover, the evaluation network can first check the status of the distortion removal output and then adaptively select the total refinement steps needed, thereby maintaining a good balance between distortion removal and texture generation. Extensive experimental results show that ITER is easy to train and performs well within just 8 iterative steps.
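A rough picture of the iterative refinement loop follows (our paraphrase under assumed interfaces, not the ITER release): `refiner` stands for the token refinement network mapping tokens to logits over the VQGAN codebook, and `evaluator` for the token evaluation network scoring each token's quality.

    # Hedged sketch: iterative token evaluation and refinement over discrete VQGAN indices.
    import torch

    def iter_refine(tokens, refiner, evaluator, steps=8, keep_ratio0=0.5):
        # tokens: (B, N) long tensor of initial token predictions from distortion removal
        for s in range(steps):
            logits = refiner(tokens)                        # (B, N, vocab)
            proposal = logits.argmax(dim=-1)                # refined token guess
            score = evaluator(proposal)                     # (B, N), higher = better restoration
            keep_ratio = keep_ratio0 + (1 - keep_ratio0) * s / max(steps - 1, 1)
            k = int(keep_ratio * tokens.shape[1])
            keep_idx = score.topk(k, dim=1).indices         # trust the best-scored tokens
            mask = torch.zeros_like(tokens, dtype=torch.bool).scatter_(1, keep_idx, True)
            tokens = torch.where(mask, proposal, tokens)    # the rest is refined again next step
        return tokens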



Paperid:113
Authors:Dalong Chen, Jianjia Zhang, Wei-Shi Zheng, Ruixuan Wang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China, School of Biomedical Engineering, Shenzhen Campus of Sun Yat-sen University, Guangdong, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China Peng Cheng Laboratory, Shenzhen, China
Abstract:
Few-shot learning is a challenging task due to the limited availability of training samples. Recent few-shot learning studies with meta-learning and simple transfer learning methods have achieved promising performance. However, the feature extractor pre-trained with the upstream dataset may neglect the extraction of certain features which could be crucial for downstream tasks. In this study, inspired by the process of human learning in few-shot tasks, where humans not only observe the whole image (`global view') but also attend to various local image regions (`local view') for comprehensive understanding of detailed features, we propose a simple yet effective few-shot learning method called FeatWalk, which can utilize the complementary nature of global and local views, therefore providing an intuitive and effective solution to the problem of insufficient local information extraction from the pre-trained feature extractor. Our method can be easily and flexibly combined with various existing methods, further enhancing few-shot learning performance. Extensive experiments on multiple benchmark datasets consistently demonstrate the effectiveness and versatility of our method. The source code is available at https://github.com/exceefind/FeatWalk.



Paperid:114
Authors:Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu
Meituan, Meituan, Meituan, State Key Laboratory of Computer Science, ISCAS University of Chinese Academy of Sciences University of Macau
Abstract:
Despite significant progress in utilizing pretrained text-to-image diffusion models to guide the creation of 3D scenes, these methods often struggle to generate scenes that are sufficiently realistic, leading to "neural scene degeneration". In this work, we propose a new 3D scene generation model called Real3D. Specifically, Real3D designs a pipeline from a NeRF-like implicit renderer to a tetrahedrons-based explicit renderer, greatly improving the neural network's ability to generate various neural scenes. Moreover, Real3D introduces an additional discriminator to prevent neural scenes from falling into undesirable local optima, thus avoiding the degeneration phenomenon. Our experimental results demonstrate that Real3D outperforms all existing state-of-the-art text-to-3D generation methods, providing valuable insights to facilitate the development of learning-based 3D scene generation approaches.



Paperid:115
Authors:Honghao Chen, Xiangwen Kong, Xiangyu Zhang, Xin Zhao, Kaiqi Huang
CRISE, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, MEGVII Technology, MEGVII Technology, CRISE, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, CRISE, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences CAS Center for Excellence in Brain Science and Intelligence Technology
Abstract:
Recently, masked image modeling (MIM) has demonstrated promising prospects in self-supervised representation learning. However, existing MIM frameworks recover all masked patches equivalently, ignoring that the reconstruction difficulty of different patches can vary sharply due to their varying distances from visible patches. In this paper, we propose a novel deep dynamic supervision to enable MIM methods to dynamically reconstruct patches with different degrees of difficulty at different pretraining phases and depths of the model. Our deep dynamic supervision helps to provide more locality inductive bias for ViTs especially in deep layers, which inherently makes up for the absence of a local prior in the self-attention mechanism. Built upon the deep dynamic supervision, we propose Deep Dynamic AutoEncoder (DDAE), a simple yet effective MIM framework that utilizes dynamic mechanisms for pixel regression and feature self-distillation simultaneously. Extensive experiments across a variety of vision tasks including ImageNet classification, semantic segmentation on ADE20K and object detection on COCO demonstrate the effectiveness of our approach.



Paperid:116
Authors:Hongming Chen, Xiang Chen, Jiyang Lu, Yufeng Li
Shenyang Aerospace University, Nanjing University of Science and Technology, Shenyang Aerospace University, Shenyang Aerospace University
Abstract:
Existing Transformer-based image deraining methods depend mostly on a fixed single-input single-output U-Net architecture. In fact, this not only neglects the potentially explicit information from multiple image scales, but also lacks the capability of exploring the complementary implicit information across different scales. In this work, we rethink the multi-scale representations and design an effective multi-input multi-output framework that constructs intra- and inter-scale hierarchical modulation to better facilitate rain removal and help image restoration. We observe that rain levels reduce dramatically in coarser image scales, thus proposing to restore rain-free results from the coarsest scale to the finest scale in image pyramid inputs, which also alleviates the difficulty of model learning. Specifically, we integrate a sparsity-compensated Transformer block and a frequency-enhanced convolutional block into a coupled representation module, in order to jointly learn the intra-scale content-aware features. To facilitate representations learned at different scales to communicate with each other, we leverage a gated fusion module to adaptively aggregate the inter-scale spatial-aware features, which are rich in correlated information of rain appearances, leading to high-quality results. Extensive experiments demonstrate that our model achieves consistent gains on five benchmarks.



Paperid:117
Authors:Hongyang Chen, Hung-Shuo Tai, Kaisheng Ma
Xi’an Jiaotong University, KargoBot.ai, Tsinghua University
Abstract:
Consumer-grade cameras capture the RAW physical description of a scene and then process the image signals to obtain high-quality RGB images that are faithful to human visual perception. Conventionally, dense prediction scenes require high-precision recognition of objects in RGB images. However, predicting RGB data to exhibit the expected adaptability and robustness in harsh environments can be challenging. By capitalizing on the broader color gamut and higher bit depth offered by RAW data, in this paper, we demonstrate that RAW data can significantly improve the accuracy and robustness of object detectors in harsh environments. Firstly, we propose a general Pipeline for RAW Detection (PRD), along with a preprocessing strategy tailored to RAW data. Secondly, we design the RAW Corruption Benchmark (RCB) to address the dearth of benchmarks that reflect realistic scenarios in harsh environments. Thirdly, we demonstrate the significant improvement of RAW images in object detection for low-light and corrupt scenes. Specifically, our experiments indicate that PRD (using FCOS) outperforms RGB detection by 13.9 mAP on LOD-Snow without generating restored images. Finally, we introduce a new nonlinear method called Functional Regularization (FR), which can effectively mine the unique characteristics of RAW data. The code is available at https://github.com/DreamerCCC/RawMining.



Paperid:118
Authors:Hongyang Chen, Kaisheng Ma
Xi’an Jiaotong University, Tsinghua University
Abstract:
Low-level vision plays a crucial role in a wide range of imaging quality and image recognition applications. However, the limited size, quality, and diversity of datasets often pose significant challenges for low-level tasks. Data augmentation is the most effective and practical way of sample expansion, but augmentation methods commonly used in high-level tasks offer limited improvement in low-level tasks due to boundary effects or non-realistic context information. In this paper, we propose the Cut-and-Swap Frequency Components (CutFreq) method for low-level vision, which aims to preserve high-level representations with directionality and improve image synthesis quality. Observing the significant frequency domain differences between reconstructed images and real ones, in CutFreq, we propose to transform the input and real images separately in the frequency domain, then define two stages for the model training process, and finally swap the specified frequency bands respectively and inversely transform to generate augmented samples. The experimental results show the superior performance of CutFreq on five low-level vision tasks. Moreover, we demonstrate the effectiveness of CutFreq in the low-data regime. Code is available at https://github.com/DreamerCCC/CutFreq.
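The cut-and-swap idea can be illustrated with a plain FFT-based band exchange; the snippet below is a simplified single-channel sketch with assumed band limits, not the released CutFreq code, which additionally ties the swapped bands to the two training stages:

    # Hedged sketch: exchange a radial frequency band between two images.
    import numpy as np

    def swap_band(img_a, img_b, r_lo=0.1, r_hi=0.3):
        # img_a, img_b: float arrays of shape (H, W); apply per channel for RGB
        Fa = np.fft.fftshift(np.fft.fft2(img_a))
        Fb = np.fft.fftshift(np.fft.fft2(img_b))
        h, w = img_a.shape
        yy, xx = np.ogrid[:h, :w]
        r = np.hypot(yy - h / 2, xx - w / 2) / (0.5 * np.hypot(h, w))  # normalized radius
        band = (r >= r_lo) & (r < r_hi)
        Fa_new = np.where(band, Fb, Fa)                     # take the band from the other image
        Fb_new = np.where(band, Fa, Fb)
        out_a = np.real(np.fft.ifft2(np.fft.ifftshift(Fa_new)))
        out_b = np.real(np.fft.ifft2(np.fft.ifftshift(Fb_new)))
        return out_a, out_b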



Paperid:119
Authors:Jiacheng Chen, Jiawei Jiang, Fei Wu, Jianwei Zheng
Zhejiang University of Technology, Zhejiang University of Technology, Zhejiang University of Technology, Zhejiang University of Technology
Abstract:
Consistency and interpretability have long been the critical issues in MRI reconstruction. While interpretability has been dramatically improved with the employment of deep unfolding networks (DUNs), current methods still suffer from inconsistencies and generate inferior anatomical structures. Especially in multi-contrast scenes, different imaging protocols often exacerbate this issue. In this paper, we propose a range-null decomposition-assisted DUN architecture to ensure consistency while still providing desirable interpretability. With the input decomposed, we argue that the inconsistency could be analytically relieved by feeding solely the null-space component into proximal mapping, while leaving the range-space counterpart fixed. More importantly, a correlation decoupling scheme is further proposed to narrow the information gap for multi-contrast fusion, which dynamically borrows isotropic features from the opponent while maintaining the modality-specific ones. Specifically, the two features are attached to different frequencies and learned individually by the newly designed isotropy encoder and anisotropy encoder. The former strives for the contrast-shared information, while the latter serves to capture the contrast-specific features. The quantitative and qualitative results show that our proposal outperforms most cutting-edge methods by a large margin. Codes will be released at https://github.com/chenjiachengzzz/RNU.
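For reference, the range-null decomposition the abstract relies on is the standard identity below (notation ours, not necessarily the paper's): with undersampling operator A, pseudo-inverse A^+, measurements y = Ax, and a learned proximal mapping f_theta applied only to the null-space component,

    \hat{x} \;=\; A^{+}y \;+\; \bigl(I - A^{+}A\bigr)\, f_{\theta}(x),
    \qquad A\hat{x} \;=\; A A^{+} y \;=\; y ,

so the reconstruction reproduces the acquired measurements exactly, which is the consistency property the abstract emphasizes.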



Paperid:120
Authors:Jiafu Chen, Wei Xing, Jiakai Sun, Tianyi Chu, Yiling Huang, Boyan Ji, Lei Zhao, Huaizhong Lin, Haibo Chen, Zhizhong Wang
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang Univiersity, Zhejiang University, Zhejiang University, Nanjing University of Science and Technology, Zhejiang University
Abstract:
3D scene stylization refers to transforming the appearance of a 3D scene to match a given style image, ensuring that images rendered from different viewpoints exhibit the same style as the given style image, while maintaining the 3D consistency of the stylized scene. Several existing methods have obtained impressive results in stylizing 3D scenes. However, the models proposed by these methods need to be re-trained when applied to a new scene. In other words, their models are coupled with a specific scene and cannot adapt to arbitrary other scenes. To address this issue, we propose a novel 3D scene stylization framework to transfer an arbitrary style to an arbitrary scene, without any style-related or scene-related re-training. Concretely, we first map the appearance of the 3D scene into a 2D style pattern space, which realizes complete disentanglement of the geometry and appearance of the 3D scene and makes our model generalize to arbitrary 3D scenes. Then we stylize the appearance of the 3D scene in the 2D style pattern space via a prompt-based 2D stylization algorithm. Experimental results demonstrate that our proposed framework is superior to SOTA methods in both visual quality and generalization.



Paperid:121
Authors:Jiankang Chen, Tong Zhang, Wei-Shi Zheng, Ruixuan Wang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China, Peng Cheng Laboratory, Shenzhen, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Peng Cheng Laboratory, Shenzhen, China Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China
Abstract:
Out-of-distribution (OOD) detection is crucial in many real-world applications. However, intelligent models are often trained solely on in-distribution (ID) data, leading to overconfidence when misclassifying OOD data as ID classes. In this study, we propose a new learning framework which leverages simple Jigsaw-based fake OOD data and rich semantic embeddings (`anchors') from ChatGPT descriptions of ID knowledge to help guide the training of the image encoder. The learning framework can be flexibly combined with existing post-hoc approaches to OOD detection, and extensive empirical evaluations on multiple OOD detection benchmarks demonstrate that rich textual representations of ID knowledge and fake OOD knowledge can effectively help train a visual encoder for OOD detection. With the learning framework, new state-of-the-art performance was achieved on all the benchmarks. The code is available at https://github.com/Cverchen/TagFog.
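The Jigsaw-based fake OOD data mentioned above amounts to scrambling the spatial layout of an ID image; a minimal sketch (the grid size is an assumption, and the paper's pipeline is more involved) could look like:

    # Hedged sketch: jigsaw-shuffle an image's tiles to manufacture a fake OOD sample.
    import torch

    def jigsaw_shuffle(img, grid=4):
        # img: (C, H, W) tensor with H and W divisible by `grid`
        c, h, w = img.shape
        th, tw = h // grid, w // grid
        tiles = img.unfold(1, th, th).unfold(2, tw, tw)      # (C, grid, grid, th, tw)
        tiles = tiles.reshape(c, grid * grid, th, tw)
        perm = torch.randperm(grid * grid)
        tiles = tiles[:, perm]                               # scramble the spatial layout
        tiles = tiles.reshape(c, grid, grid, th, tw)
        return tiles.permute(0, 1, 3, 2, 4).reshape(c, h, w)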



Paperid:122
Authors:Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang
Sun Yat-sen University, Institute of Automation, Chinese Academy of Sciences (CASIA), Bytedance Inc, Bytedance Inc, Bytedance Inc, Sun Yat-sen University, Sun Yat-sen University
Abstract:
Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 4x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.



Paperid:123
Authors:Kaitao Chen, Shiliang Sun, Jing Zhao
East China Normal University, East China Normal University, East China Normal University
Abstract:
Whole slide image (WSI) classification is a crucial component in automated pathology analysis. Due to the inherent challenges of high-resolution WSIs and the absence of patch-level labels, most of the proposed methods follow the multiple instance learning (MIL) formulation. While MIL has been equipped with excellent instance feature extractors and aggregators, it is prone to learn spurious associations that undermine the performance of the model. For example, relying solely on color features may lead to erroneous diagnoses due to spurious associations between the disease and the color of patches. To address this issue, we develop a causal MIL framework for WSI classification, effectively distinguishing between causal and spurious associations. Specifically, we use the expectation of the intervention P(Y | do(X)) for bag prediction rather than the traditional likelihood P(Y | X). By applying the front-door adjustment, the spurious association is effectively blocked, where the intervened mediator is aggregated from patch-level features. We evaluate our proposed method on two publicly available WSI datasets, Camelyon16 and TCGA-NSCLC. Our causal MIL framework shows outstanding performance and is plug-and-play, seamlessly integrating with various feature extractors and aggregators.
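The intervention the abstract computes is the textbook front-door adjustment, written below with M as the aggregated patch-level mediator (notation ours, not the paper's):

    P\bigl(Y \mid do(X)\bigr) \;=\; \sum_{m} P(m \mid X)\, \sum_{x'} P\bigl(Y \mid m, x'\bigr)\, P(x') ,

which blocks the back-door path through unobserved confounders (e.g., a color or staining bias) because the mediator is shielded from them.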



Paperid:124
Authors:Lianggangxu Chen, Youqi Song, Yiqing Cai, Jiale Lu, Yang Li, Yuan Xie, Changbo Wang, Gaoqi He
East China Normal University, East China Normal University, East China Normal University, East China Normal University, East China Normal University, East China Normal University, East China Normal University, East China Normal University Chongqing Key Laboratory of Precision Optics
Abstract:
In the domain of scene graph generation, modeling commonsense as a single-prototype representation has been typically employed to facilitate the recognition of infrequent predicates. However, a fundamental challenge lies in the large intra-class variations of the visual appearance of predicates, resulting in subclasses within a predicate class. Such a challenge typically leads to the problem of misclassifying diverse predicates due to the rough predicate space clustering. In this paper, inspired by cognitive science, we maintain multi-prototype representations for each predicate class, which can accurately find the multiple class centers of the predicate space. Technically, we propose a novel multi-prototype learning framework consisting of three main steps: prototype-predicate matching, prototype updating, and prototype space optimization. We first design a triple-level optimal transport to match each predicate feature within the same class to a specific prototype. In addition, the prototypes are updated using momentum updating to find the class centers according to the matching results. Finally, we enhance the inter-class separability of the prototype space through iterations of the inter-class separability loss and intra-class compactness loss. Extensive evaluations demonstrate that our approach significantly outperforms state-of-the-art methods on the Visual Genome dataset.



Paperid:125
Authors:Lianggangxu Chen, Youqi Song, Shaohui Lin, Changbo Wang, Gaoqi He
East China Normal University, East China Normal University, East China Normal University, East China Normal University, East China Normal University Chongqing Key Laboratory of Precision Optics
Abstract:
Graph neural networks (GNNs) have demonstrated their capabilities in the field of scene graph generation (SGG) by updating node representations from neighboring nodes. This updating can be viewed as a form of low-pass filtering in the spatial domain, which smooths node feature representations and retains commonalities among nodes. However, spatial GNNs do not work well in the case of heterophilic SGG, in which fine-grained predicates are always connected to a large number of coarse-grained predicates. Blind smoothing undermines the discriminative information of the fine-grained predicates, resulting in failure to predict them accurately. To address the heterophily, our key idea is to design tailored filters by wavelet transform from the spectral domain. First, we prove rigorously that when the heterophily on the scene graph increases, the spectral energy gradually shifts towards the high-frequency part. Inspired by this observation, we subsequently propose the Kumaraswamy Wavelet Graph Neural Network (KWGNN). KWGNN leverages complementary multi-group Kumaraswamy wavelets to cover all frequency bands. Finally, KWGNN adaptively generates band-pass filters and then integrates the filtering results to better accommodate varying levels of smoothness on the graph. Comprehensive experiments on the Visual Genome and Open Images datasets show that our method achieves state-of-the-art performance.



Paperid:126
Authors:Lin Chen, Zhijie Jia, Lechao Cheng, Yang Gao, Jie Lei, Yijun Bei, Zunlei Feng
Zhejiang University, Zhejiang University, Zhejiang Lab, Zhejiang University, Zhejiang University of Technology, Zhejiang University, Zhejiang University
Abstract:
A surge of interest has emerged in utilizing Transformers in diverse vision tasks owing to their formidable performance. However, existing approaches primarily focus on optimizing internal model architecture designs that often entail significant trial and error with high burdens. In this work, we propose a new paradigm dubbed Decision Stream Calibration that boosts the performance of general Vision Transformers. To achieve this, we shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions. Upon further analysis, it was discovered that 1) the final decision is associated with tokens of foreground targets: token features of foreground targets are transmitted to the next layer as much as possible, while useless token features of background areas are gradually eliminated during forward propagation; and 2) each category is solely associated with specific sparse dimensions in the tokens. Based on the discoveries mentioned above, we designed a two-stage calibration scheme, namely ViT-Calibrator, consisting of a token propagation calibration stage and a dimension propagation calibration stage. Extensive experiments on commonly used datasets show that the proposed approach can achieve promising results.



Paperid:127
Authors:Linsheng Chen, Guangrun Wang, Liuchun Yuan, Keze Wang, Ken Deng, Philip H.S. Torr
Sun Yat-sen University, University of Oxford, Sun Yat-sen University, Sun Yat-sen University, Sun Yat-sen University, University of Oxford
Abstract:
Neural Radiance Fields (NeRF) have garnered remarkable success in novel view synthesis. Nonetheless, the task of generating high-quality images for novel views persists as a critical challenge. While the existing efforts have exhibited commendable progress, capturing intricate details, enhancing textures, and achieving superior Peak Signal-to-Noise Ratio (PSNR) metrics warrant further focused attention and advancement. In this work, we propose NeRF-VPT, an innovative method for novel view synthesis to address these challenges. Our proposed NeRF-VPT employs a cascading view prompt tuning paradigm, wherein RGB information gained from preceding rendering outcomes serves as instructive visual prompts for subsequent rendering stages, with the aspiration that the prior knowledge embedded in the prompts can facilitate the gradual enhancement of rendered image quality. NeRF-VPT only requires sampling RGB data from previous stage renderings as priors at each training stage, without relying on extra guidance or complex techniques. Thus, our NeRF-VPT is plug-and-play and can be readily integrated into existing methods. By conducting comparative analyses of our NeRF-VPT against several NeRF-based approaches on demanding real-scene benchmarks, such as Realistic Synthetic 360, Real Forward-Facing, Replica dataset, and a user-captured dataset, we substantiate that our NeRF-VPT significantly elevates baseline performance and proficiently generates more high-quality novel view images than all the compared state-of-the-art methods. Furthermore, the cascading learning of NeRF-VPT introduces adaptability to scenarios with sparse inputs, resulting in a significant enhancement of accuracy for sparse-view novel view synthesis. The source code and dataset are available at https://github.com/Freedomcls/NeRF-VPT.



Paperid:128
Authors:Qi Chen, Dileepa Pitawela, Chongyang Zhao, Gengze Zhou, Hsiang-Ting Chen, Qi Wu
The University of Adelaide, The University of Adelaide, The University of Adelaide, The University of Adelaide, The University of Adelaide, The University of Adelaide
Abstract:
The Vision-and-Language Navigation (VLN) task aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments, ultimately reaching specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task that only pays attention to vision and instruction (language), the WebVLN agent further considers underlying web-specific content like HTML, which cannot be seen on the rendered web pages yet contains rich visual and textual information. Toward this goal, we contribute a dataset, WebVLN-v1, and introduce a novel approach called Website-aware VLN Network (WebVLN-Net), which is built upon the foundation of state-of-the-art VLN techniques. Experimental results show that WebVLN-Net outperforms current VLN and web-related navigation methods. We believe that the introduction of the new WebVLN task and its dataset will establish a new dimension within the VLN domain and contribute to the broader vision-and-language research community. Code is available at: https://github.com/WebVLN/WebVLN.



Paperid:129
Authors:Qihua Chen, Xuejin Chen, Chenxuan Wang, Yixiong Liu, Zhiwei Xiong, Feng Wu
University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract:
The current neuron reconstruction pipeline for electron microscopy (EM) data usually includes automatic image segmentation followed by extensive human expert proofreading. In this work, we aim to reduce human workload by predicting connectivity between over-segmented neuron pieces, taking both microscopy image and 3D morphology features into account, similar to human proofreading workflow. To this end, we first construct a dataset, named FlyTracing, that contains millions of pairwise connections of segments spanning the whole fly brain, which is three orders of magnitude larger than existing datasets for neuron segment connection. To learn sophisticated biological imaging features from the connectivity annotations, we propose a novel connectivity-aware contrastive learning method to generate dense volumetric EM image embedding. The learned embeddings can be easily incorporated with any point or voxel-based morphological representations for automatic neuron tracing. Extensive comparisons of different combination schemes of image and morphological representation in identifying split errors across the whole fly brain demonstrate the superiority of the proposed approach, especially for the locations that contain severe imaging artifacts, such as section missing and misalignment. The dataset and code are available at https://github.com/Levishery/Flywire-Neuron-Tracing.



Paperid:130
Authors:Siran Chen, Yue Ma, Yu Qiao, Yali Wang
University of Chinese Academy of Science Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Tsinghua University, Shanghai AI Laboratory Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Shanghai AI Laboratory
Abstract:
3D perception is a critical problem in autonomous driving. Recently, the Bird's-Eye-View (BEV) approach has attracted extensive attention, due to low-cost deployment and desirable vision detection capacity. However, the existing models ignore a realistic scenario during the driving procedure, i.e., one or more view cameras may fail, which largely deteriorates their performance. To tackle this problem, we propose a generic Masked BEV (M-BEV) perception framework, which can effectively improve robustness to this challenging scenario, by random masking and reconstructing camera views in the end-to-end training. More specifically, we develop a novel Masked View Reconstruction (MVR) module in our M-BEV. It mimics various missing cases by randomly masking features of different camera views, then leverages the original features of these views as self-supervision and reconstructs the masked ones with the distinct spatio-temporal context across camera views. Via such a plug-and-play MVR, our M-BEV is capable of learning the missing views from the remaining ones, and is thus well generalized for robust view recovery and accurate perception during testing. We perform extensive experiments on the popular NuScenes benchmark, where our framework can significantly boost 3D perception performance of the state-of-the-art models on various missing view cases, e.g., for the absence of the back view, our M-BEV promotes the PETRv2 model with a 10.3% mAP gain.
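
A rough sketch of the masked-view reconstruction idea under stated assumptions: per-view feature tokens, one randomly "failed" camera per sample, and a generic transformer mixer standing in for the actual M-BEV architecture. The class and variable names are hypothetical, not the paper's API.

    import torch
    import torch.nn as nn

    class MaskedViewReconstruction(nn.Module):
        # Randomly drop one camera's features and regress them from the remaining views.
        def __init__(self, n_views=6, dim=64):
            super().__init__()
            self.mask_token = nn.Parameter(torch.zeros(dim))
            self.mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.head = nn.Linear(dim, dim)

        def forward(self, view_feats):                 # (B, n_views, n_tokens, dim)
            B, V, N, D = view_feats.shape
            target = view_feats.detach()               # original features act as self-supervision
            drop = torch.randint(0, V, (B,))           # one randomly "failed" camera per sample
            masked = view_feats.clone()
            masked[torch.arange(B), drop] = self.mask_token
            mixed = self.mixer(masked.flatten(1, 2))   # cross-view spatio-temporal context
            recon = self.head(mixed).view(B, V, N, D)
            return ((recon[torch.arange(B), drop] - target[torch.arange(B), drop]) ** 2).mean()

    mvr = MaskedViewReconstruction()
    loss = mvr(torch.randn(2, 6, 49, 64))   # e.g., 6 surround cameras, 7x7 feature tokens each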



Paperid:131
Authors:Taiyan Chen, Xianghua Ying, Jinfa Yang, Ruibin Wang, Ruohao Guo, Bowei Xing, Ji Shi
Peking University, Peking University, Peking University, Peking University, Peking University, Peking University, Peking University
Abstract:
In the field of vanishing point detection, previous works commonly relied on extracting and clustering straight lines or classifying candidate points as vanishing points. This paper proposes a novel end-to-end framework, called VPDETR (Vanishing Point DEtection TRansformer), that views vanishing point detection as a set prediction problem, applicable to both Manhattan and non-Manhattan world datasets. By using the positional embedding of anchor points as queries in Transformer decoders and dynamically updating them layer by layer, our method is able to directly input images and output their vanishing points without the need for explicit straight line extraction and candidate points sampling. Additionally, we introduce an orthogonal loss and a cross-prediction loss to improve accuracy on the Manhattan world datasets. Experimental results demonstrate that VPDETR achieves competitive performance compared to state-of-the-art methods, without requiring post-processing.



Paperid:132
Authors:Tianxiang Chen, Zhentao Tan, Qi Chu, Yue Wu, Bin Liu, Nenghai Yu
School of Cyber Science and Technology, University of Science and Technology of China Alibaba Group Key Laboratory of Electromagnetic Space Information, Chinese Academy of Sciences, School of Cyber Science and Technology, University of Science and Technology of China Alibaba Group Key Laboratory of Electromagnetic Space Information, Chinese Academy of Sciences, School of Cyber Science and Technology, University of Science and Technology of China Key Laboratory of Electromagnetic Space Information, Chinese Academy of Sciences, Alibaba Group, School of Cyber Science and Technology, University of Science and Technology of China Key Laboratory of Electromagnetic Space Information, Chinese Academy of Sciences, School of Cyber Science and Technology, University of Science and Technology of China Key Laboratory of Electromagnetic Space Information, Chinese Academy of Sciences
Abstract:
Infrared small target detection (ISTD) is critical to national security and has been extensively applied in military areas. ISTD aims to segment small target pixels from the background. Most ISTD networks focus on designing feature extraction blocks or feature fusion modules, but rarely describe the ISTD process from the feature map evolution perspective. In the ISTD process, the network attention gradually shifts towards target areas. We abstract this process as the directional movement of feature map pixels to target areas through convolution, pooling and interactions with surrounding pixels, which can be analogous to the movement of thermal particles constrained by surrounding variables and particles. In light of this analogy, we propose the Thermal Conduction-Inspired Transformer (TCI-Former) based on the theoretical principles of thermal conduction. According to the thermal conduction differential equation in heat dynamics, we derive the pixel movement differential equation (PMDE) in the image domain and further develop two modules: Thermal Conduction-Inspired Attention (TCIA) and Thermal Conduction Boundary Module (TCBM). TCIA incorporates the finite difference method with PMDE to reach a numerical approximation so that target body features can be extracted. To further remove errors in boundary areas, TCBM is designed and supervised by boundary masks to refine target body features with fine boundary details. Experiments on IRSTD-1k and NUAA-SIRST demonstrate the superiority of our method.



Paperid:133
Authors:Xuanhong Chen, Hang Wang, Jialiang Chen, Kairui Feng, Jinfan Liu, Xiaohang Wang, Weimin Zhang, Bingbing Ni
Shanghai Jiao Tong University USC-SJTU Institute of Cultural and Creative Industry, Huawei, Shanghai Jiao Tong University USC-SJTU Institute of Cultural and Creative Industry, National Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University USC-SJTU Institute of Cultural and Creative Industry, Shanghai Jiao Tong University USC-SJTU Institute of Cultural and Creative Industry
Abstract:
Depth map super-resolution (DSR) plays an indispensable role in 3D vision. We discover a non-trivial spectral phenomenon: the components of high-resolution (HR) and low-resolution (LR) depth maps manifest the same intrinsic phase, and the spectral phase of RGB is a superset of them, which suggests that a phase-aware filter can assist in the precise use of RGB cues. Motivated by this, we propose an intrinsic phase-preserving DSR paradigm, named IPPNet, to fully exploit inter-modality collaboration in a mutually guided way. In a nutshell, a novel Phase-Preserving Filtering Module (PPFM) is developed to generate dynamic phase-aware filters according to the LR depth flow to filter out erroneous noisy components contained in RGB and then conduct depth enhancement via the modulation of the phase-preserved RGB signal. By stacking multiple PPFM blocks, the proposed IPPNet is capable of reaching a highly competitive restoration performance. Extensive experiments on various benchmark datasets, e.g., NYU v2 and RGB-D-D, show state-of-the-art performance and well demonstrate the validity of the proposed phase-preserving scheme. Code: https://github.com/neuralchen/IPPNet/.



Paperid:134
Authors:Xuyang Chen, Dong Wang, Konrad Schindler, Mingwei Sun, Yongliang Wang, Nicolo Savioli, Liqiu Meng
Riemann Lab, Huawei Technical University of Munich, Riemann Lab, Huawei, ETH Zurich, Riemann Lab, Huawei Wuhan University, Riemann Lab, Huawei, Riemann Lab, Huawei, Technical University of Munich
Abstract:
Recently, Transformer-based text detection techniques have sought to predict polygons by encoding the coordinates of individual boundary vertices using distinct query features. However, this approach incurs a significant memory overhead and struggles to effectively capture the intricate relationships between vertices belonging to the same instance. Consequently, irregular text layouts often lead to the prediction of outlined vertices, diminishing the quality of results. To address these challenges, we present an innovative approach rooted in Sparse R-CNN: a cascade decoding pipeline for polygon prediction. Our method ensures precision by iteratively refining polygon predictions, considering both the scale and location of preceding results. Leveraging this stabilized regression pipeline, even employing just a single feature vector to guide polygon instance regression yields promising detection results. Simultaneously, leveraging instance-level feature proposals substantially enhances memory efficiency (> 50% less vs. the SOTA method DPText-DETR) and reduces inference time (> 40% less vs. DPText-DETR) with comparable performance on benchmarks. The code is available at https://github.com/Albertchen98/Box2Poly.git.



Paperid:135
Authors:Yanzhe Chen, Huasong Zhong, Xiangteng He, Yuxin Peng, Jiahuan Zhou, Lele Cheng
Peking University, Kuaishou Technology, Peking University, Peking University, Peking University, Kuaishou Technology
Abstract:
The goal of composed fashion image retrieval is to locate a target image based on a reference image and modified text. Recent methods utilize symmetric encoders (e.g., CLIP) pretrained on large-scale non-fashion datasets. However, the input for this task exhibits an asymmetric nature, where the reference image contains rich content while the modified text is often brief. Therefore, methods employing symmetric encoders encounter a severe phenomenon: retrieval results dominated by reference images, leading to the oversight of modified text. We propose a Fashion Enhance-and-Refine Network (FashionERN) centered around two aspects: enhancing the text encoder and refining visual semantics. We introduce a Triple-branch Modifier Enhancement model, which injects relevant information from the reference image and aligns the modified text modality with the target image modality. Furthermore, we propose a Dual-guided Vision Refinement model that retains critical visual information through text-guided refinement and self-guided refinement processes. The combination of these two models significantly mitigates the reference dominance phenomenon, ensuring accurate fulfillment of modifier requirements. Comprehensive experiments demonstrate our approach's state-of-the-art performance on four commonly used datasets.



Paperid:136
Authors:Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin
S-Lab, Nanyang Technological University School of Computer Science and Engineering, Nanyang Technological University, Tencent PCG, China, School of Computer Science and Engineering, Nanyang Technological University, SenseTime Research, Tencent PCG, China, SenseTime Research, S-Lab, Nanyang Technological University School of Computer Science and Engineering, Nanyang Technological University
Abstract:
Recent strides in Text-to-3D techniques have been propelled by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often grapple with challenges such as over-saturation, inadequate detailing, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images based on the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, challenges such as view inconsistency and significant content variance persist due to the inherent generative nature of large diffusion models, posing extensive difficulties in leveraging these images effectively. To overcome this hurdle, we advocate integrating a discriminator alongside a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method over baseline approaches.



Paperid:137
Authors:Yujun Chen, Xin Tan, Zhizhong Zhang, Yanyun Qu, Yuan Xie
School of Computer Science and Technology, East China Normal University Chongqing Institute, East China Normal University, School of Computer Science and Technology, East China Normal University Chongqing Institute, East China Normal University, School of Computer Science and Technology, East China Normal University Chongqing Institute, East China Normal University, School of Information Science and Engineering, Xiamen University, School of Computer Science and Technology, East China Normal University Chongqing Institute, East China Normal University
Abstract:
Given the exorbitant expense of labeling autopilot datasets and the growing trend of utilizing unlabeled data, semi-supervised segmentation on point clouds becomes increasingly imperative. Intuitively, finding out more ``unspoken words'' (i.e., latent instance information) beyond the label itself should be helpful to improve performance. In this paper, we discover two types of latent labels behind the displayed label embedded in LiDAR and image data. First, in the LiDAR Branch, we propose a novel augmentation, Cylinder-Mix, which is able to augment more yet reliable samples for training. Second, in the Image Branch, we propose the Instance Position-scale Learning (IPSL) Module to learn and fuse the information of instance position and scale, which is from a 2D pre-trained detector and a type of latent label obtained from 3D to 2D projection. Finally, the two latent labels are embedded into the multi-modal panoptic segmentation network. The ablation of the IPSL module demonstrates its robust adaptability, and the experiments evaluated on SemanticKITTI and nuScenes demonstrate that our model outperforms the state-of-the-art method, LaserMix.



Paperid:138
Authors:Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Zhiqing Sun, Dan Gutfreund, Chuang Gan
MIT-IBM Watson AI Lab, UMass Amherst, MIT-IBM Watson AI Lab, University of California, Los Angeles, Carnegie Mellon University, MIT-IBM Watson AI Lab, MIT-IBM Watson AI Lab UMass Amherst
Abstract:
Knowledge-based visual reasoning remains a daunting task since it not only requires machines to interpret the concepts and relationships from visual scenes but also associate them with external world knowledge to conduct a chain of reasoning on open-world questions. Previous works, however, treat visual perception and language-based reasoning as two independent modules, failing to attend to both modules throughout all stages of reasoning. To this end, we propose Visual Chain-of-thought Prompting (VCTP) for knowledge-based reasoning, which involves the interaction between visual content and natural language in an iterative step-by-step reasoning manner. VCTP contains three stages: see, think, and confirm. The see stage scans the image and grounds the visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to attend to key visual concepts from natural language questions adaptively. It then transforms key visual context into text context for prompting with a visual captioning model, and adopts the LLM to generate the answer. The confirm stage further uses the LLM to generate the supporting rationale for the answer, which is then passed through a cross-modality classifier to verify that it is consistent with the visual context. We iterate through the think-confirm stages to ensure the verified rationale is consistent with the answer. We conduct experiments on a range of knowledge-based visual reasoning datasets. We find that our VCTP enjoys several benefits: 1) it achieves better performance than the previous few-shot learning baselines; 2) it enjoys the total transparency and trustworthiness of the whole reasoning process by providing rationales for each reasoning step; 3) it is computation-efficient compared with other fine-tuning baselines. Our code is available at https://github.com/UMass-Foundation-Model/VisualCoT.git



Paperid:139
Authors:Zhengrui Chen, Liying Lu, Ziyang Yuan, Yiming Zhu, Yu Li, Chun Yuan, Weihong Deng
Beijing University of Posts and Telecommunications, International Digital Economy Academy, Tsinghua University, Tsinghua University International Digital Economy Academy, International Digital Economy Academy, Tsinghua University, Beijing University of Posts and Telecommunications
Abstract:
Blind face restoration under extreme conditions involves reconstructing high-quality face images from severely degraded inputs. These input images are often of poor quality and have extreme facial poses, leading to errors in facial structure and unnatural artifacts within the restored images. In this paper, we show that utilizing 3D priors effectively compensates for structure knowledge deficiencies in 2D priors while preserving the texture details. Based on this, we introduce FREx (Face Restoration under Extreme conditions) that combines structure-accurate 3D priors and texture-rich 2D priors in pretrained generative networks for blind face restoration under extreme conditions. To fuse the different information in 3D and 2D priors, we introduce an adaptive weight module that adjusts the importance of features based on the input image's condition. With this approach, our model can restore structure-accurate and natural-looking faces even when the images have lost a lot of information due to degradation and extreme pose. Extensive experimental results on synthetic and real-world datasets validate the effectiveness of our methods.



Paperid:140
Authors:Zhongxi Chen, Ke Sun, Xianming Lin
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Abstract:
Camouflaged Object Detection (COD) is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. Existing COD methods struggle with nuanced object boundaries and overconfident incorrect predictions. In response, we propose a new paradigm that treats COD as a conditional mask-generation task leveraging diffusion models. Our method, dubbed CamoDiffusion, employs the denoising process to progressively refine predictions while incorporating image conditions. Due to the stochastic sampling process of diffusion, our model is capable of sampling multiple possible predictions, avoiding the problem of overconfident point estimation. Moreover, we develop a specialized network architecture, training, and sampling strategies to enhance the model's expressive power and refinement capabilities and to suppress overconfident mis-segmentations, thus aptly tailoring the diffusion model to the demands of COD. Extensive experiments on three COD datasets attest to the superior performance of our model compared to existing state-of-the-art methods, particularly on the most challenging COD10K dataset, where our approach achieves 0.019 in terms of MAE. Codes and models are available at https://github.com/Rapisurazurite/CamoDiffusion.



Paperid:141
Authors:Zhuowei Chen, Shancheng Fang, Wei Liu, Qian He, Mengqi Huang, Zhendong Mao
University of Science and Technology of China ByteDance, University of Science and Technology of China, ByteDance, ByteDance, University of Science and Technology of China, University of Science and Technology of China
Abstract:
While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centric images, an intractable problem is how to preserve the face identity and follow the text prompts simultaneously for conditioned input face images and texts. Despite existing encoder-based methods achieving high efficiency and decent face similarity, the generated image often fails to follow the textual prompts. To ease this editability issue, we present DreamIdentity, to learn edit-friendly and accurate face-identity representations in the word embedding space. Specifically, we propose self-augmented editability learning to enhance the editability for projected embedding, which is achieved by constructing paired generated celebrity's face and edited celebrity images for training, aiming at transferring mature editability of off-the-shelf text-to-image models in celebrity to unseen identities. Furthermore, we design a novel dedicated face-identity encoder to learn an accurate representation of human faces, which applies multi-scale ID-aware features followed by a multi-embedding projector to generate the pseudo words in the text embedding space directly. Extensive experiments show that our method can generate more text-coherent and ID-preserved images with negligible time overhead compared to the standard text-to-image generation process.



Paperid:142
Authors:Zida Chen, Ziran Zhang, Haoying Li, Menghao Li, Yueting Chen, Qi Li, Huajun Feng, Zhihai Xu, Shiqi Chen
State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University, State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University; Shanghai Artificial Intelligence Laboratory, State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University, State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University, State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University, State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University, State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University, State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University, State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University
Abstract:
Linear Array Pushbroom (LAP) imaging technology is widely used in the realm of remote sensing. However, images acquired through LAP always suffer from distortion and blur because of camera jitter. Traditional methods for restoring LAP images, such as algorithms estimating the point spread function (PSF), exhibit limited performance. To tackle this issue, we propose a Jitter-Aware Restoration Network (JARNet) to remove the distortion and blur in two stages. In the first stage, we formulate an Optical Flow Correction (OFC) block to refine the optical flow of the degraded LAP images, resulting in pre-corrected images where most of the distortions are alleviated. In the second stage, for further enhancement of the pre-corrected images, we integrate two jitter-aware techniques within the Spatial and Frequency Residual (SFRes) block: 1) introducing Coordinate Attention (CoA) to the SFRes block in order to capture the jitter state in the orthogonal direction; 2) manipulating image features in both spatial and frequency domains to leverage local and global priors. Additionally, we develop a data synthesis pipeline, which applies a Continue Dynamic Shooting Model (CDSM) to simulate realistic degradation in LAP images. Both the proposed JARNet and the LAP image synthesis pipeline establish a foundation for addressing this intricate challenge. Extensive experiments demonstrate that the proposed two-stage method outperforms state-of-the-art image restoration models. Code is available at https://github.com/JHW2000/JARNet.



Paperid:143
Authors:Ri Cheng, Ruian He, Xuhao Jiang, Shili Zhou, Weimin Tan, Bo Yan
School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
Abstract:
Existing recurrent optical flow estimation networks are computationally expensive since they use a fixed large number of iterations to update the flow field for each sample. An efficient network should skip iterations when the flow improvement is limited. In this paper, we develop a Context-Aware Iteration Policy Network for efficient optical flow estimation, which determines the optimal number of iterations per sample. The policy network achieves this by learning contextual information to recognize whether flow improvement is bottlenecked or minimal. On the one hand, we use an iteration embedding and a historical hidden cell, which include information from previous iterations, to convey how the flow has changed over previous iterations. On the other hand, we use the incremental loss to make the policy network implicitly perceive the magnitude of optical flow improvement in the subsequent iteration. Furthermore, the computational complexity in our dynamic network is controllable, allowing us to satisfy various resource preferences with a single trained model. Our policy network can be easily integrated into state-of-the-art optical flow networks. Extensive experiments show that our method maintains performance while reducing FLOPs by about 40%/20% for the Sintel/KITTI datasets.
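
A minimal sketch of a per-sample halting policy in a recurrent refinement loop, assuming a generic hidden state and iteration embedding; the decision rule, threshold, and shapes here are illustrative, not the paper's exact policy or training procedure.

    import torch
    import torch.nn as nn

    class IterationPolicy(nn.Module):
        # From the hidden state and an iteration embedding, predict whether
        # further flow refinement is worthwhile for each sample.
        def __init__(self, dim=128, max_iters=12):
            super().__init__()
            self.iter_embed = nn.Embedding(max_iters, dim)
            self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

        def forward(self, hidden, t):                       # hidden: (B, dim), t: int
            e = self.iter_embed(torch.full((hidden.size(0),), t, dtype=torch.long))
            return torch.sigmoid(self.score(torch.cat([hidden, e], dim=-1)))  # continue prob

    B, dim, max_iters = 4, 128, 12
    policy = IterationPolicy(dim, max_iters)
    hidden = torch.randn(B, dim)
    flow = torch.zeros(B, 2, 32, 32)
    active = torch.ones(B, dtype=torch.bool)
    for t in range(max_iters):
        if not active.any():
            break
        p_continue = policy(hidden, t).squeeze(-1)
        active = active & (p_continue > 0.5)                # skip iterations with little expected gain
        # a real flow network would update `flow` and `hidden` only for active samples here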



Paperid:144
Authors:Weihao Cheng, Yan-Pei Cao, Ying Shan
Tencent, Tencent, Tencent
Abstract:
We study generating novel views of indoor scenes given sparse input views. The challenge is to achieve both photorealism and view consistency. We present SparseGNV: a learning framework that incorporates 3D structures and image generative models to generate novel views with three modules. The first module builds a neural point cloud as underlying geometry, providing scene context and guidance for the target novel view. The second module utilizes a transformer-based network to map the scene context and the guidance into a shared latent space and autoregressively decodes the target view in the form of discrete image tokens. The third module reconstructs the tokens back to the image of the target view. SparseGNV is trained across a large-scale indoor scene dataset to learn generalizable priors. Once trained, it can efficiently generate novel views of an unseen indoor scene in a feed-forward manner. We evaluate SparseGNV on real-world indoor scenes and demonstrate that it outperforms state-of-the-art methods based on either neural radiance fields or conditional image generation.



Paperid:145
Authors:Yean Cheng, Renjie Wan, Shuchen Weng, Chengxuan Zhu, Yakun Chang, Boxin Shi
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University National Engineering Research Center of Visual Technology, School of Computer Science, Peking University AI Innovation Center, School of Computer Science, Peking University, Department of Computer Science, Hong Kong Baptist University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, National Key Laboratory of General AI, School of Intelligence Science and Technology, Peking University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University National Engineering Research Center of Visual Technology, School of Computer Science, Peking University AI Innovation Center, School of Computer Science, Peking University
Abstract:
Though Neural Radiance Fields (NeRF) can produce colorful 3D representations of the world by using a set of 2D images, such ability becomes nonexistent when only monochromatic images are provided. Since color is necessary in representing the world, reproducing color from monochromatic radiance fields becomes crucial. To achieve this goal, instead of manipulating the monochromatic radiance fields directly, we consider it as a representation-prediction task in the Lab color space. By first constructing the luminance and density representation using monochromatic images, our prediction stage can recreate color representation on the basis of an image colorization module. We then reproduce a colorful implicit model through the representation of luminance, density, and color. Extensive experiments have been conducted to validate the effectiveness of our approaches. Our project page: https://liquidammonia.github.io/color-nerf.



Paperid:146
Authors:Zesen Cheng, Kehan Li, Peng Jin, Siheng Li, Xiangyang Ji, Li Yuan, Chang Liu, Jie Chen
School of Electronic and Computer Engineering, Peking University, Shenzhen, China, School of Electronic and Computer Engineering, Peking University, Shenzhen, China, School of Electronic and Computer Engineering, Peking University, Shenzhen, China, Tsinghua University, Beijing, China, Tsinghua University, Beijing, China, School of Electronic and Computer Engineering, Peking University, Shenzhen, China, Tsinghua University, Beijing, China, School of Electronic and Computer Engineering, Peking University, Shenzhen, China
Abstract:
Unified visual grounding (UVG) capitalizes on a wealth of task-related knowledge across various grounding tasks via one-shot training, which curtails retraining costs and task-specific architecture design efforts. Vertex generation-based UVG methods achieve this versatility through unified modeling of object box and contour prediction, and provide a text-powered interface to vast related multi-modal tasks, e.g., visual question answering and captioning. However, these methods typically generate vertexes sequentially through autoregression, which is prone to error accumulation and heavy computation, especially for high-dimension sequence generation in complex scenarios. In this paper, we develop Parallel Vertex Diffusion (PVD) based on the parallelizability of diffusion models to accurately and efficiently generate vertexes in a parallel and scalable manner. Since the coordinates fluctuate greatly, it typically encounters slow convergence when training diffusion models without geometry constraints. Therefore, we consummate our PVD with two critical components, i.e., a center anchor mechanism and an angle summation loss, which serve to normalize coordinates and adopt a differentiable geometry descriptor from the point-in-polygon problem of computational geometry to constrain the overall difference between prediction and label vertexes. These innovative designs empower our PVD to demonstrate its superiority with state-of-the-art performance across various grounding tasks.



Paperid:147
Authors:Dongmin Choi, Wonwoo Cho, Kangyeol Kim, Jaegul Choo
Letsur Inc. Korea Advanced Institute of Science and Technology, Letsur Inc. Korea Advanced Institute of Science and Technology, Letsur Inc. Korea Advanced Institute of Science and Technology, Letsur Inc. Korea Advanced Institute of Science and Technology
Abstract:
Accurately annotating multiple 3D objects in LiDAR scenes is laborious and challenging. While a few previous studies have attempted to leverage semi-automatic methods for cost-effective bounding box annotation, such methods have limitations in efficiently handling numerous multi-class objects. To effectively accelerate 3D annotation pipelines, we propose iDet3D, an efficient interactive 3D object detector. Supporting a user-friendly 2D interface, which can ease the cognitive burden of exploring 3D space to provide click interactions, iDet3D enables users to annotate the entire objects in each scene with minimal interactions. Taking the sparse nature of 3D point clouds into account, we design a negative click simulation (NCS) to improve accuracy by reducing false-positive predictions. In addition, iDet3D incorporates two click propagation techniques to take full advantage of user interactions: (1) dense click guidance (DCG) for keeping user-provided information throughout the network and (2) spatial click propagation (SCP) for detecting other instances of the same class based on the user-specified objects. Through our extensive experiments, we present that our method can construct precise annotations in a few clicks, which shows the practicality as an efficient annotation tool for 3D object detection.



Paperid:148
Authors:Jae-Ho Choi, Ki-Bong Kang, Kyung-Tae Kim
Stanford University, Samsung Electronics, Pohang University of Science and Technology
Abstract:
Remote physiology, which involves monitoring vital signs without the need for physical contact, has great potential for various applications. Current remote physiology methods rely only on a single camera or radio frequency (RF) sensor to capture the microscopic signatures from vital movements. However, our study shows that fusing deep RGB and RF features from both sensor streams can further improve performance. Because these multimodal features are defined in distinct dimensions and have varying contextual importance, the main challenge in the fusion process lies in the effective alignment of them and adaptive integration of features under dynamic scenarios. To address this challenge, we propose a novel vital sensing model, named FusionVital, that combines the RGB and RF modalities through the new introduction of pairwise input formats and transformer-based fusion strategies. We also perform comprehensive experiments based on a newly collected and released remote vital dataset comprising synchronized video-RF sensors, showing the superiority of the fusion approach over the previous single-sensor baselines in various aspects.



Paperid:149
Authors:Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, Jun-Cheng Chen
Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan (R.O.C.), Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan (R.O.C.), Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan (R.O.C.), Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan (R.O.C.)
Abstract:
This study introduces an efficient and effective method, MeDM, that utilizes pretrained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. By leveraging this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observation-space scores in latent Diffusion Models. Notably, MeDM does not require fine-tuning or test-time optimization of the Diffusion Models. Through extensive qualitative, quantitative, and subjective experiments on various benchmarks, the study demonstrates the effectiveness and superiority of the proposed approach. Our project page can be found at https://medm2023.github.io



Paperid:150
Authors:Tianyi Chu, Wei Xing, Jiafu Chen, Zhizhong Wang, Jiakai Sun, Lei Zhao, Haibo Chen, Huaizhong Lin
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Nanjing University of Science and Technology, Zhejiang University
Abstract:
Existing generative adversarial network (GAN) based conditional image generative models typically produce fixed output for the same conditional input, which is unreasonable for highly subjective tasks, such as large-mask image inpainting or style transfer. On the other hand, GAN-based diverse image generative methods require retraining/fine-tuning the network or designing complex noise injection functions, which is computationally expensive, task-specific, or struggles to generate high-quality results. Given that many deterministic conditional image generative models have been able to produce high-quality yet fixed results, we raise an intriguing question: is it possible for pre-trained deterministic conditional image generative models to generate diverse results without changing network structures or parameters? To answer this question, we re-examine the conditional image generation tasks from the perspective of adversarial attack and propose a simple and efficient plug-in projected gradient descent (PGD) like method for diverse and controllable image generation. The key idea is attacking the pre-trained deterministic generative models by adding a micro perturbation to the input condition. In this way, diverse results can be generated without any adjustment of network structures or fine-tuning of the pre-trained models. In addition, we can also control the diverse results to be generated by specifying the attack direction according to a reference text or image. Our work opens the door to applying adversarial attack to low-level vision tasks, and experiments on various conditional image generation tasks demonstrate the effectiveness and superiority of the proposed method.
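
A small sketch of the PGD-like perturbation idea on a frozen conditional generator. The objective below simply pushes each perturbed output away from the unperturbed one; the paper additionally supports steering the attack direction with a reference text or image, which is omitted here. Function and argument names are illustrative assumptions.

    import torch

    def diverse_outputs(generator, cond, n_samples=4, steps=10, eps=0.03, alpha=0.01):
        # generator: any pretrained deterministic image-to-image network (frozen).
        # cond: the conditional input tensor (e.g., a masked image for inpainting).
        generator.eval()
        with torch.no_grad():
            base = generator(cond)                          # the fixed, unperturbed result
        results = []
        for _ in range(n_samples):
            delta = torch.empty_like(cond).uniform_(-eps, eps)
            for _ in range(steps):
                delta.requires_grad_(True)
                out = generator(cond + delta)
                loss = -((out - base) ** 2).mean()          # descend = maximize distance to base
                grad, = torch.autograd.grad(loss, delta)
                delta = (delta - alpha * grad.sign()).clamp(-eps, eps).detach()
            with torch.no_grad():
                results.append(generator(cond + delta))
        return results

    # usage sketch: imgs = diverse_outputs(my_pretrained_inpainting_net, masked_input)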



Paperid:151
Authors:Marcos V. Conde, Javier Vazquez-Corral, Michael S. Brown, Radu Timofte
University of Würzburg, Computer Vision Center (CVC) Universitat Autònoma de Barcelona, York University, University of Würzburg
Abstract:
3D lookup tables (3D LUTs) are a key component for image enhancement. Modern image signal processors (ISPs) have dedicated support for these as part of the camera rendering pipeline. Cameras typically provide multiple options for picture styles, where each style is usually obtained by applying a unique handcrafted 3D LUT. Current approaches for learning and applying 3D LUTs are notably fast, yet not so memory-efficient, as storing multiple 3D LUTs is required. For this reason and other implementation limitations, their use on mobile devices is less popular. In this work, we propose a Neural Implicit LUT (NILUT), an implicitly defined continuous 3D color transformation parameterized by a neural network. We show that NILUTs are capable of accurately emulating real 3D LUTs. Moreover, a NILUT can be extended to incorporate multiple styles into a single network with the ability to blend styles implicitly. Our novel approach is memory-efficient, controllable and can complement previous methods, including learned ISPs. Code at https://github.com/mv-lab/nilut
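
A minimal sketch of the neural implicit LUT idea: a small coordinate MLP that maps an input RGB value to an output RGB value and is fitted to reproduce a target color transform. Layer sizes, the residual form, and the toy target transform are assumptions for illustration, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class NILUTSketch(nn.Module):
        # Implicit LUT: a tiny MLP over RGB coordinates in [0, 1].
        def __init__(self, hidden=256, layers=3):
            super().__init__()
            blocks, d = [], 3
            for _ in range(layers):
                blocks += [nn.Linear(d, hidden), nn.Tanh()]
                d = hidden
            blocks += [nn.Linear(d, 3)]
            self.net = nn.Sequential(*blocks)

        def forward(self, rgb):                             # rgb: (..., 3)
            return (rgb + self.net(rgb)).clamp(0, 1)        # residual color transform

    # fit the implicit LUT to a sampled hand-crafted transform (toy stand-in for a real 3D LUT)
    lut = NILUTSketch()
    opt = torch.optim.Adam(lut.parameters(), lr=1e-3)
    x = torch.rand(4096, 3)
    target = x ** 0.8                                       # pretend "picture style"
    for _ in range(200):
        loss = ((lut(x) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()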



Paperid:152
Authors:Cong Cong, Shiyu Xuan, Sidong Liu, Shiliang Zhang, Maurice Pagnucco, Yang Song
School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China, Australian Institute of Health Innovation, Macquarie University, Sydney, Australia, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Abstract:
When training on a long-tailed dataset, conventional learning algorithms tend to exhibit a bias towards classes with a larger sample size. Our investigation has revealed that this biased learning tendency originates from the model parameters, which are trained to disproportionately contribute to the classes characterised by their sample size (e.g., many, medium, and few classes). To balance the overall parameter contribution across all classes, we investigate the importance of each model parameter to the learning of different class groups, and propose a multi-stage parameter Decouple and Optimisation (DO) framework that decouples parameters into different groups with each group learning a specific portion of classes. To optimise the parameter learning, we apply different training objectives with a collaborative optimisation step to learn complementary information about each class group. Extensive experiments on long-tailed datasets, including CIFAR100, Places-LT, ImageNet-LT, and iNaturaList 2018, show that our framework achieves competitive performance compared to the state-of-the-art.



Paperid:153
Authors:Xiaofeng Cong, Jie Gui, Junming Hou
School of Cyber Science and Engineering, Southeast University, China, School of Cyber Science and Engineering, Southeast University, China Engineering Research Center of Blockchain Application, Supervision And Management, Southeast University, China Ministry of Education, Purple Mountain Laboratories, Nanjing, China, School of Information Science and Engineering, Southeast University, China
Abstract:
Due to the wavelength-dependent light attenuation and scattering, the color of the underwater organism usually appears distorted. The existing underwater image enhancement methods mainly focus on designing networks capable of generating enhanced underwater organisms with fixed color. Due to the complexity of the underwater environment, ground truth labels are difficult to obtain, which results in the nonexistence of perfect enhancement effects. Different from the existing methods, this paper proposes an algorithm with color enhancement and color fine-tuning (CECF) capabilities. The color enhancement behavior of CECF is the same as that of existing methods, aiming to restore the color of the distorted underwater organism. Beyond this general purpose, the color fine-tuning behavior of CECF can adjust the color of organisms in a controlled manner, which can generate enhanced organisms with diverse colors. To achieve this purpose, four processes are used in CECF. A supervised enhancement process learns the mapping from a distorted image to an enhanced image by the decomposition of color code. A self-reconstruction process and a cross-reconstruction process are used for content-invariant learning. A color fine-tuning process is designed based on the guidance for obtaining various enhanced results with different colors. Experimental results have proven the enhancement ability and color fine-tuning ability of the proposed CECF. The source code is provided in https://github.com/Xiaofeng-life/CECF.



Paperid:154
Authors:Mengyao Cui, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li
The University of Hong Kong Shanghai AI Laboratory, Shanghai AI Laboratory, Shanghai AI Laboratory, Shanghai AI Laboratory Northwestern Polytechnical University, Shanghai AI Laboratory Northwestern Polytechnical University
Abstract:
Single-exposure high dynamic range (HDR) imaging aims to reconstruct the wide-range intensities of a scene by using its single low dynamic range (LDR) image, thus providing significant efficiency. Existing methods pay high attention to restoring the luminance by inverting the tone-mapping process, while the color in the over-/under-exposed area cannot be well restored due to the information loss of the single LDR image. To address this issue, we introduce color events into the imaging pipeline, which record asynchronous pixel-wise color changes in a high dynamic range, enabling edge-like scene perception under challenging lighting conditions. Specifically, we propose a joint framework that incorporates color events and a single LDR image to restore both content and color of an HDR image, where an exposure-aware transformer (EaT) module is designed to propagate the informative hints, provided by the normal-exposed LDR regions and the event streams, to the missing areas. In this module, an exposure-aware mask is estimated to suppress distractive information and strengthen the restoration of the over-/under-exposed regions. To our knowledge, we are the first to use color events to enhance single-exposure HDR imaging. We also contribute corresponding datasets, consisting of synthesized datasets and a real-world dataset collected by a DAVIS346-color camera. The datasets can be found at https://www.kaggle.com/datasets/mengyaocui/ce-hdr. Extensive experiments demonstrate the effectiveness of the proposed method.



Paperid:155
Authors:Wenting Cui, Runzhao Yao, Shaoyi Du
Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
Fragment assembly involves restoring broken objects to their original geometries, and has many applications, such as archaeological restoration. Existing learning-based frameworks have shown potential for solving part assembly problems with semantic decomposition, but cannot handle such geometrical decomposition problems. In this work, we propose a novel assembly framework, the proxy-level hybrid Transformer, with the core idea of using a hybrid graph to model and reason about complex structural relationships between patches of fragments, dubbed proxies. To this end, we propose a hybrid attention module, composed of intra- and inter-attention layers, enabling the capture of crucial contextual information within fragments and relative structural knowledge across fragments. Furthermore, we propose an adjacency-aware hierarchical pose estimator, exploiting a decompose-and-integrate strategy. It progressively predicts adjacency probability and relative poses between fragments, and then implicitly infers their absolute poses by dynamic information integration. Extensive experimental results demonstrate that our method effectively reduces assembly errors while maintaining fast inference speed. The code is available at https://github.com/521piglet/PHFormer.



Paperid:156
Authors:Xiaohan Cui, Long Ma, Tengyu Ma, Jinyuan Liu, Xin Fan, Risheng Liu
Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology
Abstract:
Object detection in low-light scenarios has attracted much attention in the past few years. A mainstream and representative scheme introduces enhancers as the pre-processing for regular detectors. However, because of the disparity in task objectives between the enhancer and detector, this paradigm cannot perform at its best. In this work, we aim to unlock the potential of the enhancer + detector paradigm. Different from existing works, we extend the illumination-based enhancers (our newly designed or existing ones) as a scene decomposition module, whose removed illumination is exploited as an auxiliary in the detector for extracting detection-friendly features. A semantic aggregation module is further established for integrating multi-scale scene-related semantic information in the context space. In effect, our scheme successfully transforms the "trash" (i.e., the illumination ignored by the detector) into "treasure" for the detector. Extensive experiments demonstrate our superiority against other state-of-the-art methods. The code will be made public upon acceptance.



Paperid:157
Authors:Yuning Cui, Wenqi Ren, Alois Knoll
Technical University of Munich, Shenzhen Campus of Sun Yat-sen University, Technical University of Munich
Abstract:
Image restoration aims to reconstruct a high-quality image from a degraded low-quality observation. Recently, Transformer models have achieved promising performance on image restoration tasks due to their powerful ability to model long-range dependencies. However, the quadratically growing complexity with respect to the input size makes them inapplicable to practical applications. In this paper, we develop an efficient convolutional network for image restoration by enhancing multi-scale representation learning. To this end, we propose an omni-kernel module that consists of three branches, i.e., global, large, and local branches, to learn global-to-local feature representations efficiently. Specifically, the global branch achieves a global perceptive field via the dual-domain channel attention and frequency-gated mechanism. Furthermore, to provide multi-grained receptive fields, the large branch is formulated via different shapes of depth-wise convolutions with unusually large kernel sizes. Moreover, we complement local information using a point-wise depth-wise convolution. Finally, the proposed network, dubbed OKNet, is established by inserting the omni-kernel module into the bottleneck position for efficiency. Extensive experiments demonstrate that our network achieves state-of-the-art performance on 11 benchmark datasets for three representative image restoration tasks, including image dehazing, image desnowing, and image defocus deblurring. The code is available at https://github.com/c-yn/OKNet.
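
A rough PyTorch sketch of a three-branch global/large/local block along the lines described above (frequency-domain gating for the global branch, large strip depth-wise kernels for the large branch, and a point-wise depth-wise convolution for the local branch). Kernel sizes and the exact gating/attention forms are assumptions, not the released OKNet design.

    import torch
    import torch.nn as nn

    class OmniKernelSketch(nn.Module):
        def __init__(self, c=64, large_k=31):
            super().__init__()
            self.gate = nn.Parameter(torch.ones(1, c, 1, 1))              # frequency-domain gate
            self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())
            p = large_k // 2
            self.large_h = nn.Conv2d(c, c, (1, large_k), padding=(0, p), groups=c)
            self.large_v = nn.Conv2d(c, c, (large_k, 1), padding=(p, 0), groups=c)
            self.local = nn.Conv2d(c, c, 1, groups=c)                     # point-wise depth-wise conv

        def forward(self, x):
            f = torch.fft.rfft2(x, norm="ortho")
            global_b = torch.fft.irfft2(f * self.gate, s=x.shape[-2:], norm="ortho")
            global_b = global_b * self.ca(x)                              # channel attention
            large_b = self.large_h(x) + self.large_v(x)                   # large strip kernels
            return x + global_b + large_b + self.local(x)

    x = torch.randn(2, 64, 32, 32)
    y = OmniKernelSketch()(x)   # same shape as x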



Paperid:158
Authors:Ziteng Cui, Lin Gu, Xiao Sun, Xianzheng Ma, Yu Qiao, Tatsuya Harada
The University of Tokyo Shanghai AI Laboratory, RIKEN AIP The University of Tokyo, Shanghai AI Laboratory, University of Oxford, Shanghai AI Laboratory, The University of Tokyo RIKEN AIP
Abstract:
The standard Neural Radiance Fields (NeRF) paradigm employs a viewer-centered methodology, entangling the aspects of illumination and material reflectance into emission solely from 3D points. This simplified rendering approach presents challenges in accurately modeling images captured under adverse lighting conditions, such as low light or over-exposure. Motivated by the ancient Greek emission theory that posits visual perception as a result of rays emanating from the eyes, we slightly refine the conventional NeRF framework to train NeRF under challenging light conditions and generate normal-light novel views in an unsupervised manner. We introduce the concept of a "Concealing Field", which assigns transmittance values to the surrounding air to account for illumination effects. In dark scenarios, we assume that object emissions maintain a standard lighting level but are attenuated as they traverse the air during the rendering process. The Concealing Field thus compels NeRF to learn reasonable density and colour estimations for objects even in dimly lit situations. Similarly, the Concealing Field can mitigate over-exposed emissions during the rendering stage. Furthermore, we present a comprehensive multi-view dataset captured under challenging illumination conditions for evaluation. Our code and proposed dataset are available at https://github.com/cuiziteng/Aleth-NeRF.
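
A toy volume-rendering sketch of the "Concealing Field" idea: an extra per-sample attenuation factor assigned to the air multiplies the usual NeRF transmittance. The exact parameterization of the concealing values is an assumption for illustration.

    import torch

    def render_with_concealing(rgb, sigma, conceal, deltas):
        # rgb: (n_rays, n_samples, 3); sigma, conceal, deltas: (n_rays, n_samples)
        alpha = 1.0 - torch.exp(-sigma * deltas)                  # standard NeRF opacity
        trans = torch.cumprod(
            torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1
        )[:, :-1]                                                 # transmittance from object density
        conceal_t = torch.cumprod(
            torch.cat([torch.ones_like(conceal[:, :1]), conceal.clamp(0, 1) + 1e-10], dim=1), dim=1
        )[:, :-1]                                                 # extra attenuation by the "air"
        weights = trans * conceal_t * alpha
        return (weights.unsqueeze(-1) * rgb).sum(dim=1)           # composited pixel colour

    rgb = torch.rand(8, 64, 3); sigma = torch.rand(8, 64)
    conceal = torch.rand(8, 64); deltas = torch.full((8, 64), 0.05)
    pixel = render_with_concealing(rgb, sigma, conceal, deltas)   # (8, 3)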



Paperid:159
Authors:Qian Dai, Dong Wei, Hong Liu, Jinghan Sun, Liansheng Wang, Yefeng Zheng
School of informatics, Xiamen University, Xiamen, China, Jarvis Research Center, Tencent Youtu Lab / Tencent Healthcare (Shenzhen) Co., Ltd., Shenzhen, China, Jarvis Research Center, Tencent Youtu Lab / Tencent Healthcare (Shenzhen) Co., Ltd., Shenzhen, China School of Medicine, Xiamen University, Xiamen, China, Jarvis Research Center, Tencent Youtu Lab / Tencent Healthcare (Shenzhen) Co., Ltd., Shenzhen, China School of Medicine, Xiamen University, Xiamen, China, School of informatics, Xiamen University, Xiamen, China, Jarvis Research Center, Tencent Youtu Lab / Tencent Healthcare (Shenzhen) Co., Ltd., Shenzhen, China
Abstract:
Most existing federated learning (FL) methods for medical image analysis only considered intra-modal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, it is not uncommon that some FL participants only possess a subset of the complete imaging modalities, posing inter-modal heterogeneity as a challenge to effectively training a global model on all participants’ data. In addition, each participant would expect to obtain a personalized model tailored for its local data characteristics from the FL in such a scenario. In this work, we propose a new FL framework with federated modality-specific encoders and multimodal anchors (FedMEMA) to simultaneously address the two concurrent issues. Above all, FedMEMA employs an exclusive encoder for each modality to account for the inter-modal heterogeneity in the first place. In the meantime, while the encoders are shared by the participants, the decoders are personalized to meet individual needs. Specifically, a server with full-modal data employs a fusion decoder to aggregate and fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation reversely. Meanwhile, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the encoder parameters. On the other end, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up the information loss due to absent modalities while adapting the representations of present ones. FedMEMA is validated on the BraTS 2020 benchmark for multimodal brain tumor segmentation. Results show that it outperforms various up-to-date methods for multimodal and personalized FL and that its novel designs are effective. Our code is available.



Paperid:160
Authors:Songmin Dai, Yifan Wu, Xiaoqiang Li, Xiangyang Xue
Shanghai University, Shanghai University, Shanghai University, Fudan University
Abstract:
Recent unsupervised anomaly detection methods often rely on feature extractors pretrained with auxiliary datasets or on well-crafted anomaly-simulated samples. However, this might limit their adaptability to an increasing set of anomaly detection tasks due to the priors in the selection of auxiliary datasets or the strategy of anomaly simulation. To tackle this challenge, we first introduce a prior-less anomaly generation paradigm and subsequently develop an innovative unsupervised anomaly detection framework named GRAD, grounded in this paradigm. GRAD comprises three essential components: (1) a diffusion model (PatchDiff) to generate contrastive patterns by preserving the local structures while disregarding the global structures present in normal images, (2) a self-supervised reweighting mechanism to handle the challenge of long-tailed and unlabeled contrastive patterns generated by PatchDiff, and (3) a lightweight patch-level detector to efficiently distinguish the normal patterns and reweighted contrastive patterns. The generation results of PatchDiff effectively expose various types of anomaly patterns, e.g. structural and logical anomaly patterns. In addition, extensive experiments on both MVTec AD and MVTec LOCO datasets also support the aforementioned observation and demonstrate that GRAD achieves competitive anomaly detection accuracy and superior inference speed.



Paperid:161
Authors:Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Xiaojun Chang, Jingdong Wang
School of Computer Science and Technology, MOEKLINNS Laboratory, Xi'an Jiaotong University, School of Computer Science and Technology, MOEKLINNS Laboratory, Xi'an Jiaotong University, School of Computer Science and Technology, MOEKLINNS Laboratory, Xi'an Jiaotong University, SGIT AI Lab State Grid Corporation of China, University of Technology Sydney Mohamed bin Zayed University of Artificial Intelligence, Baidu
Abstract:
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet are automatically harvested for training. However, such data inevitably include mismatched pairs, i.e., noisy correspondences, undermining supervision reliability and degrading performance. Current methods leverage deep neural networks' memorization effect to address noisy correspondences, which overconfidently focus on similarity-guided training with hard negatives and suffer from self-reinforcing errors. In light of the above, we introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM). Specifically, by viewing sample matching as classification tasks within the batch, we generate classification logits for the given sample. Instead of a single similarity score, we refine sample filtration through energy uncertainty and estimate the model's sensitivity to selected clean samples using swapped classification entropy, in view of the overall prediction distribution. Additionally, we propose cross-modal biased complementary learning to leverage negative matches overlooked in hard-negative training, further improving model optimization stability and curbing self-reinforcing errors. Extensive experiments on challenging benchmarks affirm the efficacy and efficiency of SREM.
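For reference, the energy uncertainty mentioned above is commonly computed as a negative log-sum-exp of logits; the sketch below applies that standard definition to in-batch matching logits. The temperature values and the median-based filtration rule are assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

def in_batch_logits(img_emb, txt_emb, temperature: float = 0.07):
    """Treat matching each image against all captions in the batch as a classification task."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return img_emb @ txt_emb.t() / temperature        # (B, B) logits

def energy_score(logits, T: float = 1.0):
    # Standard energy score: lower energy -> more confident sample
    return -T * torch.logsumexp(logits / T, dim=-1)

logits = in_batch_logits(torch.randn(8, 256), torch.randn(8, 256))
energy = energy_score(logits)
keep = energy < energy.median()                       # assumed rule for selecting "clean" pairs
```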



Paperid:162
Authors:Duolikun Danier, Fan Zhang, David Bull
University of Bristol, University of Bristol, University of Bristol
Abstract:
Existing works on video frame interpolation (VFI) mostly employ deep neural networks that are trained by minimizing the L1, L2, or deep feature space distance (e.g., VGG loss) between their outputs and ground-truth frames. However, recent works have shown that these metrics are poor indicators of perceptual VFI quality. Towards developing perceptually-oriented VFI methods, in this work we propose a latent diffusion model-based VFI method, LDMVFI, which approaches VFI from a generative perspective by formulating it as a conditional generation problem. As the first effort to address VFI using latent diffusion models, we rigorously benchmark our method on common test sets used in the existing VFI literature. Our quantitative experiments and user study indicate that LDMVFI is able to interpolate video content with favorable perceptual quality compared to the state of the art, even in the high-resolution regime. Our code is available at https://github.com/danier97/LDMVFI.



Paperid:163
Authors:Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah
University of Central Florida, Adobe Research, University of Central Florida
Abstract:
Self-supervised approaches for video have shown impressive results in video understanding tasks. However, unlike early works that leverage temporal self-supervision, current state-of-the-art methods primarily rely on tasks from the image domain (e.g., contrastive learning) that do not explicitly promote the learning of temporal features. We identify two factors that limit existing temporal self-supervision: 1) tasks are too simple, resulting in saturated training performance, and 2) we uncover shortcuts based on local appearance statistics that hinder the learning of high-level features. To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts. Our model extends a representation of single video frames, pre-trained through contrastive learning, with a transformer that we train through temporal self-supervision. We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision. Our extensive experiments show state-of-the-art performance across 10 video understanding datasets, illustrating the generalization ability and robustness of our learned video representations. Project Page: https://daveishan.github.io/nms-webpage.



Paperid:164
Authors:Yongjian Deng, Hao Chen, Youfu Li
Beijing University of Technology Engineering Research Center of Intelligence Perception and Autonomous Control, Southeast University, City University of Hong Kong
Abstract:
Recent advances in event-based research prioritize sparsity and temporal precision. Approaches that learn sparse point-based representations through graph CNNs (GCNs) have become more popular. Yet, these graph techniques achieve lower performance than their frame-based counterparts due to two issues: (i) biased graph structures that do not properly incorporate the varied attributes (such as semantics, and spatial and temporal signals) of each vertex, resulting in inaccurate graph representations; (ii) a shortage of robust pretrained models. Here we solve the first problem by proposing a new event-based GCN (EDGCN), with a dynamic aggregation module to integrate all attributes of vertices adaptively. To address the second problem, we introduce a novel learning framework called cross-representation distillation (CRD), which leverages the dense representation of events as a cross-representation auxiliary to provide additional supervision and prior knowledge for the event graph. This frame-to-graph distillation allows us to benefit from the large-scale priors provided by CNNs while still retaining the advantages of graph-based models. Extensive experiments show our model and learning framework are effective and generalize well across multiple vision tasks.



Paperid:165
Authors:Yuxin Deng, Kaining Zhang, Shihua Zhang, Yansheng Li, Jiayi Ma
Wuhan University, Wuhan University, Wuhan University, Wuhan University, Wuhan University
Abstract:
Attention-based graph neural networks have made great progress in feature matching. However, the literature lacks a comprehensive understanding of how the attention mechanism operates for feature matching. In this paper, we rethink cross- and self-attention from the viewpoint of traditional feature matching and filtering. To facilitate the learning of matching and filtering, we incorporate the similarity of descriptors into cross-attention and relative positions into self-attention. In this way, the attention can concentrate on learning residual matching and filtering functions with reference to the basic functions of measuring visual and spatial correlation. Moreover, we leverage descriptor similarity and relative positions to extract inter- and intra-neighbors. Then sparse attention for each point can be performed only within its neighborhoods to acquire higher computation efficiency. Extensive experiments, including feature matching, pose estimation and visual localization, confirm the superiority of the proposed method. Our codes are available at https://github.com/ACuOoOoO/ResMatch.
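A rough sketch of how descriptor similarity could be injected into cross-attention as an additive bias, so that attention learns a residual on top of this matching prior. Function names and the bias weight alpha are assumptions; this is not the released ResMatch code.

```python
import torch
import torch.nn.functional as F

def biased_cross_attention(q, k, v, desc_a, desc_b, alpha: float = 1.0):
    # q:      (N, D) projected features of image A
    # k, v:   (M, D) projected features of image B
    # desc_a: (N, C), desc_b: (M, C) raw local descriptors
    sim = F.normalize(desc_a, dim=-1) @ F.normalize(desc_b, dim=-1).t()   # (N, M) similarity prior
    logits = q @ k.t() / q.shape[-1] ** 0.5 + alpha * sim                 # learned term + prior bias
    return torch.softmax(logits, dim=-1) @ v                              # (N, D) aggregated features

out = biased_cross_attention(torch.randn(100, 64), torch.randn(120, 64),
                             torch.randn(120, 64), torch.randn(100, 128),
                             torch.randn(120, 128))
```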



Paperid:166
Authors:Yuxin Deng, Jiayi Ma
Wuhan University, Wuhan University
Abstract:
Rescaling the backpropagated gradient of contrastive losses has led to significant progress in descriptor learning. However, current gradient modulation strategies have no regard for the varying distribution of global gradients, so they suffer when the training phase or dataset changes. In this paper, we propose a dynamic gradient modulation, named SDGMNet, for contrastive local descriptor learning. The core of our method is formulating modulation functions with dynamically estimated statistical characteristics. Firstly, we introduce the angle as the distance measure after an in-depth analysis of the backpropagation of the pairwise loss. On this basis, auto-focus modulation is employed to moderate the impact of statistically uncommon individual pairs in stochastic gradient descent optimization; a probabilistic margin cuts off the gradients of proportional triplets that have achieved enough optimization; and power adjustment balances the total weights of negative pairs and positive pairs. Extensive experiments demonstrate that our novel descriptor surpasses previous state-of-the-art methods in several tasks including patch verification, retrieval, pose estimation, and 3D reconstruction.



Paperid:167
Authors:Shanding Diao, Yuan Chen, Yang Zhao, Wei Jia, Zhao Zhang, Ronggang Wang
School of Computer and Information, Hefei University of Technology, Hefei 230009, China, School of Internet, Anhui University, Hefei 230039, China, School of Computer and Information, Hefei University of Technology, Hefei 230009, China Peng Cheng National Laboratory, Shenzhen 518000, China, School of Computer and Information, Hefei University of Technology, Hefei 230009, China, School of Computer and Information, Hefei University of Technology, Hefei 230009, China, Peng Cheng National Laboratory, Shenzhen 518000, China School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen 518055, China
Abstract:
With the rapid development of 3D movie and light-field displays, there is a growing demand for stereo videos. However, generating high-quality stereo videos from planar videos remains a challenging task. Traditional depth-image-based rendering techniques struggle to effectively handle the problem of occlusion exposure, which occurs when the occluded contents become visible in other views. Recently, the single-view multiplane images (MPI) representation has shown promising performance for planar video stereoscopy. However, the MPI still lacks real details that are occluded in the current frame, resulting in blurry artifacts in occlusion exposure regions. In fact, planar videos can leverage complementary information from adjacent frames to predict a more complete scene representation for the current frame. Therefore, this paper extends the MPI from still frames to the temporal domain, introducing the temporal MPI (TMPI). By extracting complementary information from adjacent frames based on optical flow guidance, obscured regions in the current frame can be effectively repaired. Additionally, a new module called masked optical flow warping (MOFW) is introduced to improve the propagation of pixels along optical flow trajectories. Experimental results demonstrate that the proposed method can generate high-quality stereoscopic or light-field videos from a single view and reproduce better occluded details than other state-of-the-art (SOTA) methods. The code is available at https://github.com/Dio3ding/TMPI



Paperid:168
Authors:Kun Ding, Haojian Zhang, Qiang Yu, Ying Wang, Shiming Xiang, Chunhong Pan
State Key Laboratory of Multimodal Artificial Intelligence Systems Institute of Automation, Chinese Academy of Sciences, Engineering Laboratory for Intelligent Industrial Vision Institute of Automation, Chinese Academy of Sciences, Research Center of Aerospace Information Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems Institute of Automation, Chinese Academy of Sciences, Research Center of Aerospace Information Institute of Automation, Chinese Academy of Sciences
Abstract:
We propose a generalized method for boosting the generalization ability of pre-trained vision-language models (VLMs) while fine-tuning on downstream few-shot tasks. The idea is realized by exploiting out-of-distribution (OOD) detection to predict whether a sample belongs to a base distribution or a novel distribution, and then using the score generated by a dedicated competition-based scoring function to fuse the zero-shot and few-shot classifiers. The fused classifier is dynamic: it biases towards the zero-shot classifier if a sample is more likely to come from the distribution the model was pre-trained on, leading to improved base-to-novel generalization ability. Our method operates only at the test stage, so it can boost existing methods without time-consuming re-training. Extensive experiments show that even weak distribution detectors can still improve VLMs' generalization ability. Specifically, with the help of OOD detectors, the harmonic means of CoOp and ProGrad increase by 2.6 and 1.5 percentage points, respectively, over 11 recognition datasets in the base-to-novel setting.
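The dynamic fusion described above can be illustrated as a score-gated combination of the two classifiers. Since the abstract does not define the competition-based scoring function, the sketch below uses a generic sigmoid-gated OOD score purely as a placeholder.

```python
import torch

def fuse_classifiers(zero_shot_logits, few_shot_logits, ood_score):
    # zero_shot_logits, few_shot_logits: (B, num_classes)
    # ood_score: (B,) higher -> sample more likely from the distribution the VLM was pre-trained on
    s = torch.sigmoid(ood_score).unsqueeze(-1)          # (B, 1) gate in [0, 1] (assumed form)
    # Weight the zero-shot classifier more for samples close to the pre-training distribution
    return s * zero_shot_logits + (1.0 - s) * few_shot_logits

fused = fuse_classifiers(torch.randn(4, 11), torch.randn(4, 11), torch.randn(4))
```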



Paperid:169
Authors:Pengxiang Ding, Qiongjie Cui, Haofan Wang, Min Zhang, Mengyuan Liu, Donglin Wang
MiLAB, Westlake University Zhejiang University, Nanjing University of Science and Technology Xiaohongshu Inc., Xiaohongshu Inc., MiLAB, Westlake University Zhejiang University, Shenzhen Graduate School, Peking University, Westlake University
Abstract:
Human motion forecasting, with the goal of estimating future human behavior over a period of time, is a fundamental task in many real-world applications. However, existing works typically concentrate on foretelling the major joints of the human body without considering the delicate movements of the human hands. In practical applications, hand gesture plays an important role in human communication with the real world, and expresses the primary intention of human beings. In this work, we are the first to formulate the whole-body human pose forecasting task, which jointly predicts future body and gesture activities. Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI) framework that aims to predict both coarse (body joints) and fine-grained (gestures) activities collaboratively, enabling expressive and cross-facilitated forecasting of 3D whole-body human motions. Specifically, our model involves two key constituents: cross-context alignment (XCA) and cross-context interaction (XCI). Considering the heterogeneous information within the whole body, XCA aims to align the latent features of various human components, while XCI focuses on effectively capturing the context interaction among the human components. We conduct extensive experiments on a newly-introduced large-scale benchmark and achieve state-of-the-art performance. The code is public for research purposes at https://github.com/Dingpx/EAI.



Paperid:170
Authors:Xinlong Ding, Jiansheng Chen, Hongwei Yu, Yu Shang, Yining Qin, Huimin Ma
University of Science and Technology Beijing, University of Science and Technology Beijing, University of Science and Technology Beijing, Tsinghua University, University of Science and Technology Beijing, University of Science and Technology Beijing
Abstract:
Transferable black-box adversarial attacks against classifiers by disturbing the intermediate-layer features have been extensively studied in recent years. However, these methods have not yet achieved satisfactory performance when directly applied to object detectors. This is largely because the features of detectors are fundamentally different from those of classifiers. In this study, we propose a simple but effective method to improve the transferability of adversarial examples for object detectors by leveraging the properties of spatial consistency and limited equivariance of object detectors’ features. Specifically, we combine a novel loss function and deliberately designed data augmentation to distort the backbone features of object detectors by suppressing significant features corresponding to objects and amplifying the surrounding vicinal features corresponding to object boundaries. As such, the target object and background area on the generated adversarial samples are more likely to be confused by other detectors. Extensive experimental results show that our proposed method achieves state-of-the-art black-box transferability for untargeted attacks on various models, including one/two-stage, CNN/Transformer-based, and anchor-free/anchor-based detectors.
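A hedged sketch of a feature-distortion objective of the kind described: push feature energy down inside object regions and up in the surrounding boundary region. The exact loss, masks, and weighting used in the paper are not given in the abstract and are assumed here.

```python
import torch

def feature_distortion_loss(feat, obj_mask, boundary_mask, lam: float = 1.0):
    # feat:          (B, C, H, W) backbone features of the adversarial image
    # obj_mask:      (B, 1, H, W) 1 inside object regions
    # boundary_mask: (B, 1, H, W) 1 in the vicinal region around object boundaries
    energy = feat.pow(2).mean(dim=1, keepdim=True)                      # per-location feature energy
    suppress = (energy * obj_mask).sum() / obj_mask.sum().clamp(min=1)  # energy on objects (to lower)
    amplify = (energy * boundary_mask).sum() / boundary_mask.sum().clamp(min=1)  # energy on boundaries (to raise)
    return suppress - lam * amplify   # minimized with respect to the adversarial perturbation

loss = feature_distortion_loss(torch.randn(1, 16, 32, 32),
                               torch.zeros(1, 1, 32, 32), torch.ones(1, 1, 32, 32))
```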



Paperid:171
Authors:Thang Doan, Xin Li, Sima Behpour, Wenbin He, Liang Gou, Liu Ren
Bosch Research North America & Bosch Center for Artificial Intelligence, Bosch Research North America & Bosch Center for Artificial Intelligence, Bosch Research North America & Bosch Center for Artificial Intelligence, Bosch Research North America & Bosch Center for Artificial Intelligence, Bosch Research North America & Bosch Center for Artificial Intelligence, Bosch Research North America & Bosch Center for Artificial Intelligence
Abstract:
Open World Object Detection (OWOD) is a challenging and realistic task that extends beyond the scope of the standard object detection task. It involves detecting both known and unknown objects while integrating learned knowledge for future tasks. However, the level of "unknownness" varies significantly depending on the context. For example, a tree is typically considered part of the background in a self-driving scene, but it may be significant in a household context. We argue that this contextual information should already be embedded within the known classes. In other words, there should be a semantic or latent structure relationship between the known and unknown items to be discovered. Motivated by this observation, we propose Hyp-OW, a method that learns and models a hierarchical representation of known items through a SuperClass Regularizer. Leveraging this representation allows us to effectively detect unknown objects using a similarity distance-based relabeling module. Extensive experiments on benchmark datasets demonstrate the effectiveness of Hyp-OW, achieving improvement in both known and unknown detection (up to 6 percent). These findings are particularly pronounced in our newly designed benchmark, where a strong hierarchical structure exists between known and unknown objects.



Paperid:172
Authors:Wen Dong, Haiyang Mei, Ziqi Wei, Ao Jin, Sen Qiu, Qiang Zhang, Xin Yang
Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology, Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology Show Lab, National University of Singapore, Institute of Automation, Chinese Academy of Sciences State Key Laboratory of Structural Analysis for Industrial Equipment, Dalian University of Technology, Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology, Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology, Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology, Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology
Abstract:
Car detection is an important task that serves as a crucial prerequisite for many automated driving functions. The large variations in lighting/weather conditions and vehicle densities of the scenes pose significant challenges to existing car detection algorithms in meeting the highly accurate perception demand for safety, due to the unstable/limited color information, which impedes the extraction of meaningful/discriminative features of cars. In this work, we present a novel learning-based car detection method that leverages trichromatic linear polarization as an additional cue to disambiguate such challenging cases. A key observation is that polarization, characteristic of the light wave, can robustly describe intrinsic physical properties of the scene objects in various imaging conditions and is strongly linked to the nature of materials for cars (e.g., metal and glass) and their surrounding environment (e.g., soil and trees), thereby providing reliable and discriminative features for robust car detection in challenging scenes. To exploit polarization cues, we first construct a pixel-aligned RGB-Polarization car detection dataset, which we subsequently employ to train a novel multimodal fusion network. Our car detection network dynamically integrates RGB and polarization features in a request-and-complement manner and can explore the intrinsic material properties of cars across all learning samples. We extensively validate our method and demonstrate that it outperforms state-of-the-art detection methods. Experimental results show that polarization is a powerful cue for car detection. Our code is available at https://github.com/wind1117/AAAI24-PCDNet.



Paperid:173
Authors:Wenqian Dong, Yang Xu, Jiahui Qu, Shaoxiong Hou
State Key Laboratory of Integrated Service Network, Xidian University, Xi'an 710071, China, State Key Laboratory of Integrated Service Network, Xidian University, Xi'an 710071, China, State Key Laboratory of Integrated Service Network, Xidian University, Xi'an 710071, China, State Key Laboratory of Integrated Service Network, Xidian University, Xi'an 710071, China
Abstract:
Hyperspectral image super-resolution (HSI-SR) is a technology to improve the spatial resolution of HSI. Existing fusion-based SR methods have shown great performance, but still have some problems as follows: 1) existing methods assume that the auxiliary image providing spatial information is strictly registered with the HSI, but images are difficult to register finely due to the shooting platforms, shooting viewpoints, and the influence of atmospheric turbulence; 2) most of the methods are based on convolutional neural networks (CNNs), which are effective for local features but cannot utilize global features. To this end, we propose a multi-modal cross-scale deformable transformer network (M2DTN) to achieve unregistered HSI-SR. Specifically, we formulate a spectrum-preserving based spatial-guided registration-SR unified model (SSRU) from the view of realistic degradation scenarios. According to SSRU, we propose a multi-modal registration deformable module (MMRD) to align features between different modalities by a deformation field. In order to efficiently utilize the unique information between different modalities, we design a multi-scale feature transformer (MSFT) to emphasize the spatial-spectral features at different scales. In addition, we propose the cross-scale feature aggregation module (CSFA) to accurately reconstruct the HSI by aggregating feature information at different scales. Experiments show that M2DTN outperforms state-of-the-art HSI-SR methods. Code is obtainable at https://github.com/Jiahuiqu/M2DTN.



Paperid:174
Authors:Yanchen Dong, Ruiqin Xiong, Jing Zhao, Jian Zhang, Xiaopeng Fan, Shuyuan Zhu, Tiejun Huang
Peking University, Peking University, National Computer Network Emergency Response Technical Team Peking University, Peking University, Harbin Institute of Technology, University of Electronic Science and Technology of China, Peking University
Abstract:
As a neuromorphic camera with high temporal resolution, the spike camera can capture dynamic scenes with high-speed motion. Recently, spike cameras with a color filter array (CFA) have been developed for color imaging. There are some methods for spike camera demosaicing to reconstruct color images from Bayer-pattern spike streams. However, the demosaicing results are degraded by severe noise in spike streams, which previous works pay little attention to. In this paper, we propose an iterative joint demosaicing and denoising network (SJDD-Net) for spike cameras based on the observation model. Firstly, we design a color spike representation (CSR) to learn latent representation from Bayer-pattern spike streams. In CSR, we propose an offset-sharing deformable convolution module to align temporal features of color channels. Then we develop a spike noise estimator (SNE) to obtain features of the noise distribution. Finally, a color correlation prior (CCP) module is proposed to utilize the color correlation for better details. For training and evaluation, we design a spike camera simulator to generate Bayer-pattern spike streams with synthesized noise. Besides, we captured some Bayer-pattern spike streams, building, to our knowledge, the first real-world captured dataset. Experimental results show that our method can restore clean images from Bayer-pattern spike streams. The source codes and dataset are available at https://github.com/csycdong/SJDD-Net.



Paperid:175
Authors:Yi Dong, Yuxi Wang, Ruoxi Fan, Wenqi Ouyang, Zhiqi Shen, Peiran Ren, Xuansong Xie
Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Alibaba Group, Nanyang Technological University, Alibaba Group, Alibaba Group
Abstract:
Digital image enhancement aims to deliver visually striking, pleasing images that align with human perception. While global techniques can elevate the image's overall aesthetics, fine-grained color enhancement can further boost visual appeal and expressiveness. However, colorists frequently face challenges in achieving accurate, localized color adjustments. Direct composition of these local edits can result in spatial color inconsistencies. Existing methods, including color style transfer and image harmonization, exhibit inconsistencies, especially at boundary regions. Addressing this, we present ChromaFusionNet (CFNet), a novel approach that views the color fusion problem through the lens of image color inpainting. Built on the Vision Transformer architecture, CFNet captures global context and delivers high-fidelity outputs, seamlessly blending colors while preserving boundary integrity. Empirical studies on ImageNet and COCO datasets demonstrate CFNet's superiority over existing methods in maintaining color harmony and color fidelity. Robustness evaluations and user studies have further validated the effectiveness of CFNet. In conclusion, CFNet introduces an innovative approach to seamless, fine-grained color fusion, paving the way for advancements in the domain of fine-grained color editing. Code and pretrained models are available at our project page: https://yidong.pro/projects/cfnet.



Paperid:176
Authors:Yilan Dong, Chunlin Yu, Ruiyang Ha, Ye Shi, Yuexin Ma, Lan Xu, Yanwei Fu, Jingya Wang
ShanghaiTech University, ShanghaiTech University, ShanghaiTech University, ShanghaiTech University, ShanghaiTech University, ShanghaiTech University, Fudan University, ShanghaiTech University
Abstract:
Existing gait recognition benchmarks mostly include minor clothing variations in the laboratory environments, but lack persistent changes in appearance over time and space. In this paper, we propose the first in-the-wild benchmark, CCGait, for cloth-changing gait recognition, which incorporates diverse clothing changes, indoor and outdoor scenes, and multi-modal statistics over 92 days. To further address the coupling effect of clothing and viewpoint variations, we propose a hybrid approach HybridGait that exploits both temporal dynamics and the projected 2D information of 3D human meshes. Specifically, we introduce a Canonical Alignment Spatial-Temporal Transformer (CA-STT) module to encode human joint position-aware features, and fully exploit 3D dense priors via a Silhouette-guided Deformation with 3D-2D Appearance Projection (SilD) strategy. Our contributions are twofold: we provide a challenging benchmark CCGait that captures realistic appearance changes over expanded time and space, and we propose a hybrid framework HybridGait that outperforms prior works on CCGait and Gait3D benchmarks. Our project page is available at https://github.com/HCVLab/HybridGait.



Paperid:177
Authors:Yue-Jiang Dong, Yuan-Chen Guo, Ying-Tian Liu, Fang-Lue Zhang, Song-Hai Zhang
Tsinghua University, Tsinghua University, Tsinghua University, Victoria University of Wellington, Tsinghua University
Abstract:
Self-supervised monocular depth estimation is of significant importance with applications spanning across autonomous driving and robotics. However, the reliance on self-supervision introduces a strong static-scene assumption, thereby posing challenges in achieving optimal performance in dynamic scenes, which are prevalent in most real-world situations. To address these issues, we propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to transfer a pre-trained image model for self-supervised depth estimation. The training comprises two sequential stages: an initial phase trained on a dataset primarily composed of static scenes, succeeded by an expansion to more intricate datasets involving dynamic scenes. To facilitate this process, we design compact encoder and decoder adapters to enable parameter-efficient tuning, allowing the network to adapt effectively. They not only uphold generalized patterns from pre-trained image models but also carry knowledge gained in the preceding phase over to the subsequent one. Extensive experiments demonstrate that PPEA-Depth achieves state-of-the-art performance on the KITTI, CityScapes and DDAD datasets.
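As a reference for what a compact adapter for parameter-efficient tuning typically looks like, here is a minimal bottleneck adapter sketch. The actual PPEA-Depth adapter architecture is not specified in the abstract, so the sizes and the zero-initialized residual form below are assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: only this small module is trained while the backbone stays frozen."""
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as identity so pre-trained behavior is preserved
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

out = Adapter(dim=256)(torch.randn(2, 196, 256))
```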



Paperid:178
Authors:Chenghu Du, Junyin Wang, Yi Rong, Shuqing Liu, Kai Liu, Shengwu Xiong
Wuhan University of Technology, Wuhan University of Technology, Wuhan University of Technology Sanya Science and Education Innovation Park, Wuhan University of Technology, Wuhan Textile University, Wuhan University of Technology, Wuhan University of Technology Shanghai AI Laboratory Sanya Science and Education Innovation Park, Wuhan University of Technology Qiongtai Normal University
Abstract:
Image-based virtual try-on aims to transfer a target clothing item onto a specific person. A significant challenge is that arbitrarily matched clothing and persons lack corresponding ground truth for supervised learning. A recent pioneering work leveraged an improved cycleGAN to enable one network to generate the desired image for another network during training. However, there is no difference in the result distribution before and after the clothing changes. Therefore, using two different networks is unnecessary and may even increase the difficulty of convergence. Furthermore, the human parsing introduced to provide body structure information in the input also has a negative impact on the try-on result. How can a single network be employed for supervised learning while eliminating human parsing? To tackle these issues, we present a Cycle mapping Virtual Try-On Network (CycleVTON), which can produce photo-realistic try-on results by using a cycle mapping framework without the parser. In particular, we introduce a flow constraint loss to achieve supervised learning of arbitrarily matched clothing and person as inputs to the deformer, thus naturally mimicking the interaction between clothing and the human body. Additionally, we design a skin generation strategy that can adapt to the shape of the target clothing by dynamically adjusting the skin region, i.e., by first removing and then filling skin areas. Extensive experiments conducted on challenging benchmarks demonstrate that our proposed method exhibits superior performance compared to state-of-the-art methods.



Paperid:179
Authors:Hang Du, Xuejun Yan, Jingjing Wang, Di Xie, Shiliang Pu
Hikvision Research Institute, Zhejiang Key Laboratory of Social Security Big Data, Hikvision Research Institute, Zhejiang Key Laboratory of Social Security Big Data, Hikvision Research Institute, Zhejiang Key Laboratory of Social Security Big Data, Hikvision Research Institute, Zhejiang Key Laboratory of Social Security Big Data, Hikvision Research Institute, Zhejiang Key Laboratory of Social Security Big Data
Abstract:
Recently, arbitrary-scale point cloud upsampling has become increasingly popular due to its efficiency and convenience for practical applications. To achieve this, most previous approaches formulate it as a problem of surface approximation and employ point-based networks to learn surface representations. However, learning surfaces from sparse point clouds is more challenging, and thus they often suffer from low-fidelity geometry approximation. To address this, we propose an arbitrary-scale Point cloud Upsampling framework using a Voxel-based Network (PU-VoxelNet). Thanks to the completeness and regularity inherited from the voxel representation, voxel-based networks are capable of providing a predefined grid space to approximate 3D surfaces, and an arbitrary number of points can be reconstructed according to the predicted density distribution within each grid cell. However, we find that grid sampling can be inaccurate owing to imprecise density predictions. To address this issue, a density-guided grid resampling method is developed to generate high-fidelity points while effectively avoiding sampling outliers. Further, to improve the fine-grained details, we present an auxiliary training supervision to enforce the latent geometric consistency among local surface patches. Extensive experiments indicate the proposed approach outperforms the state-of-the-art approaches not only in terms of fixed upsampling rates but also for arbitrary-scale upsampling. The code is available at https://github.com/hikvision-research/3DVision



Paperid:180
Authors:Zhenjiang Du, Jiale Dou, Zhitao Liu, Jiwei Wei, Guan Wang, Ning Xie, Yang Yang
University of Electronic Science and Technology of China, Chengdu, China, Yibin Park, University of Electronic Science and Technology of China, Yibin, China, University of Electronic Science and Technology of China, Chengdu, China, University of Electronic Science and Technology of China, Chengdu, China, Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China, University of Electronic Science and Technology of China, Chengdu, China, University of Electronic Science and Technology of China, Chengdu, China
Abstract:
Point cloud completion aims at completing shapes from their partial observations. Most existing methods utilize shape priors for point cloud completion, such as feeding the partial shape into an encoder-decoder deep learning structure to obtain the complete one. However, information is easily lost during the generation process because the missing areas are invisible. Unlike most existing methods that directly infer the missing points using shape priors, we address completion as a cross-modality task. We propose a new Cross-modal Dual Phases Network (CDPNet) for shape completion. Our key idea is that the global information of the shape is obtained from an extra single-view image, while the partial point cloud provides the geometric information. The multi-modal features then jointly guide the generation of specific structural information. To learn the geometric details of the shape, we use patches to preserve local geometric features. In this way, we can generate shapes with enough geometric details. Experimental results show that our method achieves state-of-the-art performance on point cloud completion.



Paperid:181
Authors:Xiaoyue Duan, Shuhao Cui, Guoliang Kang, Baochang Zhang, Zhengcong Fei, Mingyuan Fan, Junshi Huang
School of Automation Science and Electrical Engineering, Beihang University, China Meituan, Meituan, School of Automation Science and Electrical Engineering, Beihang University, China Zhongguancun Laboratory, Beijing, China, School of Automation Science and Electrical Engineering, Beihang University, China Zhongguancun Laboratory, Beijing, China Hangzhou Research Institute, Beihang University, China Nanchang Institute of Technology, Nanchang, China, Meituan, Meituan, Meituan
Abstract:
Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) to the main objects in the input image without changing their identity or attributes. To guarantee consistent attributes, some existing methods fine-tune the entire model or the textual embedding for structural consistency, but they are time-consuming and fail to perform non-rigid edits. Other works are tuning-free, but their performances are weakened by the quality of Denoising Diffusion Implicit Model (DDIM) reconstruction, which often fails in real-world scenarios. In this paper, we present a novel approach called Tuning-free Inversion-enhanced Control (TIC), which directly correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction. Specifically, our method effectively obtains inversion features from the key and value features in the self-attention layers, and enhances the sampling process by these inversion features, thus achieving accurate reconstruction and content-consistent editing. To extend the applicability of our method to general editing scenarios, we also propose a mask-guided attention concatenation strategy that combines contents from both the inversion and the naive DDIM editing processes. Experiments show that the proposed method outperforms previous works in reconstruction and consistent editing, and produces impressive results in various settings.



Paperid:182
Authors:Yuxuan Duan, Li Niu, Yan Hong, Liqing Zhang
MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Tiansuan Lab, Ant Group, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Abstract:
In few-shot image generation, directly training GAN models on just a handful of images faces the risk of overfitting. A popular solution is to transfer the models pretrained on large source domains to small target ones. In this work, we introduce WeditGAN, which realizes model transfer by editing the intermediate latent codes w in StyleGANs with learned constant offsets (delta w), discovering and constructing target latent spaces via simply relocating the distribution of source latent spaces. The established one-to-one mapping between latent spaces naturally prevents mode collapse and overfitting. Besides, we also propose variants of WeditGAN to further enhance the relocation process by regularizing the direction or finetuning the intensity of delta w. Experiments on a collection of widely used source/target datasets manifest the capability of WeditGAN in generating realistic and diverse images, which is simple yet highly effective in the research area of few-shot image generation. Codes are available at https://github.com/Ldhlwh/WeditGAN.
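The latent relocation idea can be sketched in a few lines: a constant, learnable offset delta w is added to source latent codes while the generator stays frozen. Shapes follow the usual StyleGAN w+ convention; this is an illustration rather than the released WeditGAN code.

```python
import torch
import torch.nn as nn

class LatentOffset(nn.Module):
    """Learned constant offset delta_w so that w_target = w_source + delta_w."""
    def __init__(self, num_layers: int = 14, w_dim: int = 512):
        super().__init__()
        self.delta_w = nn.Parameter(torch.zeros(1, num_layers, w_dim))  # the only trainable parameters

    def forward(self, w_source):
        # w_source: (B, num_layers, w_dim) latent codes sampled from the source domain
        return w_source + self.delta_w   # relocated codes are fed to the frozen synthesis network

w_target = LatentOffset()(torch.randn(4, 14, 512))
```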



Paperid:183
Authors:Chao Fan, Jingzhe Ma, Dongyang Jin, Chuanfu Shen, Shiqi Yu
Research Institute of Trustworthy Autonomous System, Southern University of Science and Technology Department of Computer Science and Engineering, Southern University of Science and Technology, Research Institute of Trustworthy Autonomous System, Southern University of Science and Technology Department of Computer Science and Engineering, Southern University of Science and Technology, Research Institute of Trustworthy Autonomous System, Southern University of Science and Technology Department of Computer Science and Engineering, Southern University of Science and Technology, Research Institute of Trustworthy Autonomous System, Southern University of Science and Technology Department of Computer Science and Engineering, Southern University of Science and Technology The University of Hong Kong, Research Institute of Trustworthy Autonomous System, Southern University of Science and Technology Department of Computer Science and Engineering, Southern University of Science and Technology
Abstract:
The choice of the representations is essential for deep gait recognition methods. The binary silhouettes and skeletal coordinates are two dominant representations in recent literature, achieving remarkable advances in many scenarios. However, inherent challenges remain, in which silhouettes are not always guaranteed in unconstrained scenes, and structural cues have not been fully utilized from skeletons. In this paper, we introduce a novel skeletal gait representation named skeleton map, together with SkeletonGait, a skeleton-based method to exploit structural information from human skeleton maps. Specifically, the skeleton map represents the coordinates of human joints as a heatmap with Gaussian approximation, exhibiting a silhouette-like image devoid of exact body structure. Beyond achieving state-of-the-art performances over five popular gait datasets, more importantly, SkeletonGait uncovers novel insights about how important structural features are in describing gait and when they play a role. Furthermore, we propose a multi-branch architecture, named SkeletonGait++, to make use of complementary features from both skeletons and silhouettes. Experiments indicate that SkeletonGait++ outperforms existing state-of-the-art methods by a significant margin in various scenarios. For instance, it achieves an impressive rank-1 accuracy of over 85% on the challenging GREW dataset. The source code is available at https://github.com/ShiqiYu/OpenGait.
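A small sketch of what a skeleton map in this spirit could look like: joint coordinates rendered as Gaussian blobs in a single heatmap, giving a silhouette-like image without exact body-shape detail. The resolution, sigma, and merging of all joints into one channel are assumptions.

```python
import numpy as np

def skeleton_map(joints_xy, size: int = 64, sigma: float = 2.0):
    # joints_xy: (J, 2) joint coordinates already scaled to [0, size)
    ys, xs = np.mgrid[0:size, 0:size]
    heatmap = np.zeros((size, size), dtype=np.float32)
    for x, y in joints_xy:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)   # overlay joints by taking the max response
    return heatmap

hm = skeleton_map(np.array([[20.0, 10.0], [32.0, 32.0], [40.0, 55.0]]))
```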



Paperid:184
Authors:Yang Fan, Xiangping Wu, Qingcai Chen, Heng Li, Yan Huang, Zhixiang Cai, Qitian Wu
Harbin Institute of Technology (Shenzhen), Harbin Institute of Technology (Shenzhen) The Hong Kong Polytechnic University, Harbin Institute of Technology (Shenzhen) Peng Cheng Laboratory, Harbin Institute of Technology (Shenzhen), China Mobile Information Technology Co.,Ltd, China Mobile Information Technology Co.,Ltd, China Mobile Information Technology Co.,Ltd
Abstract:
The diversity of tables makes table detection a great challenge, leading to existing models becoming more tedious and complex. Despite achieving high performance, they often overfit to the table style in the training set, and suffer from significant performance degradation when encountering out-of-distribution tables in other domains. To tackle this problem, we start from the essence of the table, which is a set of text arranged in rows and columns. Based on this, we propose a novel, lightweight and robust Table Detection method based on Learning Text Arrangement, namely TDeLTA. TDeLTA takes the text blocks as input, and then models their arrangement with a sequential encoder and an attention module. To locate the tables precisely, we design a text-classification task, classifying the text blocks into 4 categories according to their semantic roles in the tables. Experiments are conducted on text blocks parsed from PDFs as well as those extracted by open-source OCR tools. Compared to several state-of-the-art methods, TDeLTA achieves competitive results with only 3.1M model parameters on the large-scale public datasets. Moreover, when faced with cross-domain data under the 0-shot setting, TDeLTA outperforms baselines by a large margin of nearly 7%, which shows the strong robustness and transferability of the proposed model.



Paperid:185
Authors:Yeying Fan, Guangshun Wei, Chen Wang, Shaojie Zhuang, Wenping Wang, Yuanfeng Zhou
School of Software, Shandong University, China, Department of Computer Science, The University of Hong Kong, China, School of Software, Shandong University, China, School of Software, Shandong University, China, Department of Computer Science, The University of Hong Kong, China Texas A&M University, USA, School of Software, Shandong University, China
Abstract:
Tooth motion generation is an essential task in digital orthodontic treatment for precise and quick dental healthcare, which aims to generate the whole intermediate tooth motion process given the initial pathological and target ideal tooth alignments. Most prior works for multi-agent motion planning problems result in complex solutions. Moreover, the occlusal relationship between upper and lower teeth is often overlooked. In this paper, we propose a collaborative tooth motion diffusion model. The critical insight is to remodel the problem as a diffusion process. In this sense, we model the whole tooth motion distribution with a diffusion model and transform the planning problem into a sampling process from this distribution. We design a tooth latent representation to provide accurate conditional guides consisting of two key components: the tooth frame represents the position and posture, and the tooth latent shape code represents the geometric morphology. Subsequently, we present a collaborative diffusion model to learn the multi-tooth motion distribution based on inter-tooth and occlusal constraints, which are implemented by a graph structure and new loss functions, respectively. Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in the application of orthodontics compared with state-of-the-art methods.



Paperid:186
Authors:Zhaoxin Fan, Longbin Ji, Pengxin Xu, Fan Shen, Kai Chen
Psyche AI Inc, Xi'an Jiaotong Liverpool University, Psyche AI Inc, Psyche AI Inc, HKUST
Abstract:
In the dynamic field of film and game development, the emergence of human motion synthesis methods has revolutionized avatar animation. Traditional methodologies, typically reliant on single modality inputs like text or audio, employ modality-specific model frameworks, posing challenges for unified model deployment and application. To address this, we propose Everything2Motion, a unified model framework. Everything2Motion consists of three key modules. The Input-Output Modality Modulation module tailors structures for specific multimodal inputs, eliminating the need for modality-specific frameworks. The Query-aware Autoencoder, based on the transformer encoder-decoder architecture, enables efficient latent motion generation. Lastly, the Prior Motion Distillation Decoder, a pretrained module, enhances the final skeleton sequence's naturalness and fluidity. Comprehensive experiments on several public datasets demonstrate the effectiveness of Everything2Motion, highlighting its potential for practical applications and setting a new benchmark in human motion synthesis.



Paperid:187
Authors:Chaowei Fang, Ziyin Zhou, Junye Chen, Hanjing Su, Qingyao Wu, Guanbin Li
Xidian University, Xidian University, Sun Yat-sen University, Tencent, South China University of Technology, Sun Yat-sen University GuangDong Province Key Laboratory of Information Security Technology
Abstract:
Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing. However, fully extracting the target mask with limited user inputs remains challenging. We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement, to enhance segmentation quality with fewer user inputs. Regarding the last segmentation result as the initial mask, an iterative refinement process is commonly employed to continually enhance the initial mask. Nevertheless, conventional techniques suffer from sensitivity to the variance in the initial mask. To circumvent this problem, our proposed method incorporates a mask matching algorithm for ensuring consistent inferences from different types of initial masks. We also introduce a target-aware zooming algorithm to preserve object information during downsampling, balancing efficiency and accuracy. Experiments on GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation.



Paperid:188
Authors:Qihang Fang, Yafei Song, Keqiang Li, Li Shen, Huaiyu Wu, Gang Xiong, Liefeng Bo
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Alibaba Group, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Alibaba Group, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Alibaba Group
Abstract:
A radiance field is an effective representation of 3D scenes, which has been widely adopted in novel-view synthesis and 3D reconstruction. It is still an open and challenging problem to evaluate the geometry, i.e., the density field, as the ground-truth is almost impossible to obtain. One alternative indirect solution is to transform the density field into a point-cloud and compute its Chamfer Distance with the scanned ground-truth. However, many widely-used datasets have no point-cloud ground-truth since the scanning process along with the equipment is expensive and complicated. To this end, we propose a novel metric, named Inverse Mean Residual Color (IMRC), which can evaluate the geometry only with the observation images. Our key insight is that the better the geometry, the lower-frequency the computed color field. From this insight, given a reconstructed density field and observation images, we design a closed-form method to approximate the color field with low-frequency spherical harmonics, and compute the inverse mean residual color. Then the higher the IMRC, the better the geometry. Qualitative and quantitative experimental results verify the effectiveness of our proposed IMRC metric. We also benchmark several state-of-the-art methods using IMRC to promote future related research. Our code is available at https://github.com/qihangGH/IMRC.
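To make the metric's core idea concrete, the sketch below fits a low-degree spherical-harmonics approximation to each point's observed colors by least squares and reports the inverse of the mean residual. The actual IMRC computation (how points and weights are derived from the density field, the SH degree, and normalization) follows the paper, not this simplified sketch.

```python
import numpy as np

def sh_basis_deg1(dirs):
    # dirs: (V, 3) unit viewing directions; returns the 4 real SH basis values up to degree 1
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    return np.stack([np.full_like(x, 0.282095),
                     0.488603 * y, 0.488603 * z, 0.488603 * x], axis=-1)   # (V, 4)

def mean_residual_color(colors, dirs):
    # colors: (V, 3) colors observed for one surface point from V views
    basis = sh_basis_deg1(dirs)                                  # (V, 4)
    coeffs, *_ = np.linalg.lstsq(basis, colors, rcond=None)      # low-frequency least-squares fit
    residual = np.abs(colors - basis @ coeffs)
    return residual.mean()

def imrc(per_point_colors, per_point_dirs):
    # Inverse of the average residual color over all points (higher -> better geometry)
    res = np.mean([mean_residual_color(c, d)
                   for c, d in zip(per_point_colors, per_point_dirs)])
    return 1.0 / max(res, 1e-8)
```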



Paperid:189
Authors:Ruohuan Fang, Guansong Pang, Xiao Bai
School of Computer Science and Engineering, Beihang University, School of Computing and Information Systems, Singapore Management University, School of Computer Science and Engineering, Beihang University State Key Laboratory of Software Development Environment, Jiangxi Research Institute, Beihang University
Abstract:
Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set of base categories on which the detection model is trained. Recent OVOD methods focus on adapting the image-level pre-trained vision-language models (VLMs), such as CLIP, to a region-level object detection task via, e.g., region-level knowledge distillation, regional prompt learning, or region-text pre-training, to expand the detection vocabulary. These methods have demonstrated remarkable performance in recognizing regional visual concepts, but they are weak in exploiting the VLMs' powerful global scene understanding ability learned from the billion-scale image-level text descriptions. This limits their capability in detecting hard objects of small, blurred, or occluded appearance from novel/base categories, whose detection heavily relies on contextual information. To address this, we propose a novel approach, namely Simple Image-level Classification for Context-Aware Detection Scoring (SIC-CADS), to leverage the superior global knowledge yielded from CLIP for complementing the current OVOD models from a global perspective. The core of SIC-CADS is a multi-modal multi-label recognition (MLR) module that learns the object co-occurrence-based contextual information from CLIP to recognize all possible object categories in the scene. These image-level MLR scores can then be utilized to refine the instance-level detection scores of the current OVOD models in detecting those hard objects. This is verified by extensive empirical results on two popular benchmarks, OV-LVIS and OV-COCO, which show that SIC-CADS achieves significant and consistent improvement when combined with different types of OVOD models. Further, SIC-CADS also improves the cross-dataset generalization ability on Objects365 and OpenImages. Code is available at https://github.com/mala-lab/SIC-CADS.



Paperid:190
Authors:Shaoheng Fang, Zuhong Liu, Mingyu Wang, Chenxin Xu, Yiqi Zhong, Siheng Chen
Shanghai Jiao Tong University, Shanghai JIaoTong University, University of Chinese Academy of Sciences, Shanghai Jiao Tong University, University of Southern California, Shanghai Jiao Tong University Shanghai AI Laboratory
Abstract:
Learning the dense bird's eye view (BEV) motion flow in a self-supervised manner is an emerging research topic for robotics and autonomous driving. Current self-supervised methods mainly rely on point correspondences between point clouds, which may introduce the problems of fake flow and inconsistency, hindering the model’s ability to learn accurate and realistic motion. In this paper, we introduce a novel cross-modality self-supervised training framework that effectively addresses these issues by leveraging multi-modality data to obtain supervision signals. We design three innovative supervision signals to preserve the inherent properties of scene motion, including the masked Chamfer distance loss, the piecewise rigidity loss, and the temporal consistency loss. Through extensive experiments, we demonstrate that our proposed self-supervised framework outperforms all previous self-supervision methods for the motion prediction task.
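As an illustration of the first supervision signal, here is a sketch of a masked Chamfer distance between consecutive point clouds, where the predicted flow warps the earlier cloud and a mask selects reliable points. The exact masking rule and the rigidity and temporal-consistency terms are not reproduced here.

```python
import torch

def masked_chamfer(pred_flow, pts_t, pts_t1, mask):
    # pred_flow: (N, d) predicted motion for points at time t
    # pts_t:     (N, d) points at time t;  pts_t1: (M, d) points at time t+1
    # mask:      (N,)   1 for points whose correspondence is considered reliable
    warped = pts_t + pred_flow
    d = torch.cdist(warped, pts_t1)                       # (N, M) pairwise distances
    fwd = d.min(dim=1).values                             # warped -> next cloud
    bwd = d.min(dim=0).values                             # next cloud -> warped
    fwd = (fwd * mask).sum() / mask.sum().clamp(min=1)    # only masked points contribute forward
    return fwd + bwd.mean()

loss = masked_chamfer(torch.randn(100, 3), torch.randn(100, 3),
                      torch.randn(120, 3), torch.ones(100))
```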



Paperid:191
Authors:Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, Renfu Li
Huazhong University of Science and Technology, Peking University, Henan University Huazhong University of Science and Technology, Huazhong University of Science and Technology, Dalian University of Technology, Sichuan University, Shenzhen Univeristy, Huazhong University of Science and Technology
Abstract:
Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate a target query-relevant moment. Since the untrimmed video is overlong, almost all existing VMR methods first sparsely down-sample each untrimmed video into multiple fixed-length video clips and then conduct multi-modal interactions with the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours. Since the video is downsampled into fixed-length clips, some query-related frames may be filtered out, which blurs the specific boundary of the target moment and takes adjacent irrelevant frames as new boundaries, easily leading to cross-modal misalignment and introducing both boundary bias and reasoning bias. To this end, in this paper, we propose an efficient approach, SpotVMR, to trim the query-relevant clip. Besides, our proposed SpotVMR can serve as a plug-and-play module, which achieves efficiency for state-of-the-art VMR methods while maintaining good retrieval performance. Especially, we first design a novel clip search model that learns to identify promising video regions to search conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features to capture the context of objects and interactions that suggest where to search for the query-relevant moment. Also, a distillation loss is utilized to address the optimization issues arising from end-to-end joint training of the clip selector and VMR model. Extensive experiments on three challenging datasets demonstrate its effectiveness.



Paperid:192
Authors:Zhixue Fang, Xinrong Guo, Jingyin Lin, Huisi Wu, Jing Qin
Shenzhen University, Shenzhen University, Shenzhen University, Shenzhen University, The Hong Kong Polytechnic University
Abstract:
Automatic polyp segmentation from colonoscopy videos is a critical task for the development of computer-aided screening and diagnosis systems. However, accurate and real-time video polyp segmentation (VPS) is a very challenging task due to the low contrast between background and polyps and frame-to-frame dramatic variations in colonoscopy videos. We propose a novel embedding-unleashing framework consisting of a proposal-generative network (PGN) and an appearance-embedding network (AEN) to comprehensively address these challenges. Our framework, for the first time, models VPS as an appearance-level semantic embedding process to facilitate generating more global information to counteract background disturbances and dramatic variations. Specifically, PGN is a video segmentation network to obtain segmentation mask proposals, while AEN is a network we specially designed to produce appearance-level embedding semantics for PGN, thereby unleashing the capability of PGN in VPS. Our AEN consists of a cross-scale region linking (CRL) module and a cross-wise scale alignment (CSA) module. The former screens reliable background information against background disturbances by constructing links between region semantics, while the latter performs scale alignment to resist dramatic variations by modeling the center-perceived motion dependence in a cross-wise manner. We further introduce a parameter-free semantic interaction to embed the semantics of AEN into PGN to obtain the segmentation results. Extensive experiments on CVC-612 and SUN-SEG demonstrate that our approach achieves better performance than other state-of-the-art methods. Codes are available at https://github.com/zhixue-fang/EUVPS.



Paperid:193
Authors:Juexiao Feng, Yuhong Yang, Yanchun Xie, Yaqian Li, Yandong Guo, Yuchen Guo, Yuwei He, Liuyu Xiang, Guiguang Ding
Tsinghua University BNRist Hangzhou Zhuoxi Institute of Brain and Intelligence, Tsinghua University BNRist Hangzhou Zhuoxi Institute of Brain and Intelligence, OPPO Research Institute, OPPO Research Institute, OPPO Research Institute, Tsinghua University BNRist, Tsinghua University BNRist, Beijing University of Posts and Telecommunications, Tsinghua University BNRist
Abstract:
In recent years, object detection with deep learning has experienced rapid development. However, most existing object detection models perform well only on closed-set datasets, ignoring a large number of potential objects whose categories are not defined in the training set. These objects are often identified as background or incorrectly classified as pre-defined categories by the detectors. In this paper, we focus on the challenging problem of Novel Class Discovery and Localization (NCDL), aiming to train detectors that can detect the categories present in the training data, while also actively discovering, localizing, and clustering new categories. We analyze existing NCDL methods and identify the core issue: object detectors tend to be biased towards seen objects, which leads to the neglect of unseen targets. To address this issue, we first propose a Debiased Region Mining (DRM) approach that combines a class-agnostic Region Proposal Network (RPN) and a class-aware RPN in a complementary manner. Additionally, we propose to improve the representation network through semi-supervised contrastive learning by leveraging unlabeled data. Finally, we adopt a simple and efficient mini-batch K-means clustering method for novel class discovery. We conduct extensive experiments on the NCDL benchmark, and the results demonstrate that the proposed DRM approach significantly outperforms previous methods, establishing a new state-of-the-art.
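A minimal sketch of the mini-batch K-means clustering step mentioned above, as one might run it over region embeddings of unlabeled proposals; the embedding array, cluster count, and variable names are assumptions, not the authors' code.

```python
# Hypothetical setup: cluster region embeddings of unlabeled proposals into novel classes.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
novel_embeddings = rng.normal(size=(10000, 256)).astype(np.float32)  # stand-in for proposal features
n_novel_classes = 20                                                 # assumed number of novel clusters

kmeans = MiniBatchKMeans(n_clusters=n_novel_classes, batch_size=1024, random_state=0)
cluster_ids = kmeans.fit_predict(novel_embeddings)  # pseudo novel-class assignment per proposal
```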



Paperid:194
Authors:Tuo Feng, Ruijie Quan, Xiaohan Wang, Wenguan Wang, Yi Yang
University of Technology Sydney, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
3D decision-critical tasks urgently require research on explanations to ensure system reliability and transparency. Extensive explanatory research has been conducted on 2D images, but it is lacking in the 3D field. Furthermore, the existing explanations for 3D models are post-hoc and can be misleading, as they separate explanations from the original model. To address these issues, we propose an ad-hoc interpretable classifier for 3D point clouds (i.e., Interpretable3D). As an intuitive case-based classifier, Interpretable3D can provide reliable ad-hoc explanations without any embarrassing nuances. It allows users to understand how queries are embedded within past observations in prototype sets. Interpretable3D has two iterative training steps: 1) updating one prototype with the mean of the embeddings within the same sub-class in Prototype Estimation, and 2) penalizing or rewarding the estimated prototypes in Prototype Optimization. The mean of embeddings has a clear statistical meaning, i.e., class sub-centers. Moreover, we update prototypes with their most similar observations in the last few epochs. Finally, Interpretable3D classifies new samples according to the prototypes. We evaluate the performance of Interpretable3D on four popular point cloud models: DGCNN, PointNet2, PointMLP, and PointNeXt. Our Interpretable3D demonstrates comparable or superior performance compared to softmax-based black-box models in the tasks of 3D shape classification and part segmentation. Our code is released at: github.com/FengZicai/Interpretable3D.
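A hedged sketch of the two prototype steps named in the abstract, i.e. estimating each prototype as the mean embedding of its sub-class and classifying new samples by their most similar prototype; tensor shapes, the cosine-similarity choice, and all names here are assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def estimate_prototypes(embeddings: torch.Tensor, subclass_ids: torch.Tensor, n_protos: int) -> torch.Tensor:
    """Prototype Estimation: each prototype is the mean embedding of its sub-class."""
    protos = torch.zeros(n_protos, embeddings.size(1), device=embeddings.device)
    for k in range(n_protos):
        mask = subclass_ids == k
        if mask.any():
            protos[k] = embeddings[mask].mean(dim=0)
    return protos

def classify_by_prototype(queries: torch.Tensor, protos: torch.Tensor, proto_to_class: torch.Tensor) -> torch.Tensor:
    """Assign each query to the class of its most similar prototype (cosine similarity)."""
    sims = F.cosine_similarity(queries.unsqueeze(1), protos.unsqueeze(0), dim=-1)  # (Q, P)
    return proto_to_class[sims.argmax(dim=1)]
```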



Paperid:195
Authors:Hui Fu, Zeqing Wang, Ke Gong, Keze Wang, Tianshui Chen, Haojie Li, Haifeng Zeng, Wenxiong Kang
South China University of Technology, Sun Yat-sen University, X-ERA.ai, Sun Yat-sen University, Guangdong University of Technology, X-ERA.ai, X-ERA.ai, South China University of Technology
Abstract:
Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. However, existing works primarily focus on achieving precise lip synchronization while neglecting to model the subject-specific speaking style, often resulting in unrealistic facial animations. To the best of our knowledge, this work makes the first attempt to explore the coupled information between the speaking style and the semantic content in facial motions. Specifically, we introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding and leads to a more realistic synthesis of speech-driven facial animations. Subsequently, we propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions by building two latent spaces for style and content, respectively. Moreover, to facilitate disentangled representation learning, we introduce four well-designed constraints: an auxiliary style classifier, an auxiliary inverse classifier, a content contrastive loss, and a pair of latent cycle losses, which can effectively contribute to the construction of the identity-related style space and semantic-related content space. Extensive qualitative and quantitative experiments conducted on three publicly available datasets demonstrate that our approach outperforms state-of-the-art methods and is capable of capturing diverse speaking styles for speech-driven 3D facial animation. The source code and supplementary video are publicly available at: https://zeqing-wang.github.io/Mimic/



Paperid:196
Authors:Qijun Gan, Wentong Li, Jinwei Ren, Jianke Zhu
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
Reconstructing high-fidelity hand models with intricate textures plays a crucial role in enhancing human-object interaction and advancing real-world applications. Although state-of-the-art methods excel in texture generation and image rendering, they often face challenges in accurately capturing geometric details. Learning-based approaches usually offer better robustness and faster inference, but they tend to produce smoother results and require substantial amounts of training data. To address these issues, we present a novel fine-grained multi-view hand mesh reconstruction method that leverages inverse rendering to restore hand poses and intricate details. Firstly, our approach predicts a parametric hand mesh model from multi-view images through a Graph Convolutional Network (GCN)-based method. We further introduce a novel Hand Albedo and Mesh (HAM) optimization module to refine both the hand mesh and textures, which is capable of preserving the mesh topology. In addition, we suggest an effective mesh-based neural rendering scheme to simultaneously generate photo-realistic images and optimize mesh geometry by fusing the pre-trained rendering network with vertex features. We conduct comprehensive experiments on InterHand2.6M, DeepHandMesh, and a dataset collected by ourselves, and the promising results show that our proposed approach outperforms the state-of-the-art methods in both reconstruction accuracy and rendering quality. Code and dataset are publicly available at https://github.com/agnJason/FMHR.



Paperid:197
Authors:Chenxing Gao, Hang Zhou, Junqing Yu, YuTeng Ye, Jiale Cai, Junle Wang, Wei Yang
Huazhong University of Science and Technology, Huazhong University of Science & Technology, Huazhong University of Science & Technology, Huazhong University of Science & Technology, Huazhong University of Science and Technology, Tencent, Huazhong University of Science and Technology
Abstract:
Understanding the mechanisms behind Vision Transformer (ViT), particularly its vulnerability to adversarial perturbations, is crucial for addressing challenges in its real-world applications. Existing ViT adversarial attackers rely on labels to calculate the gradient for perturbation, and exhibit low transferability to other structures and tasks. In this paper, we present a label-free white-box attack approach for ViT-based models that exhibits strong transferability to various black-box models, including most ViT variants, CNNs, and MLPs, even for models developed for other modalities. Our inspiration comes from the feature collapse phenomenon in ViTs, where the critical attention mechanism overly depends on the low-frequency component of features, causing the features in middle-to-end layers to become increasingly similar and eventually collapse. We propose the feature diversity attacker to naturally accelerate this process and achieve remarkable performance and transferability.



Paperid:198
Authors:Hongzhi Gao, Zheng Chen, Zehui Chen, Lin Chen, Jiaming Liu, Shanghang Zhang, Feng Zhao
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, Peking University, Peking University, University of Science and Technology of China
Abstract:
Training high-accuracy 3D detectors necessitates massive labeled 3D annotations with 7 degrees of freedom, which is laborious and time-consuming. Point annotations have therefore been proposed, offering significant prospects for practical applications in 3D detection: they are not only more accessible and less expensive but also provide strong spatial information for object localization. In this paper, we empirically discover that it is non-trivial to merely adapt Point-DETR to its 3D form, encountering two main bottlenecks: 1) it fails to encode strong 3D priors into the model, and 2) it generates low-quality pseudo labels in distant regions due to the extreme sparsity of LiDAR points. To overcome these challenges, we introduce Point-DETR3D, a teacher-student framework for weakly semi-supervised 3D detection, designed to fully capitalize on point-wise supervision within a constrained instance-wise annotation budget. Different from Point-DETR, which encodes 3D positional information solely through a point encoder, we propose an explicit positional query initialization strategy to enhance the positional prior. Considering the low quality of pseudo labels produced by the teacher model at distant regions, we enhance the detector's perception by incorporating dense imagery data through a novel Cross-Modal Deformable RoI Fusion (D-RoI) module. Moreover, an innovative point-guided self-supervised learning technique is proposed to allow for fully exploiting point priors, even in student models. Extensive experiments on the representative nuScenes dataset demonstrate that our Point-DETR3D obtains significant improvements compared to previous works. Notably, with only 5% of the labeled data, Point-DETR3D achieves over 90% of the performance of its fully supervised counterpart.



Paperid:199
Authors:Jiayi Gao, Kongming Liang, Tao Wei, Wei Chen, Zhanyu Ma, Jun Guo
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Li Auto, Li Auto, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Human-object interaction (HOI) detection aims at localizing human-object pairs and recognizing their interactions. Trapped by the long-tailed distribution of the data, existing HOI detection methods often have difficulty recognizing the tail categories. Many approaches try to improve the recognition of HOI tasks by utilizing external knowledge (e.g., pre-trained visual-language models). However, these approaches mainly utilize external knowledge at the HOI combination level and achieve limited improvement in the tail categories. In this paper, we propose a dual-prior augmented decoding network that decomposes the HOI task into two sub-tasks: human-object pair detection and interaction recognition. For each sub-task, we leverage external knowledge to enhance the model's ability at a finer granularity. Specifically, we acquire the prior candidates from an external classifier and embed them to assist the subsequent decoding process. Thus, the long-tail problem is mitigated in a coarse-to-fine manner with the corresponding external knowledge. Our approach outperforms existing state-of-the-art models in various settings and significantly boosts the performance on the tail HOI categories. The source code is available at https://github.com/PRIS-CV/DP-ADN.



Paperid:200
Authors:Jingsheng Gao, Jiacheng Ruan, Suncheng Xiang, Zefang Yu, Ke Ji, Mingye Xie, Ting Liu, Yuzhuo Fu
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Southeast University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
With the success of pre-trained visual-language (VL) models such as CLIP in visual representation tasks, transferring pre-trained models to downstream tasks has become a crucial paradigm. Recently, the prompt tuning paradigm, which draws inspiration from natural language processing (NLP), has made significant progress in the VL field. However, preceding methods mainly focus on constructing prompt templates for text and visual inputs, neglecting the gap in class label representations between the VL models and downstream tasks. To address this challenge, we introduce an innovative label alignment method named LAMM, which can dynamically adjust the category embeddings of downstream datasets through end-to-end training. Moreover, to achieve a more appropriate label distribution, we propose a hierarchical loss, encompassing the alignment of the parameter space, feature space, and logits space. We conduct experiments on 11 downstream vision datasets and demonstrate that our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios, exhibiting an average accuracy improvement of 2.31% over the state-of-the-art methods at 16 shots. Moreover, our methodology exhibits superiority in continual learning compared to other prompt tuning methods. Importantly, our method is synergistic with existing prompt tuning methods and can boost the performance on top of them. Our code and dataset will be publicly available at https://github.com/gaojingsheng/LAMM.
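A minimal sketch, under assumptions, of the label-alignment idea: category embeddings are initialized from a VL text encoder and made trainable end-to-end against image features. The initialization, temperature, and names are illustrative, not the LAMM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableLabelEmbeddings(nn.Module):
    """Trainable category embeddings, initialized from (e.g.) CLIP text features."""
    def __init__(self, init_text_embeds: torch.Tensor):  # (num_classes, dim)
        super().__init__()
        self.label_embeds = nn.Parameter(init_text_embeds.clone())

    def forward(self, image_feats: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
        img = F.normalize(image_feats, dim=-1)
        lbl = F.normalize(self.label_embeds, dim=-1)
        return img @ lbl.t() / temperature  # class logits

# Usage sketch: logits = head(image_feats); loss = F.cross_entropy(logits, labels)
```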



Paperid:201
Authors:Xiang Gao, Zhengbo Xu, Junhan Zhao, Jiaying Liu
Peking University, Peking University, Peking University, Peking University
Abstract:
Recently, text-to-image diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing flexible image translation via user-provided text prompts. This paper proposes frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework contributing a novel solution to text-guided I2I from a frequency-domain perspective. At the heart of our framework is a feature-space frequency-domain filtering module based on Discrete Cosine Transform, which extracts image features carrying different DCT spectral bands to control the text-to-image generation process of the Latent Diffusion Model, realizing versatile I2I applications including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Different from related methods, FCDiffusion establishes a unified text-driven I2I framework suiting diverse I2I application scenarios simply by switching among different frequency control branches. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive experiments both qualitatively and quantitatively. Our project is publicly available at: https://xianggao1102.github.io/FCDiffusion/.
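An illustrative-only sketch of DCT band filtering of a feature map, the kind of operation the abstract attributes to its frequency-domain filtering module; the band definition and cutoffs are assumptions, not the paper's actual branches.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_band_filter(feat: np.ndarray, low: float = 0.0, high: float = 0.25) -> np.ndarray:
    """Keep only DCT coefficients whose normalized frequency index lies in [low, high)."""
    h, w = feat.shape
    coeffs = dctn(feat, norm="ortho")
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    freq = (yy / h + xx / w) / 2.0            # crude normalized frequency per coefficient
    mask = (freq >= low) & (freq < high)
    return idctn(coeffs * mask, norm="ortho")

# e.g. a low-frequency branch might keep [0, 0.1), a mid-frequency branch [0.1, 0.4)
```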



Paperid:202
Authors:Xinyu Gao, Ziyi Yang, Yunlu Zhao, Yuxiang Sun, Xiaogang Jin, Changqing Zou
State Key Lab of CAD&CG, Zhejiang University, State Key Lab of CAD&CG, Zhejiang University, State Key Lab of CAD&CG, Zhejiang University, Zhejiang Lab, State Key Lab of CAD&CG, Zhejiang University, State Key Lab of CAD&CG, Zhejiang University Zhejiang Lab
Abstract:
A variety of Neural Radiance Fields (NeRF) methods have recently achieved remarkable success in high render speed. However, current accelerating methods are specialized and incompatible with various implicit methods, preventing real-time composition over various types of NeRF works. Because NeRF relies on sampling along rays, it is possible to provide general guidance for acceleration. To that end, we propose a general implicit pipeline for composing NeRF objects quickly. Our method enables the casting of dynamic shadows within or between objects using analytical light sources while allowing multiple NeRF objects to be seamlessly placed and rendered together with any arbitrary rigid transformations. Mainly, our work introduces a new surface representation known as Neural Depth Fields (NeDF) that quickly determines the spatial relationship between objects by allowing direct intersection computation between rays and implicit surfaces. It leverages an intersection neural network to query NeRF for acceleration instead of depending on an explicit spatial structure. Our proposed method is the first to enable both the progressive and interactive composition of NeRF objects. Additionally, it also serves as a previewing plugin for a range of existing NeRF works.



Paperid:203
Authors:Yan Gao, Haojun Xu, Jie Li, Nannan Wang, Xinbo Gao
Xidian University, Xidian University, Xidian University, Xidian University, Xidian University Chongqing University of Posts and Telecommunications
Abstract:
Global multi-object tracking (MOT) systems can consider interaction, occlusion, and other “visual blur” scenarios to ensure effective object tracking in long videos. Among them, graph-based tracking-by-detection paradigms achieve surprising performance. However, their fully-connected nature poses storage space requirements that challenge algorithms handling long videos. Currently, commonly used methods still generate trajectories by building one-way forward associations across frames. Such matches, produced under the guidance of first-order similarity information, may not be optimal from a longer-term perspective. Moreover, they often lack an end-to-end scheme for correcting mismatches. This paper proposes the Composite Node Message Passing Network (CoNo-Link), a multi-scene generalized framework for modeling ultra-long-frame information for association. CoNo-Link's solution is a low-storage-overhead method for building constrained connected graphs. In addition to the previous practice of treating objects as nodes, the network innovatively treats object trajectories as nodes for information interaction, improving the graph neural network's feature representation capability. Specifically, we formulate the graph-building problem as a top-k selection task over reliable objects or trajectories. Our model can learn better predictions on longer time scales by adding composite nodes. As a result, our method outperforms the state-of-the-art on several commonly used datasets.



Paperid:204
Authors:Yudong Gao, Honglong Chen, Peng Sun, Junjian Li, Anqing Zhang, Zhibo Wang, Weifeng Liu
College of Control Science and Engineering, China University of Petroleum (East China), P.R. China, College of Control Science and Engineering, China University of Petroleum (East China), P.R. China, College of Computer Science and Electronic Engineering, Hunan University, P.R. China, College of Control Science and Engineering, China University of Petroleum (East China), P.R. China, College of Control Science and Engineering, China University of Petroleum (East China), P.R. China, School of Cyber Science and Technology, Zhejiang University, P.R. China, College of Control Science and Engineering, China University of Petroleum (East China), P.R. China
Abstract:
Backdoor attacks pose serious security threats to deep neural networks (DNNs). Backdoored models make arbitrary (targeted) incorrect predictions on inputs containing well-designed triggers, while behaving normally on clean inputs. Prior research has explored the invisibility of backdoor triggers to enhance attack stealthiness. However, most of it focuses only on invisibility in the spatial domain, neglecting the generation of invisible triggers in the frequency domain. This limitation renders the generated poisoned images easily detectable by recent defense methods. To address this issue, we propose a DUal stealthy BAckdoor attack method named DUBA, which simultaneously considers the invisibility of triggers in both the spatial and frequency domains, to achieve desirable attack performance while ensuring strong stealthiness. Specifically, we first use the Wavelet Transform to embed the high-frequency information of the trigger image into the clean image to ensure attack effectiveness. Then, to attain strong stealthiness, we incorporate the Fourier Transform and Cosine Transform to mix the poisoned image and clean image in the frequency domain. Moreover, DUBA adopts a novel attack strategy, training the model with weak triggers and attacking with strong triggers, to further enhance attack performance and stealthiness. DUBA is evaluated extensively on four datasets against popular image classifiers, showing significant superiority over state-of-the-art backdoor attacks in attack success rate and stealthiness.
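A minimal sketch (assumptions throughout: single-channel images of equal size, Haar wavelet, blend weight) of the first step described above, blending the trigger's high-frequency wavelet sub-bands into a clean image; the full DUBA pipeline additionally mixes images with Fourier and Cosine transforms and is not reproduced here.

```python
import numpy as np
import pywt

def embed_high_freq_trigger(clean: np.ndarray, trigger: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    """Blend the trigger's detail (high-frequency) sub-bands into the clean image's wavelet decomposition."""
    cA_c, (cH_c, cV_c, cD_c) = pywt.dwt2(clean, "haar")
    _, (cH_t, cV_t, cD_t) = pywt.dwt2(trigger, "haar")
    mixed = tuple((1 - alpha) * c + alpha * t
                  for c, t in zip((cH_c, cV_c, cD_c), (cH_t, cV_t, cD_t)))
    return pywt.idwt2((cA_c, mixed), "haar")  # approximation band kept from the clean image
```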



Paperid:205
Authors:Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Ke Li, Jie Yang, Wei Liu, Xing Sun
Tencent Youtu Lab, Shanghai Jiao Tong University, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Tencent Youtu Lab
Abstract:
During the preceding biennium, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs, where the pairs are entirely exclusive of each other, remains a challenging task, and noise exists in the commonly used datasets. To address this issue, we propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment by introducing a softened target, which is generated from the fine-grained intra-modal self-similarity. The intra-modal guidance is indicative, enabling two pairs to share some local similarities and allowing the model to capture many-to-many relationships between the two modalities. Besides, since the positive still dominates in the softened target distribution, we disentangle the negatives in the distribution to further boost the relation alignment with the negatives in cross-modal learning. Extensive experiments demonstrate the effectiveness of SoftCLIP. In particular, on the ImageNet zero-shot classification task, using CC3M/CC12M as the pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2% over the CLIP baseline.
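A hedged sketch of the softened-target idea: the cross-modal target for each image-text pair is relaxed from a one-hot label to a distribution derived from intra-modal self-similarity, trained with a KL objective. Temperatures and the exact target construction are assumptions, not SoftCLIP's released loss.

```python
import torch
import torch.nn.functional as F

def soft_alignment_loss(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                        tau: float = 0.07, tau_target: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_feats, dim=-1)   # (B, D)
    txt = F.normalize(txt_feats, dim=-1)   # (B, D)
    logits = img @ txt.t() / tau                                  # cross-modal similarities
    soft_targets = F.softmax(img @ img.t() / tau_target, dim=-1)  # intra-modal self-similarity as target
    return F.kl_div(F.log_softmax(logits, dim=-1), soft_targets, reduction="batchmean")
```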



Paperid:206
Authors:Prajwal Gatti, Kshitij Parikh, Dhriti Prasanna Paul, Manish Gupta, Anand Mishra
Indian Institute of Technology Jodhpur, Indian Institute of Technology Jodhpur, Indian Institute of Technology Jodhpur, Microsoft, Indian Institute of Technology Jodhpur
Abstract:
Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for ‘numbats.’ Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., “numbat digging in the ground.” In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of “difficult-to-name but easy-to-draw” objects and text describing “difficult-to-sketch but easy-to-verbalize” object's attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of ~2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNet (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at: https://vl2g.github.io/projects/cstbir.



Paperid:207
Authors:Chengjie Ge, Xueyang Fu, Peng He, Kunyu Wang, Chengzhi Cao, Zheng-Jun Zha
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China
Abstract:
Convolutional neural network-based video de-raining methods commonly rely on dense intensity frames captured by CMOS sensors. However, the limited temporal resolution of these sensors hinders the capture of dynamic rainfall information, limiting further improvement in de-raining performance. This study aims to overcome this issue by incorporating the neuromorphic event signal into video de-raining to enhance dynamic information perception. Specifically, we first utilize the dynamic information from the event signal as prior knowledge and integrate it into existing de-raining objectives to better constrain the solution space. We then design an optimization algorithm to solve the objective and construct a de-raining network, with CNNs as the backbone architecture, using a modular strategy to mimic the optimization process. To further explore the temporal correlation of the event signal, we incorporate a spiking self-attention module into our network. By leveraging the low latency and high temporal resolution of the event signal, along with the spatial and temporal representation capabilities of convolutional and spiking neural networks, our model captures more accurate dynamic information and significantly improves de-raining performance. For example, our network achieves a 1.24 dB improvement on the SynHeavy25 dataset compared to the previous state-of-the-art method, while utilizing only 39% of the parameters.



Paperid:208
Authors:Yanqi Ge, Qiang Nie, Ye Huang, Yong Liu, Chengjie Wang, Feng Zheng, Wen Li, Lixin Duan
Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Tencent Youtu Lab, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Tencent Youtu Lab, Tencent Youtu Lab Shanghai Jiao Tong University, Southern University of Science and Technology, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China Sichuan Provincial People’s Hospital
Abstract:
One of the ultimate goals of representation learning is to achieve compactness within a class and good separability between classes. Many outstanding metric-based and prototype-based methods following the Expectation-Maximization paradigm have been proposed for this objective. However, they inevitably introduce biases into the learning process, particularly with long-tail distributed training data. In this paper, we reveal that the class prototype does not necessarily have to be derived from training features and propose a novel perspective: using pre-defined class anchors serving as feature centroids to unidirectionally guide feature learning. However, the pre-defined anchors may have a large semantic distance from the pixel features, which prevents them from being directly applied. To address this issue and generate feature centroids independent of feature learning, a simple yet effective Semantic Anchor Regularization (SAR) is proposed. SAR ensures the inter-class separability of semantic anchors in the semantic space by employing a classifier-aware auxiliary cross-entropy loss during training via disentanglement learning. By pulling the learned features to these semantic anchors, several advantages can be attained: 1) intra-class compactness and natural inter-class separability, 2) induced bias or errors from feature learning can be avoided, and 3) robustness to the long-tailed problem. The proposed SAR can be used in a plug-and-play manner in existing models. Extensive experiments demonstrate that SAR performs better than previous sophisticated prototype-based methods. The implementation is available at https://github.com/geyanqi/SAR.
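A minimal illustration, under assumed design choices, of pre-defined semantic anchors used as fixed feature centroids: features are pulled toward the anchor of their class. The orthogonal anchor construction and the cosine pull loss are assumptions, not the paper's SAR module (which additionally uses a classifier-aware auxiliary cross-entropy loss).

```python
import torch
import torch.nn.functional as F

NUM_CLASSES, DIM = 19, 256
ANCHORS = torch.eye(NUM_CLASSES, DIM)  # pre-defined, mutually orthogonal class anchors (assumed choice)

def anchor_pull_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pull each feature (N, DIM) toward the fixed anchor of its class label (N,)."""
    targets = ANCHORS.to(features.device)[labels]
    return 1.0 - F.cosine_similarity(features, targets, dim=-1).mean()
```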



Paperid:209
Authors:Wenjia Geng, Yong Liu, Lei Chen, Sujia Wang, Jie Zhou, Yansong Tang
Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University, University of Science and Technology Beijing, Shenzhen International Graduate School, Tsinghua University, Department of Automation, Tsinghua University, Shenzhen International Graduate School, Tsinghua University
Abstract:
Weakly Supervised temporal Article Grounding (WSAG) is a challenging and practical task in video understanding. Specifically, given a video and a relevant article, whose sentences are at different semantic scales, WSAG aims to localize the corresponding video segments for all “groundable” sentences. Compared to other grounding tasks, e.g., localizing one target segment with respect to a given sentence query, WSAG confronts an essential obstacle rooted in the intricate multi-scale information inherent within both textual and visual modalities. Existing methods overlook the modeling and alignment of such structured information present in multi-scale video segments and hierarchical textual content. To this end, we propose a Multi-Scale Video-Text Correspondence Learning (MVTCL) framework, which enhances the grounding performance in complex scenes by modeling multi-scale semantic correspondence both within and between modalities. Specifically, MVTCL initially aggregates video content spanning distinct temporal scales and leverages hierarchical textual relationships in both temporal and semantic dimensions via a semantic calibration module. Then, a multi-scale contrastive learning module is introduced to generate more discriminative representations by selecting typical contexts and performing inter-video contrastive learning. Through the multi-scale semantic calibration architecture and supervision design, our method achieves new state-of-the-art performance on existing WSAG benchmarks.



Paperid:210
Authors:Mohsen Gholami, Rabab Ward, Z. Jane Wang
University of British Columbia, University of British Columbia, University of British Columbia
Abstract:
This paper proposes an end-to-end framework for generating 3D human pose datasets using Neural Radiance Fields (NeRF). Public datasets generally have limited diversity in terms of human poses and camera viewpoints, largely due to the resource-intensive nature of collecting 3D human pose data. As a result, pose estimators trained on public datasets significantly underperform when applied to unseen out-of-distribution samples. Previous works proposed augmenting public datasets by generating 2D-3D pose pairs or rendering a large amount of random data. Such approaches either overlook image rendering or result in suboptimal datasets for pre-trained models. Here we propose PoseGen, which learns to generate a dataset (human 3D poses and images) with a feedback loss from a given pre-trained pose estimator. In contrast to prior art, our generated data is optimized to improve the robustness of the pre-trained model. The objective of PoseGen is to learn a distribution of data that maximizes the prediction error of a given pre-trained model. As the learned data distribution contains OOD samples of the pre-trained model, sampling data from such a distribution for further fine-tuning a pre-trained model improves the generalizability of the model. This is the first work that proposes NeRFs for 3D human data generation. NeRFs are data-driven and do not require 3D scans of humans. Therefore, using NeRF for data generation is a new direction for convenient user-specific data generation. Our extensive experiments show that the proposed PoseGen improves two baseline models (SPIN and HybrIK) on four datasets with an average 6% relative improvement.



Paperid:211
Authors:Lei Gong, Yu Zhang, Yingqing Xia, Yanyong Zhang, Jianmin Ji
School of Computer Science and Technology, University of Science and Technology of China, Hefei, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
Abstract:
Nowadays, closed-set perception methods for autonomous driving perform well on datasets containing normal scenes. However, they still struggle to handle anomalies in the real world, such as unknown objects that have never been seen while training. The lack of public datasets to evaluate the model performance on anomaly and corner cases has hindered the development of reliable autonomous driving systems. Therefore, we propose a multimodal Synthetic Dataset for Anomaly and Corner case detection, called SDAC, which encompasses anomalies captured from multi-view cameras and the LiDAR sensor, providing a rich set of annotations for multiple mainstream perception tasks. SDAC is the first public dataset for autonomous driving that categorizes anomalies into object, scene, and scenario levels, allowing the evaluation under different anomalous conditions. Experiments show that closed-set models suffer significant performance drops on anomaly subsets in SDAC. Existing anomaly detection methods fail to achieve satisfactory performance, suggesting that anomaly detection remains a challenging problem. We anticipate that our SDAC dataset could foster the development of safe and reliable systems for autonomous driving.



Paperid:212
Authors:Dongjun Gu, Jaehyeok Shim, Jaehoon Jang, Changwoo Kang, Kyungdon Joo
UNIST, UNIST, UNIST, UNIST, UNIST
Abstract:
Among various interactions between humans, such as eye contact and gestures, physical interactions by contact can act as an essential moment in understanding human behaviors. Inspired by this fact, given a 3D partner human with the desired interaction label, we introduce a new task of 3D human generation in terms of physical contact. Unlike previous works of interacting with static objects or scenes, a given partner human can have diverse poses and different contact regions according to the type of interaction. To handle this challenge, we propose a novel method of generating interactive 3D humans for a given partner human based on a guided diffusion framework (ContactGen in short). Specifically, we present a novel contact prediction module that adaptively estimates potential contact regions between two input humans according to the interaction label. Using the estimated potential contact regions as complementary guidance, we dynamically enforce ContactGen to generate interactive 3D humans for a given partner human within a guided diffusion model. We demonstrate ContactGen on the CHI3D dataset, where our method generates physically plausible and diverse poses compared to competing methods.



Paperid:213
Authors:Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China Objecteye Inc., Beijing, China, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China Objecteye Inc., Beijing, China, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China Objecteye Inc., Beijing, China
Abstract:
Large Vision-Language Models (LVLMs) such as MiniGPT-4 and LLaVA have demonstrated the capability of understanding images and achieved remarkable performance in various visual tasks. Despite their strong abilities in recognizing common objects due to extensive training datasets, they lack specific domain knowledge and have a weaker understanding of localized details within objects, which hinders their effectiveness in the Industrial Anomaly Detection (IAD) task. On the other hand, most existing IAD methods only provide anomaly scores and necessitate the manual setting of thresholds to distinguish between normal and abnormal samples, which restricts their practical implementation. In this paper, we explore the utilization of LVLMs to address the IAD problem and propose AnomalyGPT, a novel IAD approach based on LVLMs. We generate training data by simulating anomalous images and producing corresponding textual descriptions for each image. We also employ an image decoder to provide fine-grained semantics and design a prompt learner to fine-tune the LVLM using prompt embeddings. Our AnomalyGPT eliminates the need for manual threshold adjustments and thus directly assesses the presence and locations of anomalies. Additionally, AnomalyGPT supports multi-turn dialogues and exhibits impressive few-shot in-context learning capabilities. With only one normal shot, AnomalyGPT achieves state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset.



Paperid:214
Authors:Huankang Guan, Rynson W.H. Lau
City University of Hong Kong, City University of Hong Kong
Abstract:
Salient Object Ranking (SOR) is the process of predicting the order of an observer's attention to objects when viewing a complex scene. Existing SOR methods primarily focus on ranking various scene objects simultaneously by exploring their spatial and semantic properties. However, their solutions of simultaneously ranking all salient objects do not align with human viewing behavior, and may result in incorrect attention shift predictions. We observe that humans view a scene through a sequential and continuous process involving a cycle of foveating to objects of interest with our foveal vision while using peripheral vision to prepare for the next fixation location. For instance, when we see a flying kite, our foveal vision captures the kite itself, while our peripheral vision can help us locate the person controlling it such that we can smoothly divert our attention to it next. By repeatedly carrying out this cycle, we can gain a thorough understanding of the entire scene. Based on this observation, we propose to model the dynamic interplay between foveal and peripheral vision to predict human attention shifts sequentially. To this end, we propose a novel SOR model, SeqRank, which reproduces foveal vision to extract high-acuity visual features for accurate salient instance segmentation while also modeling peripheral vision to select the object that is likely to grab the viewer’s attention next. By incorporating both types of vision, our model can mimic human viewing behavior better and provide a more faithful ranking among various scene objects. Most notably, our model improves the SA-SOR/MAE scores by +6.1%/-13.0% on IRSR, compared with the state-of-the-art. Extensive experiments show the superior performance of our model on the SOR benchmarks. Code is available at https://github.com/guanhuankang/SeqRank.



Paperid:215
Authors:Yong Guan, Freddy Lécué, Jiaoyan Chen, Ru Li, Jeff Z. Pan
Tsinghua University Shanxi University, Inria, The University of Manchester, Shanxi University, The University of Edinburgh
Abstract:
Although neural models have achieved remarkable performance, they still encounter doubts due to their lack of transparency. To this end, model prediction explanation is attracting more and more attention. However, current methods rarely incorporate external knowledge and still suffer from three limitations: (1) Neglecting concept completeness: merely selecting concepts may not be sufficient for prediction. (2) Lacking concept fusion: failure to merge semantically equivalent concepts. (3) Difficulty in manipulating model behavior: lack of verification of explanations on the original model. To address these issues, we propose a novel knowledge-aware neuron interpretation framework to explain model predictions for image scene classification. Specifically, for concept completeness, we present core concepts of a scene based on the knowledge graph ConceptNet to gauge the completeness of concepts. Our method, incorporating complete concepts, effectively provides better prediction explanations compared to baselines. Furthermore, for concept fusion, we introduce a knowledge graph-based method known as Concept Filtering, which produces a gain of over 23 percentage points on neuron behaviors for neuron interpretation. Finally, we propose Model Manipulation, which aims to study whether the core concepts based on ConceptNet can be employed to manipulate model behavior. The results show that core concepts can effectively improve the performance of the original model by over 26%.



Paperid:216
Authors:Huijie Guo, Ying Ba, Jie Hu, Lingyu Si, Wenwen Qiang, Lei Shi
Beihang University, Gaoling School of Artificial Intelligence, Renmin University of China Beijing Key Laboratory of Big Data Management and Analysis Methods, Meituan, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Beihang University
Abstract:
Self-Supervised Learning (SSL) methods harness the concept of semantic invariance by utilizing data augmentation strategies to produce similar representations for different deformations of the same input. Essentially, the model captures the shared information among multiple augmented views of samples, while disregarding the non-shared information that may be beneficial for downstream tasks. To address this issue, we introduce a module called CompMod with Meta Comprehensive Regularization (MCR), embedded into existing self-supervised frameworks, to make the learned representations more comprehensive. Specifically, we update our proposed model through a bi-level optimization mechanism, enabling it to capture comprehensive features. Additionally, guided by the constrained extraction of features using maximum entropy coding, the self-supervised learning model learns more comprehensive features on top of learning consistent features. In addition, we provide theoretical support for our proposed method from information-theoretic and causal counterfactual perspectives. Experimental results show that our method achieves significant improvements in classification, object detection, and semantic segmentation tasks on multiple benchmark datasets.



Paperid:217
Authors:Junwen Guo, Guobao Xiao, Shiping Wang, Jun Yu
Tongji University Fuzhou University, Tongji University, Fuzhou University, Hangzhou Dianzi University
Abstract:
Most existing correspondence pruning methods concentrate only on gathering as much context information as possible while neglecting effective ways to utilize such information. To tackle this dilemma, in this paper we propose the Graph Context Transformation Network (GCT-Net), which enhances context information to conduct consensus guidance for progressive correspondence pruning. Specifically, we design the Graph Context Enhance Transformer, which first generates the graph network and then transforms it into multi-branch graph contexts. Moreover, it employs self-attention and cross-attention to magnify the characteristics of each graph context, emphasizing the unique as well as the shared essential information. To further apply the recalibrated graph contexts to the global domain, we propose the Graph Context Guidance Transformer. This module adopts a confidence-based sampling strategy to temporarily screen high-confidence vertices, guiding accurate classification by searching for global consensus between the screened vertices and the remaining ones. The extensive experimental results on outlier removal and relative pose estimation clearly demonstrate the superior performance of GCT-Net compared to state-of-the-art methods across outdoor and indoor datasets.



Paperid:218
Authors:Shuai Guo, Qiuwen Wang, Yijie Gao, Rong Xie, Li Song
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Novel-view synthesis with sparse input views is important for real-world applications like AR/VR and autonomous driving. Recent methods have integrated depth information into NeRFs for sparse input synthesis, leveraging depth prior for geometric and spatial understanding. However, most existing works tend to overlook inaccuracies within depth maps and have low time efficiency. To address these issues, we propose a depth-guided robust and fast point cloud fusion NeRF for sparse inputs. We perceive radiance fields as an explicit voxel grid of features. A point cloud is constructed for each input view, characterized within the voxel grid using matrices and vectors. We accumulate the point cloud of each input view to construct the fused point cloud of the entire scene. Each voxel determines its density and appearance by referring to the point cloud of the entire scene. Through point cloud fusion and voxel grid fine-tuning, inaccuracies in depth values are refined or substituted by those from other views. Moreover, our method can achieve faster reconstruction and greater compactness through effective vector-matrix decomposition. Experimental results underline the superior performance and time efficiency of our approach compared to state-of-the-art baselines.



Paperid:219
Authors:Tianyu Guo, Haowei Wang, Yiwei Ma, Jiayi Ji, Xiaoshuai Sun
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Abstract:
Recent advancements in single-stage Panoptic Narrative Grounding (PNG) have demonstrated significant potential. These methods predict pixel-level masks by directly matching pixels and phrases. However, they often neglect the modeling of semantic and visual relationships between phrase-level instances, limiting their ability for complex multi-modal reasoning in PNG. To tackle this issue, we propose XPNG, a “differentiation-refinement-localization” reasoning paradigm for accurately locating instances or regions. In XPNG, we introduce a Semantic Context Convolution (SCC) module to leverage semantic priors for generating distinctive features. This well-crafted module employs a combination of dynamic channel-wise convolution and pixel-wise convolution to embed semantic information and establish inter-object relationships guided by semantics. Subsequently, we propose a Visual Context Verification (VCV) module to provide visual cues, eliminating potential space biases introduced by semantics and further refining the visual features generated by the previous module. Extensive experiments on PNG benchmark datasets reveal that our approach achieves state-of-the-art performance, significantly outperforming existing methods by a considerable margin and yielding a 3.9-point improvement in overall metrics. Our codes and results are available at our project webpage: https://github.com/TianyuGoGO/XPNG.



Paperid:220
Authors:Wei Guo, Yuqi Zhang, De Ma, Qian Zheng
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
Recent advancement in computer vision has significantly lowered the barriers to artistic creation. Exemplar-based image translation methods have attracted much attention due to flexibility and controllability. However, these methods hold assumptions regarding semantics or require semantic information as the input, while accurate semantics is not easy to obtain in artistic images. Besides, these methods suffer from cross-domain artifacts due to training data prior and generate imprecise structure due to feature compression in the spatial domain. In this paper, we propose an arbitrary Style Image Manipulation Network (SIM-Net), which leverages semantic-free information as guidance and a region transportation strategy in a self-supervised manner for image generation. Our method balances computational efficiency and high resolution to a certain extent. Moreover, our method facilitates zero-shot style image manipulation. Both qualitative and quantitative experiments demonstrate the superiority of our method over state-of-the-art methods. Code is available at https://github.com/SnailForce/SIM-Net.



Paperid:221
Authors:Wengang Guo, Jiayi Yang, Huilin Yin, Qijun Chen, Wei Ye
Tongji University, Tongji University, Tongji University, Tongji University, Tongji University
Abstract:
Convolutional Neural Networks (CNNs) have exhibited great performance in discriminative feature learning for complex visual tasks. Besides discrimination power, interpretability is another important yet underexplored property of CNNs. One difficulty in CNN interpretability is that filters and image classes are entangled. In this paper, we introduce a novel pathway to alleviate the entanglement between filters and image classes. The proposed pathway groups the filters in a late conv-layer of a CNN into class-specific clusters. Clusters and classes are in a one-to-one relationship. Specifically, we use Bernoulli sampling to generate the filter-cluster assignment matrix from a learnable filter-class correspondence matrix. To enable end-to-end optimization, we develop a novel reparameterization trick for handling the non-differentiable Bernoulli sampling. We evaluate the effectiveness of our method on ten widely used network architectures (including nine CNNs and a ViT) and five benchmark datasets. Experimental results demonstrate that our method PICNN (the combination of standard CNNs with our proposed pathway) exhibits greater interpretability than standard CNNs while achieving higher or comparable discrimination power.
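A hedged sketch of the non-differentiable step mentioned above: sampling a binary filter-to-cluster assignment matrix from a learnable filter-class correspondence matrix. A straight-through estimator is used here as a stand-in; the paper's own reparameterization trick may differ.

```python
import torch
import torch.nn as nn

class FilterClusterAssignment(nn.Module):
    """Learnable filter-class correspondence -> sampled binary filter-cluster assignment."""
    def __init__(self, num_filters: int, num_classes: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_filters, num_classes))

    def forward(self) -> torch.Tensor:
        probs = torch.sigmoid(self.logits)
        hard = torch.bernoulli(probs)            # non-differentiable Bernoulli sample
        return hard + probs - probs.detach()     # straight-through: gradients flow through probs
```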



Paperid:222
Authors:Vinayak Gupta, Rahul Goel, Sirikonda Dhawal, P. J. Narayanan
Indian Institute of Technology, Madras, International Institute of Information Technology, Hyderabad, International Institute of Information Technology, Hyderabad, International Institute of Information Technology, Hyderabad
Abstract:
Traditional Radiance Field (RF) representations capture details of a specific scene and must be trained afresh on each scene. Semantic feature fields have been added to RFs to facilitate several segmentation tasks. Generalised RF representations learn the principles of view interpolation. A generalised RF can render new views of an unknown and untrained scene, given a few views. We present a way to distil feature fields into the generalised GNT representation. Our GSN representation generates new views of unseen scenes on the fly along with consistent, per-pixel semantic features. This enables multi-view segmentation of arbitrary new scenes. We show different semantic features being distilled into generalised RFs. Our multi-view segmentation results are on par with methods that use traditional RFs. GSN closes the gap between standard and generalisable RF methods significantly. Project Page: https://vinayak-vg.github.io/GSN/



Paperid:223
Authors:Bo Han, Hao Peng, Minjing Dong, Yi Ren, Yixuan Shen, Chang Xu
Zhejiang University, Unity China, University of Sydney, Zhejiang Univerisity, National University of Singapore, University of Sydney
Abstract:
Human motion generation aims to produce plausible human motion sequences according to various conditional inputs, such as text or audio. Despite the feasibility of existing methods in generating motion based on short prompts and simple motion patterns, they encounter difficulties when dealing with long prompts or complex motions. The challenges are twofold: 1) the scarcity of motion-capture data for long prompts and complex motions, and 2) the high diversity of human motions in the temporal domain and the substantial divergence of distributions from conditional modalities, leading to a many-to-many mapping problem when generating motion from complex and long texts. In this work, we address these gaps by 1) building the first dataset pairing long textual descriptions with complex 3D motions (HumanLong3D), and 2) proposing an autoregressive motion diffusion model (AMD). Specifically, AMD integrates the text prompt at the current timestep with the text prompt and action sequences at the previous timestep as conditional information to predict the current action sequences in an iterative manner. Furthermore, we present its generalization to X-to-Motion with “No Modality Left Behind”, enabling for the first time the generation of high-definition and high-fidelity human motions from user-defined modality input.



Paperid:224
Authors:Gaoge Han, Shaoli Huang, Mingming Gong, Jinglei Tang
Northwest A&F University, Tencent AI-Lab, The University of Melbourne Mohamed bin Zayed University of Artificial Intelligence, Northwest A&F University
Abstract:
We introduce HuTuMotion, an innovative approach for generating natural human motions that navigates latent motion diffusion models by leveraging few-shot human feedback. Unlike existing approaches that sample latent variables from a standard normal prior distribution, our method adapts the prior distribution to better suit the characteristics of the data, as indicated by human feedback, thus enhancing the quality of motion generation. Furthermore, our findings reveal that utilizing few-shot feedback can yield performance levels on par with those attained through extensive human feedback. This discovery emphasizes the potential and efficiency of incorporating few-shot human-guided optimization within latent diffusion models for personalized and style-aware human motion generation applications. The experimental results show the significantly superior performance of our method over existing state-of-the-art approaches.



Paperid:225
Authors:Mengqiao Han, Liyuan Pan, Xiabi Liu
Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology
Abstract:
Artificial neuron (N-N) model-based networks have accomplished extraordinary success in various vision tasks. However, as a simplification of the mammalian neuron model, their structure is locked during training, resulting in overfitting and over-parameterization. The astrocyte, newly explored by biologists, can adaptively modulate neuronal communication by inserting itself between neurons. The communication between the astrocyte and the neuron is bidirectional and shows the potential to alleviate issues raised by the unidirectional communication in the N-N model. In this paper, we first elaborate the artificial Multi-Astrocyte-Neuron (MA-N) model, which enriches the functionality of the artificial neuron model. Our MA-N model is formulated at both the astrocyte and neuron levels, mimicking the bidirectional communication with temporal and joint mechanisms. Then, we construct the MA-Net network with the MA-N model, whose neural connections can be continuously and adaptively modulated during training. Experiments show that our MA-Net achieves new state-of-the-art results on multiple tasks while significantly reducing its parameters through connection optimization.



Paperid:226
Authors:Yucheng Han, Na Zhao, Weiling Chen, Keng Teck Ma, Hanwang Zhang
Nanyang Technological University, Singapore University of Technology and Design, Hyundai Motor Group Innovation Center in Singapore, Hyundai Motor Group Innovation Center in Singapore, Nanyang Technological University
Abstract:
Semi-supervised 3D object detection is a promising yet under-explored direction to reduce data annotation costs, especially for cluttered indoor scenes. A few prior works, such as SESS and 3DIoUMatch, attempt to solve this task by utilizing a teacher model to generate pseudo-labels for unlabeled samples. However, the availability of unlabeled samples in the 3D domain is relatively limited compared to its 2D counterpart due to the greater effort required to collect 3D data. Moreover, the loose consistency regularization in SESS and restricted pseudo-label selection strategy in 3DIoUMatch lead to either low-quality supervision or a limited amount of pseudo labels. To address these issues, we present a novel Dual-Perspective Knowledge Enrichment approach named DPKE for semi-supervised 3D object detection. Our DPKE enriches the knowledge of limited training data, particularly unlabeled data, from two perspectives: data-perspective and feature-perspective. Specifically, from the data-perspective, we propose a class-probabilistic data augmentation method that augments the input data with additional instances based on the varying distribution of class probabilities. Our DPKE achieves feature-perspective knowledge enrichment by designing a geometry-aware feature matching method that regularizes feature-level similarity between object proposals from the student and teacher models. Extensive experiments on the two benchmark datasets demonstrate that our DPKE achieves superior performance over existing state-of-the-art approaches under various label ratio conditions. The source code and models will be made available to the public.



Paperid:227
Authors:Yudong Han, Yupeng Hu, Xuemeng Song, Haoyu Tang, Mingzhu Xu, Liqiang Nie
School of Software, Shandong University, School of Software, Shandong University, School of Computer Science and Technology, Shandong University, School of Software, Shandong University, School of Software, Shandong University, School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
Abstract:
Benefiting from instrumental global dependency modeling of self-attention (SA), transformer-based approaches have become the pivotal choices for numerous downstream visual reasoning tasks, such as visual question answering (VQA) and referring expression comprehension (REC). However, some studies have recently suggested that SA tends to suffer from rank collapse, which inevitably leads to representation degradation as the transformer layers go deeper. Inspired by social network theory, we attempt to make an analogy between social behavior and regional information interaction in SA, and harness two crucial notions, structural hole and degree centrality in social networks, to explore possible optimizations of SA learning, which naturally deduces two plug-and-play social-like modules. Based on structural hole, the former module makes information interaction in SA more structured, which effectively avoids redundant information aggregation and global feature homogenization for better rank remedy; the latter module then comprehensively characterizes and refines the representation discrimination by considering the degree centrality of regions and the transitivity of relations. Without bells and whistles, our model outperforms a bunch of baselines by a noticeable margin when considering our social-like prior on five benchmarks in VQA and REC tasks, and a series of explanatory results are showcased to sufficiently reveal the social-like behaviors in SA.



Paperid:228
Authors:Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, Yiran Zhong
Bilibili Inc., OpenNLPLab, Shanghai AI Lab Northwestern Polytechnical University, NIO, OpenNLPLab, Shanghai AI Lab, Northwestern Polytechnical University, OpenNLPLab, Shanghai AI Lab
Abstract:
The aim of audio-visual segmentation (AVS) is to precisely differentiate audible objects within videos down to the pixel level. Traditional approaches often tackle this challenge by combining information from various modalities, where the contribution of each modality is implicitly or explicitly modeled. Nevertheless, the interconnections between different modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby enhancing the performance of AVS. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and minimizes reconstruction errors. Moreover, recognizing that many sounds are linked to object movements, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that may be challenging to capture using conventional optical flow methods. To showcase the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely recognized AVSBench benchmark. As a result, we establish a new state-of-the-art performance level in the AVS benchmark, particularly excelling in the challenging MS3 subset which involves segmenting multiple sound sources. Code is released at: https://github.com/OpenNLPLab/AVS-bidirectional.



Paperid:229
Authors:Yuze Hao, Jianrong Zhang, Tao Zhuo, Fuan Wen, Hehe Fan
Zhejiang University Beijing University of Posts and Telecommunications, Zhejiang University, Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Beijing University of Posts and Telecommunications Beijing Key Laboratory of Network System and Network Culture, Zhejiang University
Abstract:
Hands are the main medium when people interact with the world. Generating proper 3D motion for hand-object interaction is vital for applications such as virtual reality and robotics. Although grasp tracking or object manipulation synthesis can produce coarse hand motion, this kind of motion is inevitably noisy and full of jitter. To address this problem, we propose a data-driven method for coarse motion refinement. First, we design a hand-centric representation to describe the dynamic spatial-temporal relation between hands and objects. Compared to the object-centric representation, our hand-centric representation is straightforward and does not require an ambiguous projection process that converts object-based prediction into hand motion. Second, to capture the dynamic clues of hand-object interaction, we propose a new architecture that models the spatial and temporal structure in a hierarchical manner. Extensive experiments demonstrate that our method outperforms previous methods by a noticeable margin.



Paperid:230
Authors:Jingxuan He, Lechao Cheng, Chaowei Fang, Zunlei Feng, Tingting Mu, Mingli Song
Zhejiang Lab, Zhejiang Lab, Xidian University, Zhejiang University, University of Manchester, Zhejiang University
Abstract:
Compared to conventional semantic segmentation with pixel-level supervision, weakly supervised semantic segmentation (WSSS) with image-level labels poses the challenge that it commonly focuses on the most discriminative regions, resulting in a disparity between weakly and fully supervised scenarios. A typical manifestation is the diminished precision on object boundaries, leading to deteriorated accuracy of WSSS. To alleviate this issue, we propose to adaptively partition the image content into certain regions (e.g., confident foreground and background) and uncertain regions (e.g., object boundaries and misclassified categories) for separate processing. For uncertain cues, we propose an adaptive masking strategy and seek to recover the local information with self-distilled knowledge. We further assume that confident regions should be robust enough to preserve the global semantics, and introduce a complementary self-distillation method that constrains semantic consistency between confident regions and an augmented view with the same class labels. Extensive experiments conducted on PASCAL VOC 2012 and MS COCO 2014 demonstrate that our proposed single-stage approach for WSSS not only outperforms state-of-the-art counterparts but also surpasses multi-stage methods that trade complexity for accuracy.



Paperid:231
Authors:Qibin He
University of Chinese Academy of Sciences
Abstract:
Multi-modal image segmentation is one of the core issues in computer vision. The main challenge lies in integrating common information between modalities while retaining specific patterns for each modality. Existing methods typically perform full fine-tuning on RGB-based pre-trained parameters to inherit the powerful representation of the foundation model. Although effective, such a paradigm is not optimal due to weak transferability and scarce downstream data. Inspired by the recent success of prompt learning in language models, we propose the Grouping Prompt Tuning Framework (GoPT), which introduces explicit semantic grouping to learn modal-related prompts, adapting the frozen pre-trained foundation model to various downstream multi-modal segmentation tasks. Specifically, a class-aware uni-modal prompter is designed to balance intra- and inter-modal semantic propagation by grouping modality-specific class tokens, thereby improving the adaptability of spatial information. Furthermore, an alignment-induced cross-modal prompter is introduced to aggregate class-aware representations and share prompt parameters among different modalities to assist in modeling common statistics. Extensive experiments show the superiority of our GoPT, which achieves SOTA performance on various downstream multi-modal image segmentation tasks by training only < 1% of the model parameters.



Paperid:232
Authors:Ruian He, Shili Zhou, Yuqi Sun, Ri Cheng, Weimin Tan, Bo Yan
School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
Abstract:
With the rise of real-time rendering and the evolution of display devices, there is a growing demand for post-processing methods that offer high-resolution content at a high frame rate. Existing techniques often suffer from quality and latency issues due to the disjointed treatment of frame supersampling and extrapolation. In this paper, we recognize the shared context and mechanisms between frame supersampling and extrapolation, and present a novel framework, Space-time Supersampling (STSS). By integrating them into a unified framework, STSS can improve the overall quality with lower latency. To implement an efficient architecture, we treat aliasing and warping holes uniformly as reshading regions and put forth two key components to compensate for these regions, namely Random Reshading Masking (RRM) and Efficient Reshading Module (ERM). Extensive experiments demonstrate that our approach achieves superior visual fidelity compared to state-of-the-art (SOTA) methods. Notably, the performance is achieved within only 4ms, saving up to 75% of time against the conventional two-stage pipeline that necessitates 17ms.



Paperid:233
Authors:Tianyao He, Huabin Liu, Yuxi Li, Xiao Ma, Cheng Zhong, Yang Zhang, Weiyao Lin
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, AI Lab, Lenovo Research, AI Lab, Lenovo Research, AI Lab, Lenovo Research, Shanghai Jiao Tong university
Abstract:
Video Correlation Learning (VCL), which aims to analyze the relationships between videos, has been widely studied and applied in various general video tasks. However, applying VCL to instructional videos is still quite challenging due to their intrinsic procedural temporal structure. Specifically, procedural knowledge is critical for accurate correlation analyses on instructional videos. Nevertheless, current procedure-learning methods heavily rely on step-level annotations, which are costly and not scalable. To address this problem, we introduce a weakly supervised framework called Collaborative Procedure Alignment (CPA) for procedure-aware correlation learning on instructional videos. Our framework comprises two core modules: collaborative step mining and frame-to-step alignment. The collaborative step mining module enables simultaneous and consistent step segmentation for paired videos, leveraging the semantic and temporal similarity between frames. Based on the identified steps, the frame-to-step alignment module performs alignment between the frames and steps across videos. The alignment result serves as a measurement of the correlation distance between two videos. We instantiate our framework in two distinct instructional video tasks: sequence verification and action quality assessment. Extensive experiments validate the effectiveness of our approach in providing accurate and interpretable correlation analyses for instructional videos.



Paperid:234
Authors:Xuanhua He, Keyu Yan, Rui Li, Chengjun Xie, Jie Zhang, Man Zhou
Hefei Institutes of Physical Science, Chinese Academy of Sciences University of Science and Technology of China, Hefei Institutes of Physical Science, Chinese Academy of Sciences University of Science and Technology of China, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Nanyang Technological University
Abstract:
Pansharpening involves reconstructing missing high-frequency information in multi-spectral images with low spatial resolution, using a higher-resolution panchromatic image as guidance. Despite the inherent connection with the frequency domain, existing pan-sharpening research has scarcely investigated potential solutions in the frequency domain. To this end, we propose a novel Frequency Adaptive Mixture of Experts (FAME) learning framework for pan-sharpening, which consists of three key components: the Adaptive Frequency Separation Prediction Module, the Sub-Frequency Learning Expert Module, and the Expert Mixture Module. In detail, the first leverages the discrete cosine transform to perform frequency separation by predicting a frequency mask. On the basis of the generated mask, the second, with a low-frequency MoE and a high-frequency MoE, is responsible for effective low-frequency and high-frequency information reconstruction. Finally, the fusion module dynamically weights the high-frequency and low-frequency MoE knowledge to adapt to remote sensing images with significant content variations. Quantitative and qualitative experiments over multiple datasets demonstrate that our method performs the best against other state-of-the-art ones and possesses a strong generalization ability for real-world scenes. Code will be made publicly available at https://github.com/alexhe101/FAME-Net.
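The frequency-separation step can be pictured with a small sketch (illustrative only; the paper predicts the mask adaptively, whereas this stand-in keeps a fixed block of low-frequency DCT coefficients):

```python
import numpy as np
from scipy.fft import dctn, idctn

def split_frequencies(img, keep_ratio=0.25):
    """Split a single-channel image into low- and high-frequency parts via a
    2D DCT mask. A fixed square of low-frequency coefficients stands in for
    the learned, adaptively predicted mask."""
    coeffs = dctn(img, norm="ortho")
    h, w = img.shape
    mask = np.zeros_like(coeffs)
    mask[: int(h * keep_ratio), : int(w * keep_ratio)] = 1.0  # low-frequency corner
    low = idctn(coeffs * mask, norm="ortho")
    high = idctn(coeffs * (1.0 - mask), norm="ortho")
    return low, high  # by linearity, low + high reconstructs img

img = np.random.rand(64, 64)
low, high = split_frequencies(img)
assert np.allclose(low + high, img, atol=1e-6)
```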



Paperid:235
Authors:Xuanhua He, Tao Hu, Guoli Wang, Zejin Wang, Run Wang, Qian Zhang, Keyu Yan, Ziyi Chen, Rui Li, Chengjun Xie, Jie Zhang, Man Zhou
Hefei Institutes of Physical Science, Chinese Academy of Sciences University of Science and Technology of China, Hefei Institutes of Physical Science, Chinese Academy of Sciences University of Science and Technology of China, Horizon Robotics, Horizon Robotics, Horizon Robotics, Horizon Robotics, Hefei Institutes of Physical Science, Chinese Academy of Sciences University of Science and Technology of China, Tencent Technology, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Nanyang Technological University
Abstract:
RAW to sRGB mapping, which aims to convert RAW images from smartphones into RGB form equivalent to that of Digital Single-Lens Reflex (DSLR) cameras, has become an important area of research. However, current methods often ignore the difference between cell phone RAW images and DSLR camera RGB images, a difference that goes beyond the color matrix and extends to spatial structure due to resolution variations. Recent methods directly rebuild color mapping and spatial structure via shared deep representation, limiting optimal performance. Inspired by the Image Signal Processing (ISP) pipeline, which distinguishes image restoration and enhancement, we present a novel Neural ISP framework, named FourierISP. This approach breaks the image down into style and structure within the frequency domain, allowing for independent optimization. FourierISP comprises three subnetworks: the Phase Enhance Subnet for structural refinement, the Amplitude Refine Subnet for color learning, and the Color Adaptation Subnet for blending them in a smooth manner. This approach sharpens both color and structure, and extensive evaluations across varied datasets confirm that our approach realizes state-of-the-art results. Code will be available at https://github.com/alexhe101/FourierISP.
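A minimal sketch of the amplitude/phase decomposition this description relies on (the three subnets themselves are omitted; function names are ours):

```python
import torch

def amplitude_phase_split(x):
    """Decompose a (B, C, H, W) tensor into Fourier amplitude and phase, the two
    quantities the abstract associates with color/style and structure."""
    spec = torch.fft.fft2(x, norm="ortho")
    return spec.abs(), spec.angle()

def recombine(amplitude, phase):
    """Rebuild the spatial signal from (possibly separately processed) amplitude and phase."""
    spec = torch.polar(amplitude, phase)          # amplitude * exp(i * phase)
    return torch.fft.ifft2(spec, norm="ortho").real

x = torch.rand(1, 3, 32, 32)
amp, pha = amplitude_phase_split(x)
assert torch.allclose(recombine(amp, pha), x, atol=1e-5)
```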



Paperid:236
Authors:Nailei Hei, Qianyu Guo, Zihao Wang, Yan Wang, Haofen Wang, Wenqiang Zhang
Fudan University, Fudan University, Tongji University, Fudan University, Tongji University, Fudan University
Abstract:
Well-designed prompts have demonstrated the potential to guide text-to-image models in generating amazing images. Although existing prompt engineering methods can provide high-level guidance, it is challenging for novice users to achieve the desired results by manually entering prompts due to a discrepancy between novice-user-input prompts and the model-preferred prompts. To bridge the distribution gap between user input behavior and model training datasets, we first construct a novel Coarse-Fine Granularity Prompts dataset (CFP) and propose a novel User-Friendly Fine-Grained Text Generation framework (UF-FGTG) for automated prompt optimization. For CFP, we construct a novel dataset for text-to-image tasks that combines coarse and fine-grained prompts to facilitate the development of automated prompt generation methods. For UF-FGTG, we propose a novel framework that automatically translates user-input prompts into model-preferred prompts. Specifically, we propose a prompt refiner that continually rewrites prompts to empower users to select results that align with their unique needs. Meanwhile, we integrate image-related loss functions from the text-to-image model into the training process of text generation to generate model-preferred prompts. Additionally, we propose an adaptive feature extraction module to ensure diversity in the generated results. Experiments demonstrate that our approach is capable of generating more visually appealing and diverse images than previous state-of-the-art methods, achieving an average improvement of 5% across six quality and aesthetic metrics. Data and code are available at https://github.com/Naylenv/UF-FGTG.



Paperid:237
Authors:Or Hirschorn, Amir Jevnisek, Shai Avidan
Tel Aviv University, Tel Aviv University, Tel Aviv University
Abstract:
Vector image representation is a popular choice when editability and flexibility in resolution are desired. However, most images are only available in raster form, making raster-to-vector image conversion (vectorization) an important task. Classical methods for vectorization are either domain-specific or yield an abundance of shapes which limits editability and interpretability. Learning-based methods, that use differentiable rendering, have revolutionized vectorization, at the cost of poor generalization to out-of-training distribution domains, and optimization-based counterparts are either slow or produce non-editable and redundant shapes. In this work, we propose Optimize & Reduce (O&R), a top-down approach to vectorization that is both fast and domain-agnostic. O&R aims to attain a compact representation of input images by iteratively optimizing Bezier curve parameters and significantly reducing the number of shapes, using a devised importance measure. We contribute a benchmark of five datasets comprising images from a broad spectrum of image complexities - from emojis to natural-like images. Through extensive experiments on hundreds of images, we demonstrate that our method is domain agnostic and outperforms existing works in both reconstruction and perceptual quality for a fixed number of shapes. Moreover, we show that our algorithm is 10× faster than the state-of-the-art optimization-based method. Our code is publicly available: https://github.com/ajevnisek/optimize-and-reduce



Paperid:238
Authors:Nhat M. Hoang, Kehong Gong, Chuan Guo, Michael Bi Mi
Huawei Technologies Co., Ltd. Nanyang Technological University, Huawei Technologies Co., Ltd., Huawei Technologies Co., Ltd., Huawei Technologies Co., Ltd.
Abstract:
Controllable generation of 3D human motions becomes an important topic as the world embraces digital transformation. Existing works, though making promising progress with the advent of diffusion models, heavily rely on meticulously captured and annotated (e.g., text) high-quality motion corpus, a resource-intensive endeavor in the real world. This motivates our proposed MotionMix, a simple yet effective weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences. Specifically, we separate the denoising objectives of a diffusion model into two stages: obtaining conditional rough motion approximations in the initial T-T* steps by learning the noisy annotated motions, followed by the unconditional refinement of these preliminary motions during the last T* steps using unannotated motions. Notably, though learning from two sources of imperfect data, our model does not compromise motion generation quality compared to fully supervised approaches that access gold data. Extensive experiments on several benchmarks demonstrate that our MotionMix, as a versatile framework, consistently achieves state-of-the-art performance on text-to-motion, action-to-motion, and music-to-dance tasks.
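A toy sketch of how such a timestep-routed objective could look (the denoiser, noise schedule, and dimensions are placeholder assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

class TinyDenoiser(torch.nn.Module):
    """Hypothetical noise-prediction network standing in for the real motion denoiser."""
    def __init__(self, dim=32, cond_dim=16):
        super().__init__()
        self.cond_dim = cond_dim
        self.net = torch.nn.Linear(dim + cond_dim + 1, dim)

    def forward(self, x_t, t, cond=None):
        if cond is None:                              # unconditional branch
            cond = torch.zeros(x_t.shape[0], self.cond_dim)
        tt = torch.full((x_t.shape[0], 1), float(t))
        return self.net(torch.cat([x_t, cond, tt], dim=-1))

def motionmix_loss(model, annotated_motion, text_cond, unannotated_motion, t, t_star, T):
    """Route the denoising objective by timestep: t in [t_star, T) uses noisy annotated
    motions with the text condition, t in [0, t_star) uses unannotated motions without
    any condition. The cosine schedule is a toy stand-in."""
    alpha_bar = torch.cos(torch.tensor(t / T) * torch.pi / 2) ** 2
    x0 = annotated_motion if t >= t_star else unannotated_motion
    cond = text_cond if t >= t_star else None
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise   # forward diffusion
    return F.mse_loss(model(x_t, t, cond), noise)                    # predict the injected noise

model = TinyDenoiser()
loss = motionmix_loss(model, torch.randn(4, 32), torch.randn(4, 16),
                      torch.randn(4, 32), t=700, t_star=200, T=1000)
loss.backward()
```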



Paperid:239
Authors:Meghana Holla, Ismini Lourentzou
Virginia Tech, University of Illinois at Urbana-Champaign
Abstract:
Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.



Paperid:240
Authors:James Hong, Lu Yuan, Michaël Gharbi, Matthew Fisher, Kayvon Fatahalian
Stanford University, Stanford University, Adobe Research, Adobe Research, Stanford University
Abstract:
How to frame (or crop) a photo often depends on the image subject and its context; e.g., a human portrait. Recent works have defined the subject-aware image cropping task as a nuanced and practical version of image cropping. We propose a weakly-supervised approach (GenCrop) to learn what makes a high-quality, subject-aware crop from professional stock images. Unlike supervised prior work, GenCrop requires no new manual annotations beyond the existing stock image collection. The key challenge in learning from this data, however, is that the images are already cropped and we do not know what regions were removed. Our insight is to combine a library of stock images with a modern, pre-trained text-to-image diffusion model. The stock image collection provides diversity, and its images serve as pseudo-labels for a good crop. The text-image diffusion model is used to out-paint (i.e., outward inpainting) realistic uncropped images. Using this procedure, we are able to automatically generate a large dataset of cropped-uncropped training pairs to train a cropping model. Despite being weakly-supervised, GenCrop is competitive with state-of-the-art supervised methods and significantly better than comparable weakly-supervised baselines on quantitative and qualitative evaluation metrics.



Paperid:241
Authors:Chen Hou, Guoqiang Wei, Zhibo Chen
University of Science and Technology of China, Bytedance, University of Science and Technology of China
Abstract:
Diffusion models have attained remarkable success in the domains of image generation and editing. It is widely recognized that employing larger inversion and denoising steps in diffusion models leads to improved image reconstruction quality. However, the editing performance of diffusion models does not improve accordingly, even with increasing denoising steps. The deficiency in editing could be attributed to the conditional Markovian property of the editing process, where errors accumulate throughout denoising steps. To tackle this challenge, we first propose an innovative framework where a rectifier module is incorporated to modulate diffusion model weights with residual features from the original images, thereby providing compensatory information to bridge the fidelity gap. Furthermore, we introduce a novel learning paradigm aimed at minimizing error propagation during the editing process, which trains the editing procedure in a manner similar to denoising score-matching. Extensive experiments demonstrate that our proposed framework and training strategy achieve high-fidelity reconstruction and editing results across various levels of denoising steps, while exhibiting exceptional performance in terms of both quantitative metrics and qualitative assessments. Lastly, we explore our model's generalization through several applications like image-to-image translation and out-of-domain image editing.



Paperid:242
Authors:Chengyang Hu, Ke-Yue Zhang, Taiping Yao, Shice Liu, Shouhong Ding, Xin Tan, Lizhuang Ma
Shanghai Jiao Tong University, Youtu Lab, Tencent, Youtu Lab, Tencent, Youtu Lab, Tencent, Youtu Lab, Tencent, East China Normal University, Shanghai Jiao Tong University East China Normal University MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Abstract:
Multi-Domain Face Anti-Spoofing (MD-FAS) is a practical setting that aims to update models on new domains using only novel data while ensuring that the knowledge acquired from previous domains is not forgotten. Prior methods utilize the responses from models to represent the previous domain knowledge or map the different domains into separated feature spaces to prevent forgetting. However, due to domain gaps, the responses of new data are not as accurate as those of previous data. Also, without the supervision of previous data, separated feature spaces might be destroyed by new domains while updating, leading to catastrophic forgetting. Inspired by the challenges posed by the lack of previous data, we solve this issue from a new standpoint that generates hallucinated previous data for updating the FAS model. To this end, we propose a novel Domain-Hallucinated Updating (DHU) framework to facilitate the hallucination of data. Specifically, the Domain Information Explorer learns representative domain information of the previous domains. Then, the Domain Information Hallucination module transfers the new domain data to pseudo-previous domain ones. Moreover, the Hallucinated Features Joint Learning module is proposed to asymmetrically align the new and pseudo-previous data for real samples via dual levels to learn more generalized features, promoting the results on all domains. Our experimental results and visualizations demonstrate that the proposed method outperforms state-of-the-art competitors in terms of effectiveness.



Paperid:243
Authors:Chunyu Hu, Hong Zhang, Chao Liang, Hao Huang
Wuhan University, Wuhan University, Wuhan University, Wuhan University
Abstract:
Ranking aggregation (RA), the process of aggregating multiple rankings derived from multiple search strategies, has been proved effective in person re-identification (re-ID), because a single re-ID method cannot always achieve consistent superiority for different scenarios. Existing RA research mainly focuses on unsupervised and fully-supervised methods. The former lack external supervision to optimize performance, while the latter are costly because of the expensive labeling effort required for training. To address the above challenges, this paper proposes a quantum-inspired interactive ranking aggregation (QI-IRA) method, which (1) utilizes quantum theory to interpret and model the generation and aggregation of multiple basic rankings, and (2) approximates or even exceeds the performance of fully-supervised RA methods with much less labeling cost, even as low as only two feedbacks per query on the Market1501, MARS and DukeMTMC-VideoReID datasets. Comparative experiments conducted on six public re-ID datasets validate the superiority of the proposed QI-IRA method over existing unsupervised, interactive, and fully-supervised RA approaches.



Paperid:244
Authors:Haoxiang Hu, Cangjun Gao, Yaokun Li, Xiaoming Deng, YuKun Lai, Cuixia Ma, Yong-Jin Liu, Hongan Wang
Beijing Key Laboratory of Human-Computer Interaction, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Beijing Key Laboratory of Human-Computer Interaction, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Beijing Key Laboratory of Human-Computer Interaction, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Beijing Key Laboratory of Human-Computer Interaction, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Cardiff University, Beijing Key Laboratory of Human-Computer Interaction, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Tsinghua University, Beijing Key Laboratory of Human-Computer Interaction, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences
Abstract:
Online handwriting recognition is pivotal in domains like note-taking, education, healthcare, and office tasks. Existing diagram recognition algorithms mainly rely on the temporal information of strokes, resulting in a decline in recognition performance when dealing with notes that have been modified or have no temporal information. The current datasets are drawn based on templates and cannot reflect the real free-drawing situation. To address these challenges, we present SpaceGTN, a time-agnostic Graph Transformer Network, leveraging spatial integration and removing the need for temporal data. Extensive experiments on multiple datasets have demonstrated that our method consistently outperforms existing methods and achieves state-of-the-art performance. We also propose a pipeline that seamlessly connects offline and online handwritten diagrams. By integrating a stroke restoration technique with SpaceGTN, it enables intelligent editing of previously uneditable offline diagrams at the stroke level. In addition, we have also launched the first online handwritten diagram dataset, OHSD, which is collected using a free-drawing method and comes with modification annotations.



Paperid:245
Authors:Junxing Hu, Hongwen Zhang, Zerui Chen, Mengcheng Li, Yunlong Wang, Yebin Liu, Zhenan Sun
School of Artificial Intelligence, University of Chinese Academy of Sciences State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, Beijing Normal University, Inria, DI ENS, CNRS, PSL Research University, Tsinghua University, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Tsinghua University, School of Artificial Intelligence, University of Chinese Academy of Sciences State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Abstract:
Reconstructing hand-held objects from monocular RGB images is an appealing yet challenging task. In this task, contacts between hands and objects provide important cues for recovering the 3D geometry of the hand-held objects. Though recent works have employed implicit functions to achieve impressive progress, they ignore formulating contacts in their frameworks, which results in producing less realistic object meshes. In this work, we explore how to model contacts in an explicit way to benefit the implicit reconstruction of hand-held objects. Our method consists of two components: explicit contact prediction and implicit shape reconstruction. In the first part, we propose a new subtask of directly estimating 3D hand-object contacts from a single image. The part-level and vertex-level graph-based transformers are cascaded and jointly learned in a coarse-to-fine manner for more accurate contact probabilities. In the second part, we introduce a novel method to diffuse estimated contact states from the hand mesh surface to nearby 3D space and leverage diffused contact probabilities to construct the implicit neural representation for the manipulated object. Benefiting from estimating the interaction patterns between the hand and the object, our method can reconstruct more realistic object meshes, especially for object parts that are in contact with hands. Extensive experiments on challenging benchmarks show that the proposed method outperforms the current state of the art by a great margin. Our code is publicly available at https://junxinghu.github.io/projects/hoi.html.



Paperid:246
Authors:Ke Hu, Tongbo Cao, Yuan Li, Song Chen, Yi Kang
University of Science and Technology of China, Hefei, China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China Anhui University, Hefei, China, University of Science and Technology of China, Hefei, China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China, University of Science and Technology of China, Hefei, China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China, University of Science and Technology of China, Hefei, China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
Abstract:
3D object detection achieves good detection performance in autonomous driving. However, it requires substantial computational resources, which prevents its practical application. 2D object detection has less computational burden but lacks the spatial and geometric information embedded in depth. Therefore, we present DALDet, an efficient depth-aware learning-based 2D detector, achieving high-performance object detection for autonomous driving. We design an efficient one-stage detection framework and seamlessly integrate depth cues into the convolutional neural network by introducing depth-aware convolution and depth-aware average pooling, which effectively improve the detector's ability to perceive 3D space. Moreover, we propose a depth-guided loss function for training DALDet, which effectively improves the localization ability of the detector. Due to the use of the depth map, DALDet can also output the distance of the object, which is of great importance for driving applications such as obstacle avoidance. Extensive experiments demonstrate the superiority and efficiency of DALDet. In particular, our DALDet ranks 1st on both the KITTI Car and Cyclist 2D detection test leaderboards among all 2D detectors with high efficiency while yielding competitive performance among many leading 3D detectors. Code will be available at https://github.com/hukefy/DALDet.



Paperid:247
Authors:Lianyu Hu, Liqing Gao, Zekang Liu, Chi-Man Pun, Wei Feng
College of Intelligence and Computing, Tianjin University, China, College of Intelligence and Computing, Tianjin University, China, College of Intelligence and Computing, Tianjin University, China, Department of Computer and Information Science, University of Macau, China, College of Intelligence and Computing, Tianjin University, China
Abstract:
Pre-trained large-scale vision-language models such as CLIP have demonstrated excellent generalizability over a series of downstream tasks. However, they are sensitive to the variation of input text prompts and need a selection of prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to dynamically learn the prompts as textual inputs to avoid the requirement of laborious hand-crafted prompt engineering in the fine-tuning process. We notice that these methods are suboptimal in two aspects. First, the prompts of the vision and language branches in these methods are usually separated or uni-directionally correlated. Thus, the prompts of both branches are not fully correlated and may not provide enough guidance to align the representations of both branches. Second, it is observed that most previous methods usually achieve better performance on seen classes but cause performance degeneration on unseen classes compared to CLIP. This is because the essential generic knowledge learned in the pre-training stage is partly forgotten in the fine-tuning process. In this paper, we propose Co-Articulated Multi-Modal Learning (COMMA) to handle the above limitations. In particular, our method considers prompts from both branches when generating the prompts, to enhance the representation alignment of both branches. Besides, to alleviate forgetting of the essential knowledge, we minimize the feature discrepancy between the learned prompts and the embeddings of hand-crafted prompts in the pre-trained CLIP in the late transformer layers. We evaluate our method across three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Experimental results demonstrate the superiority of our method by exhibiting a favorable performance boost upon all tasks with high efficiency. Code is available at https://github.com/hulianyuyy/COMMA.



Paperid:248
Authors:Vincent Tao Hu, Wei Zhang, Meng Tang, Pascal Mettes, Deli Zhao, Cees Snoek
University of Amsterdam, University of Amsterdam, University of California Merced, University of Amsterdam, Alibaba Group, University of Amsterdam
Abstract:
This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but its latent structure and editing ability are as of yet unknown. We therefore adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call u-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, all while being highly effective at editing images while preserving the essence of the original content. Our code will be publicly available at https://taohu.me/lfm/



Paperid:249
Authors:Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu
University of California, San Diego, Coinbase Global, Inc., University of California, San Diego, University of California, San Diego, University of California, San Diego, University of California, San Diego
Abstract:
Vision Language Models (VLMs), which extend Large Language Models (LLM) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited to the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model in capturing intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance in processing text-rich VQA benchmarks (up to 17.76% on the OCR-VQA benchmark) and in undertaking general (not particularly text-rich) VQA benchmarks (up to 7.9% on the Visual Spatial Reasoning benchmark), and achieves a 17.72% overall improvement on a comprehensive multimodal LLM benchmark (MME), compared to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 11 diverse categories. For researchers interested in further exploration, our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.
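An illustrative sketch of the general idea of concatenating learned query embeddings with directly projected patch embeddings into one sequence of soft visual prompts (dimensions and module names are assumptions, not BLIVA's actual code):

```python
import torch
import torch.nn as nn

class VisualPromptAssembler(nn.Module):
    """Illustrative only: combine Q-Former-style query embeddings with directly
    projected ViT patch embeddings into a single sequence of soft prompts that
    would be prepended to the LLM's text embeddings."""
    def __init__(self, vis_dim=768, llm_dim=4096):
        super().__init__()
        self.query_proj = nn.Linear(vis_dim, llm_dim)   # path 1: learned query embeddings
        self.patch_proj = nn.Linear(vis_dim, llm_dim)   # path 2: raw patch embeddings

    def forward(self, query_embeds, patch_embeds):
        # query_embeds: (B, num_queries, vis_dim); patch_embeds: (B, num_patches, vis_dim)
        return torch.cat(
            [self.query_proj(query_embeds), self.patch_proj(patch_embeds)], dim=1
        )  # (B, num_queries + num_patches, llm_dim)

assembler = VisualPromptAssembler()
prompts = assembler(torch.randn(1, 32, 768), torch.randn(1, 256, 768))
print(prompts.shape)  # torch.Size([1, 288, 4096])
```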



Paperid:250
Authors:Xiaoming Hu, Zilei Wang
University of Science and Technology of China, University of Science and Technology of China
Abstract:
To tackle the challenge of recognizing images of unseen attribute-object compositions, Compositional Zero-Shot Learning (CZSL) methods have previously been proposed. However, test images in realistic scenarios may also incorporate other forms of unknown factors, such as novel semantic concepts or novel image styles. As previous CZSL works have overlooked this critical issue, in this research we first propose the Realistic Compositional Zero-Shot Learning (RCZSL) task, which considers the various types of unknown factors in a unified experimental setting. To achieve this, we first conduct re-labelling on MIT-States and use pre-trained generative models to obtain images of various domains. Then the entire dataset is split into a training set and a test set, with the latter containing images of unseen concepts, unseen compositions, unseen domains as well as their combinations. Following this, we show that the visual-semantic relationship changes on unseen images, leading us to construct two dynamic modulators to adapt the visual features and composition prototypes in accordance with the input image. We believe that such a dynamic learning method could effectively alleviate the domain shift problem caused by various types of unknown factors. We conduct extensive experiments on benchmark datasets for both the conventional CZSL setting and the proposed RCZSL setting. The effectiveness of our method has been proven by empirical results, which significantly outperformed both our baseline method and state-of-the-art approaches.



Paperid:251
Authors:Youbing Hu, Yun Cheng, Anqi Lu, Zhiqiang Cao, Dawei Wei, Jie Liu, Zhijun Li
Harbin Institute of Technology, Swiss Data Science Center, Zurich, Switzerland, Harbin Institute of Technology, Harbin Institute of technology, Xidian University, Harbin Institute of Technology, Harbin Institute of Technology
Abstract:
The Vision Transformer (ViT) excels in accuracy when handling high-resolution images, yet it confronts the challenge of significant spatial redundancy, leading to increased computational and memory requirements. To address this, we present the Localization and Focus Vision Transformer (LF-ViT). This model operates by strategically curtailing computational demands without impinging on performance. In the Localization phase, a reduced-resolution image is processed; if a definitive prediction remains elusive, our pioneering Neighborhood Global Class Attention (NGCA) mechanism is triggered, effectively identifying and spotlighting class-discriminative regions based on the initial findings. Subsequently, in the Focus phase, the designated region of the original image is used to enhance recognition. Uniquely, LF-ViT employs consistent parameters across both phases, ensuring seamless end-to-end optimization. Our empirical tests affirm LF-ViT's prowess: it remarkably decreases DeiT-S's FLOPs by 63% and concurrently amplifies throughput twofold. Code of this project is at https://github.com/edgeai1/LF-ViT.git.
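A simplified sketch of a two-phase, early-exit inference loop in this spirit (the region selection is replaced by a center crop and the classifier is a stand-in; batch size 1 is assumed for the confidence check):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def two_phase_inference(model, image, threshold=0.7, low_res=112, full_res=224):
    """Classify a down-sampled image first and exit early if confident; otherwise
    re-classify a region of the original image. The paper localizes this region
    with class attention; a center crop stands in here."""
    coarse = F.interpolate(image, size=(low_res, low_res), mode="bilinear", align_corners=False)
    probs = model(coarse).softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    if conf.item() >= threshold:
        return pred, "localization phase"
    top = (full_res - low_res) // 2                     # focus phase: crop the original image
    crop = image[..., top:top + low_res, top:top + low_res]
    return model(crop).softmax(dim=-1).argmax(dim=-1), "focus phase"

class DummyClassifier(torch.nn.Module):
    """Stand-in classifier that pools any input resolution down to 10 logits."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.head = torch.nn.Linear(3, num_classes)
    def forward(self, x):
        return self.head(x.mean(dim=(2, 3)))            # global average pool over H, W

pred, phase = two_phase_inference(DummyClassifier(), torch.rand(1, 3, 224, 224))
print(pred.item(), phase)
```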



Paperid:252
Authors:Yubin Hu, Sheng Ye, Wang Zhao, Matthieu Lin, Yuze He, Yu-Hui Wen, Ying He, Yong-Jin Liu
Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University, Beijing Jiaotong University, Nanyang Technological University, Tsinghua University
Abstract:
Occlusion is a common issue in 3D reconstruction from RGB-D videos, often blocking the complete reconstruction of objects and presenting an ongoing problem. In this paper, we propose a novel framework, empowered by a 2D diffusion-based in-painting model, to reconstruct complete surfaces for the hidden parts of objects. Specifically, we utilize a pre-trained diffusion model to fill in the hidden areas of 2D images. Then we use these in-painted images to optimize a neural implicit surface representation for each instance for 3D reconstruction. Since creating the in-painting masks needed for this process is tricky, we adopt a human-in-the-loop strategy that involves very little human engagement to generate high-quality masks. Moreover, some parts of objects can be totally hidden because the videos are usually shot from limited perspectives. To ensure recovering these invisible areas, we develop a cascaded network architecture for predicting signed distance field, making use of different frequency bands of positional encoding and maintaining overall smoothness. Besides the commonly used rendering loss, Eikonal loss, and silhouette loss, we adopt a CLIP-based semantic consistency loss to guide the surface from unseen camera angles. Experiments on ScanNet scenes show that our proposed framework achieves state-of-the-art accuracy and completeness in object-level reconstruction from scene-level RGB-D videos. Code: https://github.com/THU-LYJ-Lab/O2-Recon.



Paperid:253
Authors:Cong Huang, Jiahao Li, Lei Chu, Dong Liu, Yan Lu
University of Science and Technology of China, Microsoft Research Asia, Microsoft Research Asia, University of Science and Technology of China, Microsoft Research Asia
Abstract:
We propose a Dynamic Context-Guided Upsampling (DCGU) module for video super-resolution (VSR) that leverages temporal context guidance to achieve efficient and effective arbitrary-scale VSR. While most VSR research focuses on backbone design, the importance of the upsampling part is often overlooked. Existing methods rely on pixel-shuffle-based upsampling, which has limited capabilities in handling arbitrary upsampling scales. Recent attempts to replace pixel-shuffle-based modules with implicit neural function-based and filter-based approaches suffer from slow inference speeds and limited representation capacity, respectively. To overcome these limitations, our DCGU module predicts non-local sampling locations and content-dependent filter weights, enabling efficient and effective arbitrary-scale VSR. Our proposed multi-granularity location search module efficiently identifies non-local sampling locations across the entire low-resolution grid, and the temporal bilateral filter modulation module integrates content information with the filter weights to enhance texture details. Extensive experiments demonstrate the superiority of our method in terms of performance and speed on arbitrary-scale VSR.
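A toy sketch of the underlying idea of predicting sampling locations and content-dependent blending weights for arbitrary-scale upsampling (layer shapes, the number of samples per location, and the offset range are all assumptions of ours, not the DCGU design itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSamplingUpsampler(nn.Module):
    """Illustrative arbitrary-scale upsampler: for each high-resolution location,
    predict K sampling offsets and K content-dependent weights, gather the
    low-resolution features at those locations, and blend them."""
    def __init__(self, channels=16, k=4):
        super().__init__()
        self.k = k
        self.offset_head = nn.Conv2d(channels, 2 * k, 1)   # (dx, dy) per sample
        self.weight_head = nn.Conv2d(channels, k, 1)

    def forward(self, feat, scale):
        b, c, h, w = feat.shape
        H, W = int(h * scale), int(w * scale)
        up = F.interpolate(feat, size=(H, W), mode="bilinear", align_corners=False)
        offsets = self.offset_head(up).tanh() * 0.1         # small offsets in [-1, 1] coords
        weights = self.weight_head(up).softmax(dim=1)
        ys, xs = torch.linspace(-1, 1, H), torch.linspace(-1, 1, W)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack([gx, gy], dim=-1).expand(b, H, W, 2)
        out = 0.0
        for i in range(self.k):
            grid = base + offsets[:, 2 * i:2 * i + 2].permute(0, 2, 3, 1)
            sampled = F.grid_sample(feat, grid, mode="bilinear", align_corners=False)
            out = out + weights[:, i:i + 1] * sampled        # content-dependent blend
        return out  # (b, c, H, W)

up = DynamicSamplingUpsampler()
print(up(torch.randn(1, 16, 24, 24), scale=2.5).shape)  # torch.Size([1, 16, 60, 60])
```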



Paperid:254
Authors:Fuxiang Huang, Lei Zhang, Xiaowei Fu, Suqi Song
Chongqing University, Chongqing University, Chongqing University, Chongqing University
Abstract:
Mixed-Modal Image Retrieval (MMIR), as a flexible search paradigm, has attracted wide attention. However, previous approaches always achieve limited performance because two critical factors are seriously overlooked. 1) The contribution of the image and text modalities is different, but they are incorrectly treated equally. 2) There exist inherent labeling noises in describing users' intentions with text in web datasets from diverse real-world scenarios, giving rise to overfitting. We propose a Dynamic Weighted Combiner (DWC) to tackle the above challenges, which includes three merits. First, we propose an Editable Modality De-equalizer (EMD) by taking into account the contribution disparity between modalities, containing two modality feature editors and an adaptive weighted combiner. Second, to alleviate labeling noises and data bias, we propose a dynamic soft-similarity label generator (SSG) to implicitly improve noisy supervision. Finally, to bridge modality gaps and facilitate similarity learning, we propose a CLIP-based mutual enhancement module alternately trained by a mixed-modality contrastive loss. Extensive experiments verify that our proposed model significantly outperforms state-of-the-art methods on real-world datasets. The source code is available at https://github.com/fuxianghuang1/DWC.
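The unequal-contribution idea can be illustrated with a small gated combiner (purely a sketch under our own assumptions; the paper's EMD additionally edits each modality's features before combining):

```python
import torch
import torch.nn as nn

class AdaptiveWeightedCombiner(nn.Module):
    """Illustrative combiner: predict a per-sample weight deciding how much the
    image feature vs. the text feature contributes to the fused query, instead
    of treating the two modalities equally."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )

    def forward(self, img_feat, txt_feat):
        w = self.gate(torch.cat([img_feat, txt_feat], dim=-1))   # (B, 1) in (0, 1)
        fused = w * img_feat + (1.0 - w) * txt_feat
        return nn.functional.normalize(fused, dim=-1)

combiner = AdaptiveWeightedCombiner()
q = combiner(torch.randn(8, 256), torch.randn(8, 256))
print(q.shape)  # torch.Size([8, 256])
```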



Paperid:255
Authors:Han Huang, Yulun Wu, Junsheng Zhou, Ge Gao, Ming Gu, Yu-Shen Liu
Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, School of Software, Tsinghua University, Beijing, China
Abstract:
Recently, neural implicit functions have demonstrated remarkable results in the field of multi-view reconstruction. However, most existing methods are tailored for dense views and exhibit unsatisfactory performance when dealing with sparse views. Several recent methods have been proposed to generalize implicit reconstruction to the sparse view reconstruction task, but they still suffer from high training costs and are merely valid under carefully selected perspectives. In this paper, we propose a novel sparse view reconstruction framework that leverages on-surface priors to achieve highly faithful surface reconstruction. Specifically, we design several constraints on global geometry alignment and local geometry refinement for jointly optimizing coarse shapes and fine details. To achieve this, we train a neural network to learn a global implicit field from the on-surface points obtained from SfM and then leverage it as a coarse geometric constraint. To exploit local geometric consistency, we project on-surface points onto seen and unseen views, treating the consistency loss of projected features as a fine geometric constraint. The experimental results with DTU and BlendedMVS datasets in two prevalent sparse settings demonstrate significant improvements over the state-of-the-art methods.
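The global on-surface constraint can be sketched as follows (a toy SDF network with an on-surface loss on SfM points plus an Eikonal regularizer; the loss weight and architecture are assumptions):

```python
import torch
import torch.nn as nn

class SDFNet(nn.Module):
    """Tiny MLP signed-distance field used only to illustrate the losses."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1),
        )
    def forward(self, x):
        return self.net(x)

def on_surface_and_eikonal_loss(sdf, surface_pts, space_pts):
    """SfM on-surface points should have (near-)zero signed distance, while random
    space points satisfy the Eikonal property |grad f| = 1."""
    surf_loss = sdf(surface_pts).abs().mean()             # f(x) ~ 0 on the surface
    space_pts = space_pts.requires_grad_(True)
    d = sdf(space_pts)
    grad = torch.autograd.grad(d.sum(), space_pts, create_graph=True)[0]
    eikonal = ((grad.norm(dim=-1) - 1.0) ** 2).mean()
    return surf_loss + 0.1 * eikonal

sdf = SDFNet()
loss = on_surface_and_eikonal_loss(sdf, torch.randn(128, 3), torch.randn(128, 3))
loss.backward()
```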



Paperid:256
Authors:Haofeng Huang, Wenhan Yang, Lingyu Duan, Jiaying Liu
Peking University, Peng Cheng Laboratory, Peking University, Peking University
Abstract:
Enhancing low-light videos in a supervised style presents a set of challenges, including limited data diversity, misalignment, and the domain gap introduced through the dataset construction pipeline. Our paper tackles these challenges by constructing a self-learned enhancement approach that removes the reliance on any external training data. The challenge of self-supervised learning lies in fitting high-quality signal representations solely from input signals. Our work designs a bottleneck neural representation mechanism that extracts those signals. In more detail, we encode the frame-wise representation with a compact deep embedding and utilize a neural network to parameterize the video-level manifold consistently. Then, an entropy constraint is applied to the enhanced results based on the adjacent spatial-temporal context to filter out degraded visual signals, e.g., noise and frame inconsistency. Last, a novel Chromatic Retinex decomposition is proposed to effectively align the reflectance distribution temporally. It benefits the entropy control on different components of each frame and facilitates noise-to-noise training, successfully suppressing the temporal flicker. Extensive experiments demonstrate the robustness and superior effectiveness of our proposed method. Our project is publicly available at: https://huangerbai.github.io/SLBNR/.



Paperid:257
Authors:Huimin Huang, Yawen Huang, Shiao Xie, Lanfen Lin, Ruofeng Tong, Yen-Wei Chen, Yuexiang Li, Yefeng Zheng
Zhejiang University, Jarvis Research Center, Tencent YouTu Lab, Zhejiang University, Zhejiang University, Zhejiang University Zhejiang Lab, Ritsumeikan University, Medical AI Research Group, Guangxi Medical University, Jarvis Research Center, Tencent YouTu Lab
Abstract:
Semi-supervised learning (SSL), as one of the dominant methods, aims at leveraging unlabeled data to deal with the annotation dilemma of supervised learning, and has attracted much attention in medical image segmentation. Most existing approaches leverage a unitary network built on convolutional neural networks (CNNs) with compulsory consistency of the predictions through small perturbations applied to inputs or models. The penalties of such a learning paradigm are that (1) CNN-based models place severe limitations on global learning and (2) rich and diverse class-level distributions are inhibited. In this paper, we present a novel CNN-Transformer learning framework in the manifold space for semi-supervised medical image segmentation. First, at the intra-student level, we propose a novel class-wise consistency loss to facilitate the learning of both discriminative and compact target feature representations. Then, at the inter-student level, we align the CNN and Transformer features using a prototype-based optimal transport method. Extensive experiments show that our method outperforms previous state-of-the-art methods on three public medical image segmentation benchmarks.



Paperid:258
Authors:Jiaxin Huang, Qi Wu, Yazhou Ren, Fan Yang, Aodi Yang, Qianqian Yang, Xiaorong Pu
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Cross-domain medical image reconstruction aims to address the issue that deep learning models trained solely on one source dataset might not generalize effectively to unseen target datasets from different hospitals. Some recent methods achieve satisfactory reconstruction performance, but often at the expense of extensive parameters and time consumption. To strike a balance between cross-domain image reconstruction quality and model computational efficiency, we propose a lightweight sparse Bayesian deep learning method. Notably, we apply a fixed-form variational Bayes (FFVB) approach to quantify pixel-wise uncertainty priors derived from the degradation distribution of the source domain. Furthermore, by integrating the uncertainty prior into the posterior sampled through stochastic gradient Langevin dynamics (SGLD), we develop a training strategy that dynamically generates and optimizes the prior distribution on the network weights for each unseen domain. This strategy enhances generalizability and ensures robust reconstruction performance. When evaluated on medical image reconstruction tasks, our proposed approach demonstrates impressive performance across various previously unseen domains.
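For readers unfamiliar with SGLD, the following is a minimal sketch of a single weight-sampling step under a simple Gaussian prior; the isotropic prior and the toy linear "reconstruction" model are placeholders for the FFVB-derived pixel-wise uncertainty prior and the actual network described above.

```python
import torch

def sgld_step(params, loss, lr=1e-4, prior_precision=1.0):
    """One SGLD update: a gradient step on (data loss + negative log prior) plus Gaussian noise.

    The isotropic Gaussian prior (precision * ||w||^2 / 2) is an assumption standing in
    for the FFVB-derived pixel-wise uncertainty prior from the abstract.
    """
    grads = torch.autograd.grad(loss, params)
    updated = []
    for p, g in zip(params, grads):
        drift = g + prior_precision * p                       # gradient of the negative log posterior
        noise = torch.randn_like(p) * (2.0 * lr) ** 0.5       # injected Langevin noise
        updated.append((p - lr * drift + noise).detach().requires_grad_(True))
    return updated

# Toy usage: sample the weights of a linear "reconstruction" model.
w = torch.randn(8, 8, requires_grad=True)
x, y = torch.randn(4, 8), torch.randn(4, 8)
for _ in range(10):
    (w,) = sgld_step([w], ((x @ w - y) ** 2).mean())
```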



Paperid:259
Authors:Junjia Huang, Haofeng Li, Xiang Wan, Guanbin Li
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China GuangDong Province Key Laboratory of Information Security Technology
Abstract:
The recognition of multi-class cell nuclei can significantly facilitate the process of histopathological diagnosis. Numerous pathological datasets are currently available, but their annotations are inconsistent. Most existing methods require individual training on each dataset to deduce the relevant labels and lack the use of common knowledge across datasets, consequently restricting the quality of recognition. In this paper, we propose a universal cell nucleus classification framework (UniCell), which employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains. In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting various datasets. Moreover, we develop a Dynamic Prompt Module (DPM) that exploits the properties of multiple datasets to enhance features. The DPM first integrates the embeddings of datasets and semantic categories, and then employs the integrated prompts to refine image representations, efficiently harvesting the shared knowledge among the related cell types and data sources. Experimental results demonstrate that the proposed method effectively achieves the state-of-the-art results on four nucleus detection and classification benchmarks. Code and models are available at https://github.com/lhaof/UniCell



Paperid:260
Authors:Shi-Sheng Huang, Zixin Zou, Yichi Zhang, Yan-Pei Cao, Ying Shan
Beijing Normal University, Tsinghua University, Beijing Institute of Technology, ARC Lab, Tencent PCG, ARC Lab, Tencent PCG
Abstract:
The recent neural surface reconstruction approaches using volume rendering have made much progress by achieving impressive surface reconstruction quality, but are still limited to dense and highly accurate posed views. To overcome such drawbacks, this paper pays special attention to consistent surface reconstruction from sparse views with noisy camera poses. Unlike previous approaches, the key difference of this paper is to exploit the multi-view constraints directly from the explicit geometry of the neural surface, which can be used as effective regularization to jointly learn the neural surface and refine the camera poses. To build effective multi-view constraints, we introduce a fast differentiable on-surface intersection to generate on-surface points, and propose view-consistent losses on such differentiable points to regularize the neural surface learning. Building on this, we propose a joint learning strategy, named SC-NeuS, to perform geometry-consistent surface reconstruction in an end-to-end manner. With extensive evaluation on public datasets, our SC-NeuS can achieve consistently better surface reconstruction results with fine-grained details than previous approaches, especially from sparse and noisy camera views. The source code is available at https://github.com/zouzx/sc-neus.git.



Paperid:261
Authors:Shuying Huang, Ge Chen, Yong Yang, Xiaozheng Wang, Chenbin Liang
Tiangong University, Tiangong University, Tiangong University, Tiangong University, Tiangong University
Abstract:
Due to the unique environment and inherent properties of magnetic resonance imaging (MRI) instruments, MR images typically have lower resolution. Therefore, improving the resolution of MR images is beneficial for assisting doctors in diagnosing the condition. Currently, existing MR image super-resolution (SR) methods still suffer from insufficient detail reconstruction. To overcome this issue, this paper proposes a multi-level feature transfer network (MFTN) based on MRI-Transformer to realize SR of low-resolution MRI data. MFTN consists of a multi-scale feature reconstruction network (MFRN) and a multi-level feature extraction branch (MFEB). MFRN is constructed as a pyramid structure to gradually reconstruct image features at different scales by integrating the features obtained from MFEB, and MFEB is constructed to provide detail information at different scales for low-resolution MR image SR reconstruction by constructing multiple MRI-Transformer modules. Each MRI-Transformer module is designed to learn the transfer features from the reference image by establishing feature correlations between the reference image and the low-resolution MR image. In addition, a contrastive learning constraint is added to the loss function to enhance the texture details of the SR image. A large number of experiments show that our network can effectively reconstruct high-quality MR images and achieves better performance compared to some state-of-the-art methods. The source code of this work will be released on GitHub.



Paperid:262
Authors:Wenmin Huang, Weiqi Luo, Jiwu Huang, Xiaochun Cao
Sun Yat-sen University, Sun Yat-sen University, Shenzhen University, Sun Yat-sen University
Abstract:
Facial attribute editing has garnered significant attention, yet prevailing methods struggle with achieving precise attribute manipulation while preserving irrelevant details and controlling attribute styles. This challenge primarily arises from the strong correlations between different attributes and the interplay between attributes and identity. In this paper, we propose Semantic Disentangled GAN (SDGAN), a novel method addressing this challenge. SDGAN introduces two key concepts: a semantic disentanglement generator that assigns facial representations to distinct attribute-specific editing modules, enabling the decoupling of the facial attribute editing process, and a semantic mask alignment strategy that confines attribute editing to appropriate regions, thereby avoiding undesired modifications. Leveraging these concepts, SDGAN demonstrates accurate attribute editing and achieves high-quality attribute style manipulation through both latent-guided and reference-guided manners. We extensively evaluate our method on the CelebA-HQ database, providing both qualitative and quantitative analyses. Our results establish that SDGAN significantly outperforms state-of-the-art techniques, showcasing the effectiveness of our approach. To foster reproducibility and further research, we will provide the code for our method.



Paperid:263
Authors:Xiaoshui Huang, Zhou Huang, Sheng Li, Wentao Qu, Tong He, Yuenan Hou, Yifan Zuo, Wanli Ouyang
Shanghai AI Laboratory, Jiangxi University of Finance and Economics, University of Electronic Science and Technology of China, Nanjing University of Science and Technology, Shanghai AI Laboratory, Shanghai AI Laboratory, Jiangxi University of Finance and Economics, Shanghai AI Laboratory
Abstract:
The pretrain-finetune paradigm has achieved great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field due to the limited amount of point cloud sequences. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP transformer. Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a series of local patches, which are converted to token embeddings by the designed point cloud tokenizer. These token embeddings are concatenated with a task token and fed into the frozen CLIP transformer to learn point cloud representation. The intuition is that the proposed point cloud tokenizer projects the input point cloud into a unified token space that is similar to the 2D images. Comprehensive experiments on 3D detection, semantic segmentation, classification and few-shot learning demonstrate that the CLIP transformer can serve as an efficient point cloud encoder and our method achieves promising performance on both indoor and outdoor benchmarks. In particular, performance gains brought by our EPCL are 19.7 AP50 on ScanNet V2 detection, 4.4 mIoU on S3DIS segmentation and 1.2 mIoU on SemanticKITTI segmentation compared to contemporary pretrained models. Code is available at https://github.com/XiaoshuiHuang/EPCL.
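The tokenize-then-frozen-transformer idea can be sketched as follows. This is a minimal illustration only: the patch grouping is a naive reshape, and a small randomly initialized transformer encoder stands in for the frozen CLIP transformer; neither matches the paper's actual tokenizer or backbone.

```python
import torch
import torch.nn as nn

class PointTokenizer(nn.Module):
    """Groups a point cloud into fixed-size local patches and embeds each patch as a token."""
    def __init__(self, points_per_patch=32, dim=256):
        super().__init__()
        self.points_per_patch = points_per_patch
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, xyz):                              # xyz: (B, N, 3), N divisible by patch size
        B, N, _ = xyz.shape
        patches = xyz.view(B, N // self.points_per_patch, self.points_per_patch, 3)
        return self.mlp(patches).max(dim=2).values       # (B, num_patches, dim) pooled patch tokens

class EPCLLikeEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.tokenizer = PointTokenizer(dim=dim)
        self.task_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Stand-in for the frozen CLIP transformer: weights are kept fixed during training.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.frozen_transformer = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.frozen_transformer.parameters():
            p.requires_grad = False

    def forward(self, xyz):
        tokens = self.tokenizer(xyz)
        task = self.task_token.expand(xyz.shape[0], -1, -1)
        return self.frozen_transformer(torch.cat([task, tokens], dim=1))

feats = EPCLLikeEncoder()(torch.randn(2, 1024, 3))       # (2, 1 + 32, 256)
```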



Paperid:264
Authors:Xin Huang, Yunfeng Bai, Dong Liang, Feng Tian, Jinyuan Jia
Tongji University, Tongji University, Tongji University, Duke Kunshan University, Tongji University
Abstract:
Existing GAN-based approaches to caricature generation mainly focus on exaggerating a character's global facial structure. This often leads to failure in highlighting significant facial features such as big eyes and a hook nose. To address this limitation, we propose a new approach termed G2L-CariGAN, which uses feature maps of spatial dimensions instead of latent codes for geometric exaggeration. G2L-CariGAN first exaggerates the global facial structure of the character on a low-dimensional feature map and then exaggerates its local facial features on a high-dimensional feature map. Moreover, we develop a caricature identity loss function based on feature maps, which well retains the character's identity after exaggeration. Our experiments have demonstrated that G2L-CariGAN outperforms the state-of-the-art methods in terms of the quality of exaggerating a character and retaining its identity.



Paperid:265
Authors:Xuan Huang, Hanhui Li, Zejun Yang, Zhisheng Wang, Xiaodan Liang
Shenzhen Campus of Sun Yat-sen University, Shenzhen, China, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China, Tencent, Shenzhen, China, Tencent, Shenzhen, China, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China DarkMatter AI Research, Guangzhou, China
Abstract:
Neural radiance fields (NeRFs) are promising 3D representations for scenes, objects, and humans. However, most existing methods require multi-view inputs and per-scene training, which limits their real-life applications. Moreover, current methods focus on single-subject cases, leaving scenes of interacting hands, which involve severe inter-hand occlusions and challenging view variations, unsolved. To tackle these issues, this paper proposes a generalizable visibility-aware NeRF (VA-NeRF) framework for interacting hands. Specifically, given an image of interacting hands as input, our VA-NeRF first obtains a mesh-based representation of the hands and extracts their corresponding geometric and textural features. Subsequently, a feature fusion module that exploits the visibility of query points and mesh vertices is introduced to adaptively merge features of both hands, enabling the recovery of features in unseen areas. Additionally, our VA-NeRF is optimized together with a novel discriminator within an adversarial learning paradigm. In contrast to conventional discriminators that predict a single real/fake label for the synthesized image, the proposed discriminator generates a pixel-wise visibility map, providing fine-grained supervision for unseen areas and encouraging the VA-NeRF to improve the visual quality of synthesized images. Experiments on the Interhand2.6M dataset demonstrate that our proposed VA-NeRF outperforms conventional NeRFs significantly. Project Page: https://github.com/XuanHuang0/VANeRF.



Paperid:266
Authors:Xun Huang, Hai Wu, Xin Li, Xiaoliang Fan, Chenglu Wen, Cheng Wang
Xiamen University, Xiamen University, Texas A&M University, Xiamen University, Xiamen University, Xiamen University
Abstract:
LiDAR-based 3D object detection models inevitably struggle under rainy conditions due to the degraded and noisy scanning signals. Previous research has attempted to address this by simulating the noise from rain to improve the robustness of detection models. However, significant disparities exist between simulated and actual rain-impacted data points. In this work, we propose a novel rain simulation method, termed DRET, that unifies Dynamics and Rainy Environment Theory to provide a cost-effective means of expanding the available realistic rain data for 3D detection training. Furthermore, we present a Sunny-to-Rainy Knowledge Distillation (SRKD) approach to enhance 3D detection under rainy conditions. Extensive experiments on the Waymo-Open-Dataset show that, when combined with the state-of-the-art DSVT model and other classical 3D detectors, our proposed framework demonstrates significant detection accuracy improvements, without losing efficiency. Remarkably, our framework also improves detection capabilities under sunny conditions, therefore offering a robust solution for 3D detection regardless of whether the weather is rainy or sunny.



Paperid:267
Authors:Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, Wen Zhang
Zhejiang University, Fuxi AI Lab, Netease Inc., Zhejiang University, Fuxi AI Lab, Netease Inc. Zhejiang University, Fuxi AI Lab, Netease Inc., Fuxi AI Lab, Netease Inc., Fuxi AI Lab, Netease Inc., Zhejiang University, Fuxi AI Lab, Netease Inc., Fuxi AI Lab, Netease Inc., Zhejiang University
Abstract:
Large-scale vision-language pre-training has achieved significant performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. The models cannot make a distinction between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning multi-modal representations. In this paper, we present an end-to-end framework, Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. Firstly, we use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations. Moreover, a Knowledge-Enhanced Encoder (KEE) is proposed to leverage SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with the aforementioned approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets, leading the multi-modal SOTA model by 12.5% and 4.1%, respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances the structured representations while maintaining the ability of general representations. Our code is available at https://github.com/zjukg/Structure-CLIP.
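The scene-graph-guided negative construction can be pictured as swapping subject and object in a relation triple and treating the swapped caption as a hard negative in the contrastive loss. The sketch below is a simplified illustration under that assumption, with random embeddings in place of real CLIP encoders; it is not the paper's full objective.

```python
import torch
import torch.nn.functional as F

def swap_triple(subject: str, relation: str, obj: str) -> tuple:
    """Builds a caption and its scene-graph-guided hard negative by swapping subject and
    object, e.g. "an astronaut rides a horse" vs "a horse rides an astronaut"."""
    return f"{subject} {relation} {obj}", f"{obj} {relation} {subject}"

def contrastive_with_hard_negative(img_embeds, txt_pos, txt_neg, tau=0.07):
    """Image-to-text InfoNCE where each image's swapped-caption embedding is appended
    as one extra hard negative (a simplification of the full training objective)."""
    img_embeds, txt_pos, txt_neg = (F.normalize(x, dim=-1) for x in (img_embeds, txt_pos, txt_neg))
    hard = (img_embeds * txt_neg).sum(-1, keepdim=True)          # similarity to the swapped caption
    logits = torch.cat([img_embeds @ txt_pos.t(), hard], dim=1) / tau
    return F.cross_entropy(logits, torch.arange(img_embeds.shape[0]))

pos_caption, neg_caption = swap_triple("an astronaut", "rides", "a horse")
loss = contrastive_with_hard_negative(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```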



Paperid:268
Authors:Yuhao Huang, Sanping Zhou, Junjie Zhang, Jinpeng Dong, Nanning Zheng
Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
Efficient representation of point clouds is fundamental for LiDAR-based 3D object detection. While recent grid-based detectors often encode point clouds into either voxels or pillars, the distinctions between these approaches remain underexplored. In this paper, we quantify the differences between the current encoding paradigms and highlight the limited vertical learning within. To tackle these limitations, we propose a hybrid detection framework named Voxel-Pillar Fusion (VPF), which synergistically combines the unique strengths of both voxels and pillars. To be concrete, we first develop a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively, and then introduce the Sparse Fusion Layer (SFL), facilitating bidirectional interaction between sparse voxel and pillar features. Our computationally efficient, fully sparse method can be seamlessly integrated into both dense and sparse detectors. Leveraging this powerful yet straightforward representation, VPF delivers competitive performance, achieving real-time inference speeds on the nuScenes and Waymo Open Dataset.



Paperid:269
Authors:Tran Huynh, Dang Nguyen, Tung Pham, Anh Tran
VinAI Research, University of Maryland, VinAI Research, VinAI Research
Abstract:
Backdoor attacks pose a critical concern to the practice of using third-party data for AI development. The data can be poisoned to make a trained model misbehave when a predefined trigger pattern appears, granting the attackers illegal benefits. While most proposed backdoor attacks are dirty-label, clean-label attacks are more desirable by keeping data labels unchanged to dodge human inspection. However, designing a working clean-label attack is a challenging task, and existing clean-label attacks show underwhelming performance. In this paper, we propose a novel mechanism to develop clean-label attacks with outstanding attack performance. The key component is a trigger pattern generator, which is trained together with a surrogate model in an alternating manner. Our proposed mechanism is flexible and customizable, allowing different backdoor trigger types and behaviors for either single or multiple target labels. Our backdoor attacks can reach near-perfect attack success rates and bypass all state-of-the-art backdoor defenses, as illustrated via comprehensive experiments on standard benchmark datasets. Our code is available at https://github.com/VinAIResearch/COMBAT.



Paperid:270
Authors:Junha Hyung, Jaeyo Shin, Jaegul Choo
KAIST, Sogang University, KAIST
Abstract:
Large-scale text-to-image models including Stable Diffusion are capable of generating high-fidelity photorealistic portrait images. There is an active research area dedicated to personalizing these models, aiming to synthesize specific subjects or styles using provided sets of reference images. However, despite the plausible results from these personalization methods, they tend to produce images that often fall short of realism and are not yet on a commercially viable level. This is particularly noticeable in portrait image generation, where any unnatural artifact in human faces is easily discernible due to our inherent human bias. To address this, we introduce MagiCapture, a personalization method for integrating subject and style concepts to generate high-resolution portrait images using just a few subject and style references. For instance, given a handful of random selfies, our fine-tuned model can generate high-quality portrait images in specific styles, such as passport or profile photos. The main challenge with this task is the absence of ground truth for the composed concepts, leading to a reduction in the quality of the final output and an identity shift of the source subject. To address these issues, we present a novel Attention Refocusing loss coupled with auxiliary priors, both of which facilitate robust learning within this weakly supervised learning setting. Our pipeline also includes additional post-processing steps to ensure the creation of highly realistic outputs. MagiCapture outperforms other baselines in both quantitative and qualitative evaluations and can also be generalized to other non-human objects.



Paperid:271
Authors:Jinhyeok Jang, Chan-Hyun Youn, Minsu Jeon, Changha Lee
KAIST, KAIST, KAIST, KAIST
Abstract:
Recent significant advancements in diffusion models have revolutionized image generation, enabling the synthesis of highly realistic images with text-based guidance. These breakthroughs have paved the way for constructing datasets via generative artificial intelligence (AI), offering immense potential for various applications. However, two critical challenges hinder the widespread adoption of synthesized data: computational cost and the generation of peculiar images. While computational costs have improved through various approaches, the issue of peculiar image generation remains relatively unexplored. Existing solutions rely on heuristics, extra training, or AI-based post-processing to mitigate this problem. In this paper, we present a novel approach to address both issues simultaneously. We establish that both gradient descent and diffusion sampling are specific cases of the generalized expectation-maximization algorithm. We hypothesize and empirically demonstrate that peculiar image generation is akin to the local minima problem in optimization. Inspired by optimization techniques, we apply naive momentum and positive-negative momentum to diffusion sampling. Last, we propose new metrics to evaluate peculiarity. Experimental results show that momentum effectively prevents peculiar image generation without extra computation.
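One way to picture "momentum in diffusion sampling" is to wrap a generic per-step sampler update with a heavy-ball term. The sketch below is only an illustrative loop under that assumption; the `denoise_step` callable is a placeholder, and the paper's actual naive and positive-negative momentum formulations may differ.

```python
import torch

@torch.no_grad()
def momentum_sampling(denoise_step, x_T, num_steps=50, beta=0.9):
    """Wraps a generic per-step sampler update with heavy-ball ("naive") momentum.

    `denoise_step(x, t)` is a placeholder for whatever sampler update is in use
    (e.g. a DDIM step); with beta = 0 the loop reduces to the plain sampler."""
    x, velocity = x_T, torch.zeros_like(x_T)
    for t in reversed(range(num_steps)):
        update = denoise_step(x, t) - x           # the sampler's proposed move
        velocity = beta * velocity + update       # accumulate momentum on the move
        x = x + velocity
    return x

# Toy usage with a placeholder denoiser that shrinks the sample toward zero.
sample = momentum_sampling(lambda x, t: 0.9 * x, torch.randn(1, 3, 32, 32))
```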



Paperid:272
Authors:Joonhyun Jeong, Geondo Park, Jayeon Yoo, Hyungsik Jung, Heesu Kim
NAVER Cloud, ImageVision, Korea Advanced Institute of Science and Technology, Seoul National University, NAVER Cloud, ImageVision, NAVER Cloud, ImageVision
Abstract:
Open-vocabulary object detection (OVOD) aims to recognize novel objects whose categories are not included in the training set. In order to classify these unseen classes during training, many OVOD frameworks leverage the zero-shot capability of large pretrained vision and language models, such as CLIP. To further improve generalization on the unseen novel classes, several approaches proposed to additionally train with pseudo region labeling on external data sources that contain a substantial number of novel category labels beyond the existing training data. Despite its simplicity, these pseudo-labeling methods still exhibit limited improvement with regard to the truly unseen novel classes that were not pseudo-labeled. In this paper, we present a novel, yet simple technique that helps generalization on the overall distribution of novel classes. Inspired by our observation that numerous novel classes reside within the convex hull constructed by the base (seen) classes in the CLIP embedding space, we propose to synthesize proxy-novel classes approximating novel classes via linear mixup between a pair of base classes. By training our detector with these synthetic proxy-novel classes, we effectively explore the embedding space of novel classes. The experimental results on various OVOD benchmarks such as LVIS and COCO demonstrate superior performance on novel classes compared to the other state-of-the-art methods. Code is available at https://github.com/clovaai/ProxyDet.
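The core mixup operation can be sketched in a few lines: blend two base-class embeddings and renormalize onto the unit sphere used by CLIP-style classifiers. The Beta-distributed mixing coefficient and the random base embeddings below are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def make_proxy_novel(base_embeds: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Mixes a random pair of base-class embeddings into one proxy-novel class embedding."""
    i, j = torch.randperm(base_embeds.shape[0])[:2]
    lam = torch.distributions.Beta(alpha, alpha).sample()
    proxy = lam * base_embeds[i] + (1.0 - lam) * base_embeds[j]
    return F.normalize(proxy, dim=-1)   # stay on the unit sphere used by CLIP-style classifiers

base = F.normalize(torch.randn(80, 512), dim=-1)   # e.g. 80 base-class text embeddings (placeholder)
proxy_novel = make_proxy_novel(base)
```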



Paperid:273
Authors:Liya Ji, Zhefan Rao, Sinno Jialin Pan, Chenyang Lei, Qifeng Chen
Hong Kong University of Science and Technology, Hong Kong University of Science and Technology, The Chinese University of Hong Kong, CAIR, HKISI-CAS, Hong Kong University of Science and Technology
Abstract:
Solving the task of inverse imaging problems can restore unknown clean images from input measurements that have incomplete information. Utilizing powerful generative models, such as denoising diffusion models, could better tackle the ill-posed issues of inverse problems with the distribution prior of the unknown clean images. We propose a learnable state-estimator-based diffusion model to incorporate the measurements into the reconstruction process. Our method makes efficient use of pre-trained diffusion models and is computationally feasible compared to conditional diffusion models, which need to be trained from scratch. In addition, our pipeline does not require explicit knowledge of the image degradation operator or make assumptions about its form, unlike many other works that use pre-trained diffusion models at test time. Experiments on three typical inverse imaging problems (both linear and non-linear), namely inpainting, deblurring, and JPEG compression restoration, show results comparable to state-of-the-art methods.



Paperid:274
Authors:Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Mengmeng Wang, Jingdong Wang
School of Computer Science and Technology, MOEKLINNS Lab, Xi’an Jiaotong University, School of Computer Science and Technology, MOEKLINNS Lab, Xi’an Jiaotong University, School of Computer Science and Technology, MOEKLINNS Lab, Xi’an Jiaotong University, SGIT AI Lab State Grid Corporation of China, University of Technology Sydney Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Zhejiang University SGIT AI Lab, Baidu Inc
Abstract:
Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for conditional control in the generative process, leading to insufficient spatial and semantic controllability of individual instances. To address these limitations, we propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance. Owing to rich spatial and semantic information encapsulated in well-designed feature maps, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Additionally, we propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms. The former aims to model the relationships among multiple objects within scenes while the latter is designed to heighten the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments demonstrate that SSMG achieves highly promising results, setting a new state-of-the-art across a range of metrics encompassing fidelity, diversity, and controllability.



Paperid:275
Authors:Chaoya Jiang, Wei Ye, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Shikun Zhang
Peking University, Peking University, Alibaba Group, Alibaba Group, Alibaba Group, Alibaba Group, Peking University
Abstract:
Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noises in web-harvested text-image pairs, however, scaling up training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios. Our code is available on https://github.com/chaoyajiang/TiMiX/tree/main.
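To make the "mixing inside contrastive learning" idea concrete, the sketch below uses a plain mixup-style blend of whole images and weights the image-to-text contrastive targets by the same coefficient. This is a simplified variant for illustration only; TiMix itself performs text-aware, patch-level mixing, and the encoder and embeddings here are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mixup_contrastive_loss(img_encoder, txt_embeds, images, lam=0.6, tau=0.07):
    """Mixes each image with a shuffled partner and weights the image-to-text
    contrastive targets by the mixing coefficient (a simplified, mixup-style variant)."""
    perm = torch.randperm(images.shape[0])
    mixed = lam * images + (1.0 - lam) * images[perm]
    img_embeds = F.normalize(img_encoder(mixed), dim=-1)
    logits = img_embeds @ F.normalize(txt_embeds, dim=-1).t() / tau
    labels = torch.arange(images.shape[0])
    return lam * F.cross_entropy(logits, labels) + (1.0 - lam) * F.cross_entropy(logits, labels[perm])

# Toy usage with a placeholder image encoder and random "text" embeddings.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
loss = mixup_contrastive_loss(encoder, torch.randn(8, 256), torch.randn(8, 3, 32, 32))
```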



Paperid:276
Authors:Chenyi Jiang, Haofeng Zhang
Nanjing University of Science and Technology, Nanjing University of Science and Technology
Abstract:
Compositional Zero-Shot Learning (CZSL) aims to transfer knowledge from seen state-object pairs to novel unseen pairs. In this process, visual bias caused by the diverse interrelationship of state-object combinations blurs their visual features, hindering the learning of distinguishable class prototypes. Prevailing methods concentrate on disentangling states and objects directly from visual features, disregarding potential enhancements that could arise from a data viewpoint. Experimentally, we unveil that the results caused by the above problem closely approximate a long-tailed distribution. As a solution, we transform CZSL into a proximate class imbalance problem. We mathematically deduce the role of the class prior within the long-tailed distribution in CZSL. Building upon this insight, we incorporate the visual bias caused by compositions into the classifier's training and inference by estimating it as a proximate class prior. This enhancement encourages the classifier to acquire more discernible class prototypes for each composition, thereby achieving more balanced predictions. Experimental results demonstrate that our approach elevates the model's performance to the state-of-the-art level, without introducing additional parameters.
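Incorporating a class prior into training typically takes the form of logit-adjusted cross-entropy. The sketch below shows that standard mechanism with a uniform placeholder prior; the paper's estimated visual-bias prior and exact adjustment may differ.

```python
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, targets, class_prior, tau=1.0):
    """Cross-entropy on prior-adjusted logits: rarer compositions receive a relative
    boost at training time, mirroring long-tailed logit-adjustment methods."""
    adjusted = logits + tau * torch.log(class_prior + 1e-12)
    return F.cross_entropy(adjusted, targets)

logits = torch.randn(4, 100)                       # scores over 100 state-object compositions
prior = torch.full((100,), 1.0 / 100)              # placeholder for the estimated class prior
loss = logit_adjusted_loss(logits, torch.randint(0, 100, (4,)), prior)
```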



Paperid:277
Authors:Guangfeng Jiang, Jun Liu, Yuzhi Wu, Wenlong Liao, Tao He, Pai Peng
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, Shanghai Jiao Tong University, Cowarobot, Cowarobot
Abstract:
Instance segmentation is a fundamental research topic in computer vision, especially in autonomous driving. However, manual mask annotation for instance segmentation is quite time-consuming and costly. To address this problem, some prior works attempt to apply a weakly supervised manner by exploring 2D or 3D boxes. However, no one has ever successfully segmented 2D and 3D instances simultaneously by only using 2D box annotations, which could further reduce the annotation cost by an order of magnitude. Thus, we propose a novel framework called Multimodal Weakly Supervised Instance Segmentation (MWSIS), which incorporates various fine-grained label correction modules for both 2D and 3D modalities, along with a new multimodal cross-supervision approach. In the 2D pseudo label generation branch, the Instance-based Pseudo Mask Generation (IPG) module utilizes predictions for self-supervised correction. Similarly, in the 3D pseudo label generation branch, the Spatial-based Pseudo Label Generation (SPG) module generates pseudo labels by incorporating the spatial prior information of the point cloud. To further refine the generated pseudo labels, the Point-based Voting Label Correction (PVC) module utilizes historical predictions for correction. Additionally, a Ring Segment-based Label Correction (RSC) module is proposed to refine the predictions by leveraging the depth prior information from the point cloud. Finally, the Consistency Sparse Cross-modal Supervision (CSCS) module reduces the inconsistency of multimodal predictions by response distillation. Particularly, transferring the 3D backbone to downstream tasks not only improves the performance of the 3D detectors, but also outperforms fully supervised instance segmentation with only 5% fully supervised annotations. On the Waymo dataset, the proposed framework demonstrates significant improvements over the baseline, especially achieving 2.59% mAP and 12.75% mAP increases for 2D and 3D instance segmentation tasks, respectively. The code is available at https://github.com/jiangxb98/mwsis-plugin.



Paperid:278
Authors:Hao Jiang, Yang Yizhang, Yadong Mu
Peking University, Peking University, Peking University
Abstract:
Video moment localization stands as a crucial task within the realm of computer vision, entailing the identification of temporal moments in untrimmed videos that bear semantic relevance to the supplied natural language queries. This work delves into a relatively unexplored facet of the task: the transferability of video moment localization models. This concern is addressed by evaluating moment localization models within a crossdomain transfer setting. In this setup, we curate multiple datasets distinguished by substantial domain gaps. The model undergoes training on one of these datasets, while validation and testing are executed using the remaining datasets. To confront the challenges inherent in this scenario, we draw inspiration from the recently introduced large-scale pre-trained vision-language models. Our focus is on exploring how the strategic utilization of these resources can bolster the capabilities of a model designed for video moment localization. Nevertheless, the distribution of language queries in video moment localization usually diverges from the text used by pre-trained models, exhibiting distinctions in aspects such as length, content, expression, and more. To mitigate the gap, this work proposes a Moment-Guided Query Prompting (MGQP) method for video moment localization. Our key idea is to generate multiple distinct and complementary prompt primitives through stratification of the original queries. Our approach is comprised of a prompt primitive constructor, a multimodal prompt refiner, and a holistic prompt incorporator. We carry out extensive experiments on Charades-STA, TACoS, DiDeMo, and YouCookII datasets, and investigate the efficacy of the proposed method using various pre-trained models, such as CLIP, ActionCLIP, CLIP4Clip, and VideoCLIP. The experimental results demonstrate the effectiveness of our proposed method.



Paperid:279
Authors:Shijian Jiang, Qi Ye, Rengan Xie, Yuchi Huo, Xiang Li, Yang Zhou, Jiming Chen
College of Control Science and Engineering, Zhejiang University, College of Control Science and Engineering, Zhejiang University Key Lab of CS&AUS of Zhejiang Province, State Key Lab of CAD&CG, Zhejiang University, State Key Lab of CAD&CG, Zhejiang University Zhejiang Lab, OPPO US Research Center, OPPO US Research Center, College of Control Science and Engineering, Zhejiang University
Abstract:
Our work aims to reconstruct a 3D object that is held and rotated by a hand in front of a static RGB camera. Previous methods that use implicit neural representations to recover the geometry of a generic handheld object from multi-view images achieved compelling results in the visible part of the object. However, these methods falter in accurately capturing the shape within the hand-object contact region due to occlusion. In this paper, we propose a novel method that deals with surface reconstruction under occlusion by incorporating priors of 2D occlusion elucidation and physical contact constraints. For the former, we introduce an object amodal completion network to infer the 2D complete mask of objects under occlusion. To ensure the accuracy and view consistency of the predicted 2D amodal masks, we devise a joint optimization method for both amodal mask refinement and 3D reconstruction. For the latter, we impose penetration and attraction constraints on the local geometry in contact regions. We evaluate our approach on HO3D and HOD datasets and demonstrate that it outperforms the state-of-the-art methods in terms of reconstruction surface quality, with an improvement of 52% on HO3D and 20% on HOD. Project webpage: https://east-j.github.io/ihor.



Paperid:280
Authors:Shiqi Jiang, Ning Li, Chen Shi, Liping Guo, Changbo Wang, Chenhui Li
East China Normal University, East China Normal University, East China Normal University, East China Normal University, East China Normal University, East China Normal University
Abstract:
The Aesthetics Assessment of Children's Paintings (AACP) is an important branch of image aesthetics assessment (IAA), playing a significant role in children's education. This task presents unique challenges, such as limited available data and the requirement for evaluation metrics from multiple perspectives. However, previous approaches have relied on training on large datasets and subsequently providing an aesthetics score to the image, which is not applicable to AACP. To solve this problem, we construct an aesthetics assessment dataset of children's paintings and a model based on self-supervised learning. 1) We build a novel dataset composed of two parts: the first part contains more than 20k unlabeled images of children's paintings; the second part contains 1.2k images of children's paintings, and each image contains eight attributes labeled by multiple design experts. 2) We design a pipeline that includes a feature extraction module, perception modules and a disentangled evaluation module. 3) We conduct both qualitative and quantitative experiments to compare our model's performance with five other methods using the AACP dataset. Our experiments reveal that our method can accurately capture aesthetic features and achieve state-of-the-art performance.



Paperid:281
Authors:Weibo Jiang, Weihong Ren, Jiandong Tian, Liangqiong Qu, Zhiyong Wang, Honghai Liu
State Key Laboratory of Robotics and System, School of Mechanical Engineering and Automation, Harbin Institute of Technology, Shenzhen, State Key Laboratory of Robotics and System, School of Mechanical Engineering and Automation, Harbin Institute of Technology, Shenzhen, State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Science, Department of Statistics and Actuarial Science, The University of Hong Kong, State Key Laboratory of Robotics and System, School of Mechanical Engineering and Automation, Harbin Institute of Technology, Shenzhen, State Key Laboratory of Robotics and System, School of Mechanical Engineering and Automation, Harbin Institute of Technology, Shenzhen
Abstract:
Human-Object Interaction (HOI) detection plays a vital role in scene understanding, which aims to predict the HOI triplet in the form of <human, object, action>. Existing methods mainly extract multi-modal features (e.g., appearance, object semantics, human pose) and then fuse them together to directly predict HOI triplets. However, most of these methods focus on seeking self-triplet aggregation, but ignore the potential cross-triplet dependencies, resulting in ambiguity of action prediction. In this work, we propose to explore Self- and Cross-Triplet Correlations (SCTC) for HOI detection. Specifically, we regard each triplet proposal as a graph where Human and Object represent nodes and Action indicates the edge, to aggregate self-triplet correlation. Also, we try to explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations. Besides, we leverage the CLIP model to assist our SCTC in obtaining interaction-aware features via knowledge distillation, which provides useful action clues for HOI detection. Extensive experiments on HICO-DET and V-COCO datasets verify the effectiveness of our proposed SCTC.



Paperid:282
Authors:Wenhui Jiang, Yibo Cheng, Linxin Liu, Yuming Fang, Yuxin Peng, Yang Liu
Jiangxi University of Finance and Economics, Jiangxi University of Finance and Economics, Jiangxi University of Finance and Economics, Jiangxi University of Finance and Economics, Peking University, Sany Heavy Industry Co., LTD
Abstract:
The grounding accuracy of existing video captioners is still behind expectation. The majority of existing methods perform grounded video captioning on sparse entity annotations, whereas the captioning accuracy often suffers from degenerated object appearances in the annotated area, such as motion blur and video defocus. Moreover, these methods seldom consider the complex interactions among entities. In this paper, we propose a comprehensive visual grounding network to improve video captioning, by explicitly linking the entities and actions to the visual clues across the video frames. Specifically, the network consists of spatial-temporal entity grounding and action grounding. The proposed entity grounding encourages the attention mechanism to focus on informative spatial areas across video frames, even though the entity is annotated in only one frame of a video. The action grounding dynamically associates the verbs with related subjects and the corresponding context, which keeps fine-grained spatial and temporal details for action prediction. Both entity grounding and action grounding are formulated as a unified task guided by a soft grounding supervision, which brings architecture simplification and improves training efficiency as well. We conduct extensive experiments on two challenging datasets, and demonstrate significant performance improvements of +2.3 CIDEr on ActivityNet-Entities and +2.2 CIDEr on MSR-VTT compared to the state-of-the-art.



Paperid:283
Authors:Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, Xiangyu Zhang
Beijing Insititute of Technnology, Megvii Technology, Megvii Technology, Beijing Institute of Technology, Megvii Technology, Megvii Technology, Beijing Insititute of Technnology, Megvii Technology
Abstract:
Recently, 3D object detection from surround-view images has made notable advancements with its low deployment cost. However, most works have primarily focused on close perception range while leaving long-range detection less explored. Expanding existing methods directly to cover long distances poses challenges such as heavy computation costs and unstable convergence. To address these limitations, this paper proposes a novel sparse query-based framework, dubbed Far3D. By utilizing high-quality 2D object priors, we generate 3D adaptive queries that complement the 3D global queries. To efficiently capture discriminative features across different views and scales for long-range objects, we introduce a perspective-aware aggregation module. Additionally, we propose a range-modulated 3D denoising approach to address query error propagation and mitigate convergence issues in long-range tasks. Significantly, Far3D demonstrates SoTA performance on the challenging Argoverse 2 dataset, covering a wide range of 150 meters, surpassing several LiDAR-based approaches. The code is available at https://github.com/megvii-research/Far3D.



Paperid:284
Authors:Xin Jiang, Hao Tang, Junyao Gao, Xiaoyu Du, Shengfeng He, Zechao Li
Nanjing University of Science and Technology, Nanjing University of Science and Technology, Tongji University, Nanjing University Of Science And Technology, Singapore Management University, Nanjing University of Science and Technology
Abstract:
Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes Subcategory-specific Vision Prompt (SsVP) and Discrepancy-aware Text Prompt (DaTP), which explicitly highlights the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.



Paperid:285
Authors:Yangbo Jiang, Zhiwei Jiang, Le Han, Zenan Huang, Nenggan Zheng
Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, Zhejiang, China College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China, Guangzhou Electronic Technology Co., Ltd., Chinese Academy of Sciences, GuangZhou, China, Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, Zhejiang, China College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China, Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, Zhejiang, China College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China, Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, Zhejiang, China College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, Zhejiang, China CCAI by MOE and Zhejiang Provincial Government(ZJU), Hangzhou, Zhejiang, China
Abstract:
Channel attention mechanisms endeavor to recalibrate channel weights to enhance the representation abilities of networks. However, mainstream methods often rely solely on global average pooling as the feature squeezer, which significantly limits the overall potential of models. In this paper, we investigate the statistical moments of feature maps within a neural network. Our findings highlight the critical role of high-order moments in enhancing model capacity. Consequently, we introduce a flexible and comprehensive mechanism termed Extensive Moment Aggregation (EMA) to capture the global spatial context. Building upon this mechanism, we propose the Moment Channel Attention (MCA) framework, which efficiently incorporates multiple levels of moment-based information while minimizing additional computation costs through our Cross Moment Convolution (CMC) module. The CMC module uses a channel-wise convolution layer to capture multi-order moment information as well as cross-channel features. The MCA block is designed to be lightweight and easily integrated into a variety of neural network architectures. Experimental results on classical image classification, object detection, and instance segmentation tasks demonstrate that our proposed method achieves state-of-the-art results, outperforming existing channel attention methods.
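The moment-based squeeze can be pictured by replacing global average pooling with the first few per-channel moments and mixing them with a small 1D convolution. The module below is a minimal sketch under those assumptions (mean, variance, skewness; sigmoid gating) and is not the paper's exact CMC design.

```python
import torch
import torch.nn as nn

class MomentChannelAttention(nn.Module):
    """Squeezes each channel with its first three moments and mixes them with a
    lightweight 1D convolution to produce per-channel attention weights."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.mix = nn.Conv1d(3, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                      # x: (B, C, H, W)
        flat = x.flatten(2)                                    # (B, C, H*W)
        mean = flat.mean(-1)
        var = flat.var(-1, unbiased=False)
        std = var.clamp_min(1e-6).sqrt()
        skew = (((flat - mean.unsqueeze(-1)) / std.unsqueeze(-1)) ** 3).mean(-1)
        moments = torch.stack([mean, var, skew], dim=1)        # (B, 3, C)
        weights = torch.sigmoid(self.mix(moments)).squeeze(1)  # (B, C)
        return x * weights.unsqueeze(-1).unsqueeze(-1)         # recalibrated feature map

out = MomentChannelAttention()(torch.randn(2, 64, 16, 16))
```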



Paperid:286
Authors:Zhiying Jiang, Xingyuan Li, Jinyuan Liu, Xin Fan, Risheng Liu
Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology
Abstract:
Image stitching seamlessly integrates images captured from varying perspectives into a single wide field-of-view image. Such integration not only broadens the captured scene but also augments holistic perception in computer vision applications. Given a pair of captured images, subtle perturbations and distortions which go unnoticed by the human visual system tend to attack the correspondence matching, impairing the performance of image stitching algorithms. In light of this challenge, this paper presents the first attempt to improve the robustness of image stitching against adversarial attacks. Specifically, we introduce a stitching-oriented attack (SoA), tailored to amplify the alignment loss within overlapping regions, thereby targeting the feature matching procedure. To establish an attack-resistant model, we delve into the robustness of the stitching architecture and develop an adaptive adversarial training (AAT) to balance attack resistance with stitching precision. In this way, we narrow the gap between routine adversarial training and benign models, ensuring resilience without quality compromise. Comprehensive evaluation across real-world and synthetic datasets validates the deterioration caused by SoA on stitching performance. Furthermore, AAT emerges as a more robust solution against adversarial perturbations, delivering superior stitching results. Code is available at: https://github.com/Jzy2017/TRIS.
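An attack that "amplifies the alignment loss within overlapping regions" follows the familiar projected-gradient-ascent pattern. The sketch below is a generic PGD loop under that assumption; the alignment loss and overlap mask are placeholders, and SoA's actual objective and constraints may differ.

```python
import torch

def stitching_attack(image, alignment_loss_fn, overlap_mask, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD-style sketch: ascend a (placeholder) alignment loss evaluated on the
    overlap region while keeping the perturbation inside an L-infinity ball."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = alignment_loss_fn(image + delta, overlap_mask)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()          # gradient ascent amplifies misalignment
            delta.clamp_(-eps, eps)
    return (image + delta.detach()).clamp(0, 1)

# Toy usage with a dummy alignment loss and a full-image overlap mask.
img = torch.rand(1, 3, 64, 64)
adv = stitching_attack(img, lambda x, m: (x * m).pow(2).mean(), torch.ones_like(img))
```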



Paperid:287
Authors:Yang Jiao, Zequn Jie, Shaoxiang Chen, Lechao Cheng, Jingjing Chen, Lin Ma, Yu-Gang Jiang
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University Shanghai Collaborative Innovation Center on Intelligent Visual Computing, Meituan, Meituan, Zhejiang Lab, Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University Shanghai Collaborative Innovation Center on Intelligent Visual Computing, Meituan, Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University Shanghai Collaborative Innovation Center on Intelligent Visual Computing
Abstract:
The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field. Under such a paradigm, accurate BEV representation construction relies on reliable depth estimation for multi-camera images. However, existing approaches exhaustively predict depths for every pixel without prioritizing objects, which are precisely the entities requiring detection in the 3D space. To this end, we propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector. First, a category-specific structural priors mining approach is proposed for enhancing the efficacy of monocular depth generation. Besides, a self-boosting learning strategy is further proposed to encourage the model to place more emphasis on challenging objects in computation-expensive temporal stereo matching. Together they provide advanced depth estimation results for high-quality BEV features construction, benefiting the ultimate 3D detection. The proposed method achieves state-of-the-art performances on the challenging nuScenes benchmark, and extensive experimental results demonstrate the effectiveness of our designs.



Paperid:288
Authors:Haibo Jin, Haoxuan Che, Yi Lin, Hao Chen
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Department of Computer Science and Engineering, Hong Kong University of Science and Technology Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology
Abstract:
Automatic medical report generation (MRG) is of great research value as it has the potential to relieve radiologists from the heavy burden of report writing. Despite recent advancements, accurate MRG remains challenging due to the need for precise clinical understanding and disease identification. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnosis unreliable. To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on an encoder-decoder architecture with an extra disease classification branch. When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process. To further improve the diagnostic accuracy, we design cross-modal feature enhancement, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging the knowledge from a pre-trained CLIP. Moreover, the disease imbalance issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, which overcomes the barrier of the text decoder's inability to manipulate disease distributions. Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains state-of-the-art clinical efficacy performance on both datasets.



Paperid:289
Authors:Jianlong Jin, Lei Shen, Ruixin Zhang, Chenglong Zhao, Ge Jin, Jingyun Zhang, Shouhong Ding, Yang Zhao, Wei Jia
Hefei University of Technology, China Youtu Lab, Tencent, Youtu Lab, Tencent, Youtu Lab, Tencent, Youtu Lab, Tencent, Youtu Lab, Tencent, Youtu Lab, Tencent, Youtu Lab, Tencent, Hefei University of Technology, China, Hefei University of Technology, China
Abstract:
The lack of large-scale data seriously hinders the development of palmprint recognition. Recent approaches address this issue by generating large-scale realistic pseudo palmprints from Bézier curves. However, the significant difference between Bézier curves and real palmprints limits their effectiveness. In this paper, we divide the Bézier-Real difference into crease and texture differences, thus reducing the generation difficulty. We introduce a new palm crease energy (PCE) domain as a bridge from Bézier curves to real palmprints and propose a two-stage generation model. The first stage generates PCE images (realistic creases) from Bézier curves, and the second stage outputs realistic palmprints (realistic texture) with PCE images as input. In addition, we also design a lightweight plug-and-play line feature enhancement block to facilitate domain transfer and improve recognition performance. Extensive experimental results demonstrate that the proposed method surpasses state-of-the-art methods. Under extremely few data settings like 40 IDs (only 2.5% of the total training set), our model achieves a 29% improvement over RPG-Palm and outperforms ArcFace with 100% training set by more than 6% in terms of TAR@FAR=1e-6.



Paperid:290
Authors:Xin Jin, Kai Liu, Cong Ma, Ruining Yang, Fei Hui, Wei Wu
Chang’an University SenseAuto Research, SenseAuto Research, SenseAuto Research, Chang’an University SenseAuto Research, Chang’an University, SenseAuto Research Tsinghua University
Abstract:
Lidar-based 3D detection is one of the significant components of autonomous driving. However, current methods over-focus on improving the performance of 3D Lidar perception, which causes network architectures to become complicated and hard to deploy. Thus, the methods are difficult to apply in autonomous driving for real-time processing. In this paper, we propose a high-efficiency network, SwiftPillars, which includes a Swift Pillar Encoder (SPE) and a Multi-scale Aggregation Decoder (MAD). The SPE is constructed from a concise Dual-attention Module with lightweight operators. The Dual-attention Module utilizes feature pooling, matrix multiplication, etc. to speed up point-wise and channel-wise attention extraction and fusion. The MAD interconnects multiple scale features extracted by the SPE with minimal computational cost to leverage performance. In our experiments, our proposal achieves 61.3% NDS and 53.2% mAP on the nuScenes dataset. In addition, we evaluate inference time on several platforms (P4, T4, A2, MLU370, RTX3080), where SwiftPillars achieves up to 13.3 ms (75 FPS) on NVIDIA Tesla T4. Compared with PointPillars, SwiftPillars is on average 26.58% faster in inference speed with equivalent GPUs and achieves a higher mAP of approximately 3.2% on the nuScenes dataset.



Paperid:291
Authors:Yeying Jin, Wei Ye, Wenhan Yang, Yuan Yuan, Robby T. Tan
National University of Singapore, HUAWEI INTERNATIONAL PTE LTD, Peng Cheng Laboratory, HUAWEI INTERNATIONAL PTE LTD, National University of Singapore
Abstract:
Removing soft and self shadows that lack clear boundaries from a single image is still challenging. Self shadows are shadows that are cast on the object itself. Most existing methods rely on binary shadow masks, without considering the ambiguous boundaries of soft and self shadows. In this paper, we present DeS3, a method that removes hard, soft and self shadows based on adaptive attention and ViT similarity. Our novel ViT similarity loss utilizes features extracted from a pretrained Vision Transformer. This loss helps guide the reverse sampling towards recovering scene structures. Our adaptive attention is able to differentiate shadow regions from the underlying objects, as well as shadow regions from the object casting the shadow. This capability enables DeS3 to better recover the structures of objects even when they are partially occluded by shadows. Different from existing methods that rely on constraints during the training phase, we incorporate the ViT similarity during the sampling stage. Our method outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR and UIUC datasets, removing hard, soft, and self shadows robustly. Specifically, our method outperforms the SOTA method by 16% of the RMSE of the whole image on the LRSS dataset.
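A feature-similarity guidance term of the kind described above can be sketched as a cosine distance between frozen ViT-style features of the restored image and the shadow input. The extractor below is a placeholder callable, not the actual pretrained Vision Transformer used by DeS3.

```python
import torch
import torch.nn.functional as F

def vit_similarity_loss(feature_extractor, restored, reference):
    """Encourages the restored (shadow-free) image to keep the scene structure of the
    shadow input by maximizing cosine similarity between patch features from a frozen
    ViT-style extractor (here a placeholder callable)."""
    f_restored = feature_extractor(restored)          # (B, num_tokens, D)
    f_reference = feature_extractor(reference)
    return 1.0 - F.cosine_similarity(f_restored, f_reference.detach(), dim=-1).mean()

# Toy usage: flatten spatial positions into "tokens" in place of real ViT features.
dummy_vit = lambda img: img.flatten(2).transpose(1, 2)
loss = vit_similarity_loss(dummy_vit, torch.rand(2, 3, 16, 16), torch.rand(2, 3, 16, 16))
```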



Paperid:292
Authors:Beibei Jing, Youjia Zhang, Zikai Song, Junqing Yu, Wei Yang
Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology
Abstract:
Generating realistic human motion sequences from text descriptions is a challenging task that requires capturing the rich expressiveness of both natural language and human motion. Recent advances in diffusion models have enabled significant progress in human motion synthesis. However, existing methods struggle to handle text inputs that describe complex or long motions. In this paper, we propose the Adaptable Motion Diffusion (AMD) model, which leverages a Large Language Model (LLM) to parse the input text into a sequence of concise and interpretable anatomical scripts that correspond to the target motion. This process exploits the LLM’s ability to provide anatomical guidance for complex motion synthesis. We then devise a two-branch fusion scheme that balances the influence of the input text and the anatomical scripts on the inverse diffusion process, which adaptively ensures the semantic fidelity and diversity of the synthesized motion. Our method can effectively handle texts with complex or long motion descriptions, where existing methods often fail. Experiments on datasets with relatively more complex motions, such as CLCD1 and CLCD2, demonstrate that AMD significantly outperforms existing state-of-the-art models.



Paperid:293
Authors:Chenchen Jing, Yukun Li, Hao Chen, Chunhua Shen
Zhejiang University, Northwestern Polytechnical University, Zhejiang University, Zhejiang University
Abstract:
Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning from seen compositions. Composing the learned knowledge of seen primitives, i.e., attributes or objects, into novel compositions is critical for CZSL. In this work, we propose to explicitly retrieve knowledge of seen primitives for compositional zero-shot learning. We present a retrieval-augmented method, which augments standard multi-path classification methods with two retrieval modules. Specifically, we construct two databases storing the attribute and object representations of training images, respectively. For an input training/testing image, we use the two retrieval modules to retrieve representations of training images with the same attribute and object, respectively. The primitive representations of the input image are augmented with the retrieved representations for composition recognition. By referencing semantically similar images, the proposed method is capable of recalling knowledge of seen primitives for compositional generalization. Experiments on three widely used datasets show the effectiveness of the proposed method.
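
To make the retrieval step above concrete, the following minimal sketch illustrates one plausible way such a retrieval module could work; the cosine-similarity metric, the top-k value, the mean pooling, and the fusion weight alpha are illustrative assumptions rather than the paper's exact design.

    import torch
    import torch.nn.functional as F

    def retrieve_and_augment(query_feat, database, top_k=5, alpha=0.5):
        # query_feat: (D,) primitive (attribute or object) representation of the input image
        # database:   (N, D) stored representations of training images sharing that primitive
        sims = F.cosine_similarity(query_feat.unsqueeze(0), database, dim=-1)   # (N,)
        idx = sims.topk(min(top_k, database.size(0))).indices
        retrieved = database[idx].mean(dim=0)                                    # pooled retrieved knowledge
        return alpha * query_feat + (1 - alpha) * retrieved                      # augmented primitive feature

    # hypothetical usage: separate databases for attributes and objects
    attr_db, obj_db = torch.randn(1000, 512), torch.randn(1000, 512)
    attr_aug = retrieve_and_augment(torch.randn(512), attr_db)
    obj_aug = retrieve_and_augment(torch.randn(512), obj_db)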



Paperid:294
Authors:Linglin Jing, Sheng Xu, Yifan Wang, Yuzhe Zhou, Tao Shen, Zhigang Ji, Hui Fang, Zhen Li, Siqi Sun
Shanghai Artificial Intelligence Laboratory Department of Computer Science, Loughborough University, Research Institute of Intelligent Complex Systems, Fudan University Shanghai Artificial Intelligence Laboratory, Department of Computer Science, Loughborough University, SSE & FNII, The Chinese University of Hong Kong (Shenzhen), Research Institute of Intelligent Complex Systems, Fudan University, Shanghai Jiao Tong University, Department of Computer Science, Loughborough University, SSE & FNII, The Chinese University of Hong Kong (Shenzhen), Research Institute of Intelligent Complex Systems, Fudan University Shanghai Artificial Intelligence Laboratory
Abstract:
Accurate identification of protein-nucleic acid binding residues poses a significant challenge with important implications for various biological processes and drug design. Many typical computational methods for protein analysis rely on a single model that could ignore either the semantic context of the protein or the global 3D geometric information. Consequently, these approaches may result in incomplete or inaccurate protein analysis. To address the above issue, in this paper, we present CrossBind, a novel collaborative cross-modal approach for identifying binding residues by exploiting both protein geometric structure and its sequence prior knowledge extracted from a large-scale protein language model. Specifically, our multi-modal approach leverages a contrastive learning technique and atom-wise attention to capture the positional relationships between atoms and residues, thereby incorporating fine-grained local geometric knowledge for better binding residue prediction. Extensive experimental results demonstrate that our approach outperforms the next-best state-of-the-art methods, GraphSite and GraphBind, on DNA and RNA datasets by 10.8/17.3% in terms of the harmonic mean of precision and recall (F1-score) and 11.9/24.8% in Matthews correlation coefficient (MCC), respectively. We release the code at https://github.com/BEAMLabs/CrossBind.



Paperid:295
Authors:Linglin Jing, Ying Xue, Xu Yan, Chaoda Zheng, Dong Wang, Ruimao Zhang, Zhigang Wang, Hui Fang, Bin Zhao, Zhen Li
Shanghai AI laboratory Department of Computer Science, Loughborough University, FNii, CUHK-Shenzhen SSE, CUHK-Shenzhen, FNii, CUHK-Shenzhen SSE, CUHK-Shenzhen, FNii, CUHK-Shenzhen SSE, CUHK-Shenzhen, Shanghai AI laboratory, SSE, CUHK-Shenzhen FNii, CUHK-Shenzhen, Shanghai AI laboratory, Department of Computer Science, Loughborough University, Shanghai AI laboratory, SSE, CUHK-Shenzhen FNii, CUHK-Shenzhen
Abstract:
The field of 4D point cloud understanding is rapidly developing with the goal of analyzing dynamic 3D point cloud sequences. However, it remains a challenging task due to the sparsity and lack of texture in point clouds. Moreover, the irregularity of point clouds poses a difficulty in aligning temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer. This framework enhances 4D-scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework is designed with a dual-branch architecture, consisting of a 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). The GIT combines visual texture and temporal correlation features to offer rich semantics and dynamics for better point cloud representation. During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation and semantic segmentation. The results achieve 1st place, i.e., 85.3% (+7.9%) accuracy and 47.3% (+5.0%) mIoU for 4D action segmentation and semantic segmentation, on the HOI4D challenge, outperforming previous state-of-the-art by a large margin. We release the code at https://github.com/jinglinglingling/X4D.



Paperid:296
Authors:Won Jo, Geuntaek Lim, Gwangjin Lee, Hyunwoo Kim, Byungsoo Ko, Yukyung Choi
Sejong University, Sejong University, Sejong University, Sejong University, NAVER Vision, Sejong University
Abstract:
In content-based video retrieval (CBVR), which deals with large-scale collections, efficiency is as important as accuracy; thus, several video-level feature-based studies have been actively conducted. Nevertheless, owing to the severe difficulty of embedding a lengthy and untrimmed video into a single feature, these studies have been insufficient for accurate retrieval compared to frame-level feature-based studies. In this paper, we show that appropriate suppression of irrelevant frames can provide insight into the current obstacles of the video-level approaches. Furthermore, we propose a Video-to-Video Suppression network (VVS) as a solution. VVS is an end-to-end framework that consists of an easy distractor elimination stage to identify which frames to remove and a suppression weight generation stage to determine the extent to which the remaining frames should be suppressed. This structure is intended to effectively describe an untrimmed video with varying content and meaningless information. Its efficacy is proved via extensive experiments, and we show that our approach is not only state-of-the-art among video-level approaches but also has a fast inference time despite possessing retrieval capabilities close to those of frame-level approaches. Code is available at https://github.com/sejong-rcv/VVS



Paperid:297
Authors:Sandesh Kamath, Sankalp Mittal, Amit Deshpande, Vineeth N Balasubramanian
Indian Institute of Technology, Hyderabad, Indian Institute of Technology, Hyderabad, Microsoft Research, Bengaluru, Indian Institute of Technology, Hyderabad
Abstract:
For machine learning models to be reliable and trustworthy, their decisions must be interpretable. As these models find increasing use in safety-critical applications, it is important that not just the model predictions but also their explanations (as feature attributions) be robust to small human-imperceptible input perturbations. Recent works have shown that many attribution methods are fragile and have proposed improvements in either these methods or the model training. We observe two main causes for fragile attributions: first, the existing metrics of robustness (e.g., top-k intersection) over-penalize even reasonable local shifts in attribution, thereby making random perturbations appear to be a strong attack, and second, the attribution can be concentrated in a small region even when there are multiple important parts in an image. To rectify this, we propose simple ways to strengthen existing metrics and attribution methods that incorporate the locality of pixels in robustness metrics and the diversity of pixel locations in attributions. Regarding the role of model training in attributional robustness, we empirically observe that adversarially trained models have more robust attributions on smaller datasets; however, this advantage disappears on larger datasets. Code is made available at https://github.com/ksandeshk/LENS.



Paperid:298
Authors:Zhehan Kan, Xueting Hu, Zihan Liao, Ke Yu, Zhihai He
Southern University of Science and Technology, Southern University of Science and Technology, Southern University of Science and Technology, Southern University of Science and Technology, Southern University of Science and Technology Pengcheng Laboratory, Shenzhen, China
Abstract:
Generalization is very important for pose estimation, especially for 3D pose estimation, where small changes in the 2D images could trigger structural changes in the 3D space. To achieve generalization, the system needs to have the capability of detecting estimation errors by double-checking the projection coherence between the 3D and 2D spaces and adapting its network inference process based on this feedback. Current pose estimation methods are one-time feed-forward and lack the capability to gather feedback and adapt the inference outcome. To address this problem, we propose to explore the concept of progressive inference, where the network learns an observer to continuously detect the prediction error based on constraint matching, as well as an adjuster to refine its inference outcome based on these constraint errors. Within the context of 3D hand pose estimation, we find that this observer-adjuster design is relatively unstable since the observer operates in the 2D image domain while the adjuster operates in the 3D domain. To address this issue, we propose to construct two sets of observers-adjusters with complementary constraints from different perspectives. They operate in a dynamic sequential manner controlled by a decision network to progressively improve the 3D pose estimation. We refer to this method as Cross-Constrained Progressive Inference (CCPI). Our extensive experimental results on the FreiHAND and HO-3D benchmark datasets demonstrate that the proposed CCPI method is able to significantly improve the generalization capability and performance of 3D hand pose estimation.



Paperid:299
Authors:Minsoo Kang, Minkoo Kang, Suhyun Kim
Korea Institute of Science and Technology Korea University, Korea Institute of Science and Technology, Korea Institute of Science and Technology
Abstract:
Deep learning has made significant advances in computer vision, particularly in image classification tasks. Despite their high accuracy on training data, deep learning models often face challenges related to complexity and overfitting. One notable concern is that a model often relies heavily on a limited subset of filters for making predictions. This dependency can result in compromised generalization and an increased vulnerability to minor variations. While regularization techniques like weight decay, dropout, and data augmentation are commonly used to address this issue, they may not directly tackle the reliance on specific filters. Our observations reveal that the heavy-reliance problem becomes severe when slow-learning filters are deprived of learning opportunities by fast-learning filters. Drawing inspiration from image augmentation research that combats over-reliance on specific image regions by removing and replacing parts of images, our idea is to mitigate the problem of over-reliance on strong filters by substituting highly activated features. To this end, we present a novel method called Catch-up Mix, which provides learning opportunities to a wide range of filters during training, focusing on filters that may lag behind. By mixing activation maps with relatively lower norms, Catch-up Mix promotes the development of more diverse representations and reduces reliance on a small subset of filters. Experimental results demonstrate the superiority of our method on various vision classification datasets, providing enhanced robustness.
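
As a rough illustration of the norm-based mixing idea, the sketch below builds a mixed feature by preferring, per channel, the activation map with the lower norm between two paired samples; the actual selection rule, label mixing, and the layer at which Catch-up Mix applies the mixing may differ.

    import torch

    def low_norm_preferring_mix(feats):
        # feats: (B, C, H, W) intermediate activation maps of a batch
        B, C, H, W = feats.shape
        perm = torch.randperm(B)
        partner = feats[perm]
        norms_a = feats.flatten(2).norm(dim=2)          # (B, C) per-channel activation norms
        norms_b = partner.flatten(2).norm(dim=2)
        take_a = (norms_a <= norms_b).float().view(B, C, 1, 1)
        mixed = take_a * feats + (1.0 - take_a) * partner
        lam = take_a.mean(dim=(1, 2, 3))                # fraction of channels kept from each original sample
        return mixed, perm, lam                         # lam could be used to weight the mixed labels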



Paperid:300
Authors:Seunggu Kang, WonJun Moon, Euiyeon Kim, Jae-Pil Heo
Sungkyunkwan University, Sungkyunkwan University, Sungkyunkwan University, Sungkyunkwan University
Abstract:
Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and counting. However, the sequentially designed two-stage process remains vulnerable to error propagation. In this work, we propose a one-stage baseline, Visual-Language Baseline (VLBase), exploring the implicit association of the semantic-patch embeddings of CLIP. Subsequently, we extend VLBase to Visual-Language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, we introduce Semantic-conditioned Prompt Tuning (SPT) within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to translate the semantic-patch similarity map so that it is appropriate for the counting task. Lastly, we transfer the layer-wise encoded features to the decoder through Segment-aware Skip Connection (SaSC) to keep the generalization capability for unseen classes. Through extensive experiments on FSC147, CARPK, and PUCPR+, we demonstrate the benefits of our end-to-end framework, VLCounter. Code is available at https://github.com/seunggu0305/VLCounter



Paperid:301
Authors:Xiao Ke, Huanqi Wu, Wenzhong Guo
Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China, Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China, Fujian Provincial Key Laboratory of Networking Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou 350116, China
Abstract:
Image hiding aims to conceal one or more secret images within a cover image of the same resolution. Due to strict capacity requirements, image hiding is commonly called large-capacity steganography. In this paper, we propose StegFormer, a novel autoencoder-based image-hiding model. StegFormer can conceal one or multiple secret images within a cover image of the same resolution while preserving the high visual quality of the stego image. In addition, to mitigate the limitations of current steganographic models in real-world scenarios, we propose a normalizing training strategy and a restrict loss to improve the reliability of steganographic models under realistic conditions. Furthermore, we propose an efficient steganographic capacity expansion method to increase the capacity of steganography and enhance the efficiency of secret communication. Through this approach, we can increase the relative payload of StegFormer to 96 bits per pixel without any training strategy modifications. Experiments demonstrate that our StegFormer outperforms existing state-of-the-art (SOTA) models. In the case of single-image steganography, there is an improvement of more than 3 dB and 5 dB in PSNR for secret/recovery image pairs and cover/stego image pairs.



Paperid:302
Authors:Bumsoo Kim, Jinhyung Kim, Yeonsik Jo, Seung Hwan Kim
LG AI Research, LG AI Research, LG AI Research, LG AI Research
Abstract:
Recent advances in vision-language pretraining (VLP) have been largely attributed to the large-scale data collected from the web. However, uncurated datasets contain weakly correlated image-text pairs, causing data inefficiency. To address the issue, knowledge distillation has been explored at the expense of extra image and text momentum encoders to generate teaching signals for misaligned image-text pairs. In this paper, our goal is to resolve the misalignment problem with an efficient distillation framework. To this end, we propose ECLIPSE: Expediting Contrastive Language-Image Pretraining with Self-distilled Encoders. ECLIPSE features a distinctive distillation architecture wherein a shared text encoder is utilized between an online image encoder and a momentum image encoder. This strategic design choice enables the distillation to operate within a unified projected space of text embeddings, resulting in better performance. Based on the unified text embedding space, ECLIPSE compensates for the additional computational cost of the momentum image encoder by expediting the online image encoder. Through our extensive experiments, we validate that there is a sweet spot between expedition and distillation where the partial view from the expedited online image encoder interacts complementarily with the momentum teacher. As a result, ECLIPSE outperforms its counterparts while achieving substantial acceleration in inference speed.



Paperid:303
Authors:Dongseob Kim, Seungho Lee, Junsuk Choe, Hyunjung Shim
Yonsei University, Yonsei University, Sogang University, Korea Advanced Institute of Science & Technology
Abstract:
State-of-the-art techniques in weakly-supervised semantic segmentation (WSSS) using image-level labels exhibit severe performance degradation on driving scene datasets such as Cityscapes. To address this challenge, we develop a new WSSS framework tailored to driving scene datasets. Based on extensive analysis of dataset characteristics, we employ Contrastive Language-Image Pre-training (CLIP) as our baseline to obtain pseudo-masks. However, CLIP introduces two key challenges: (1) pseudo-masks from CLIP struggle to represent small object classes, and (2) these masks contain notable noise. We propose solutions for each issue as follows. (1) We devise Global-Local View Training, which seamlessly incorporates small-scale patches during model training, thereby enhancing the model's capability to handle small-sized yet critical objects in driving scenes (e.g., traffic lights). (2) We introduce Consistency-Aware Region Balancing (CARB), a novel technique that discerns reliable and noisy regions by evaluating the consistency between CLIP masks and segmentation predictions. It prioritizes reliable pixels over noisy pixels via adaptive loss weighting. Notably, the proposed method achieves 51.8% mIoU on the Cityscapes test dataset, showcasing its potential as a strong WSSS baseline on driving scene datasets. Experimental results on CamVid and WildDash2 demonstrate the effectiveness of our method across diverse datasets, even with small-scale datasets or visually challenging conditions. The code is available at https://github.com/k0u-id/CARB.



Paperid:304
Authors:GeonU Kim, Kim Youwang, Tae-Hyun Oh
Grad. School of Artificial Intelligence, POSTECH, Dept. of Electrical Engineering, POSTECH, Dept. of Electrical Engineering, POSTECH Grad. School of Artificial Intelligence, POSTECH Institute for Convergence Research and Education in Advanced Technology, Yonsei University
Abstract:
We present FPRF, a feed-forward photorealistic style transfer method for large-scale 3D neural radiance fields. FPRF stylizes large-scale 3D scenes with arbitrary, multiple style reference images without additional optimization while preserving multi-view appearance consistency. Prior arts required tedious per-style/-scene optimization and were limited to small-scale 3D scenes. FPRF efficiently stylizes large-scale 3D scenes by introducing a style-decomposed 3D neural radiance field, which inherits AdaIN’s feed-forward stylization machinery, supporting arbitrary style reference images. Furthermore, FPRF supports multi-reference stylization with semantic correspondence matching and local AdaIN, which adds diverse user control over 3D scene styles. FPRF also preserves multi-view consistency by applying the semantic matching and style transfer processes directly onto queried features in 3D space. In experiments, we demonstrate that FPRF achieves favorable photorealistic-quality 3D scene stylization for large-scale scenes with diverse reference images.



Paperid:305
Authors:Ji-Hoon Kim, Jaehun Kim, Joon Son Chung
Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
Abstract:
The goal of this work is to reconstruct high-quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, resulting in mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves the generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to better solve the aforementioned problem, we employ a flow-based post-net which captures and refines the details of the generated speech. We perform extensive experiments on two datasets, and demonstrate that our method achieves generation quality close to that of real human utterances, outperforming existing methods in terms of speech naturalness and intelligibility by a large margin. Synthesised samples are available at our demo page: https://mm.kaist.ac.kr/projects/LTBS.



Paperid:306
Authors:Jiyoung Kim, Kyuhong Shim, Insu Lee, Byonghyo Shim
Seoul National University, Seoul National University, Seoul National University, Seoul National University
Abstract:
Unsupervised semantic segmentation (USS) aims to discover and recognize meaningful categories without any labels. For successful USS, two key abilities are required: 1) information compression and 2) clustering capability. Previous methods have relied on feature dimension reduction for information compression; however, this approach may hinder the process of clustering. In this paper, we propose a novel USS framework called Expand-and-Quantize Unsupervised Semantic Segmentation (EQUSS), which combines the benefits of high-dimensional spaces for better clustering and product quantization for effective information compression. Our extensive experiments demonstrate that EQUSS achieves state-of-the-art results on three standard benchmarks. In addition, we analyze the entropy of USS features, which is the first step towards understanding USS from the perspective of information theory.
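
The information-compression half of the pipeline relies on product quantization; the snippet below is a generic, self-contained product-quantization step rather than EQUSS itself, and the group count and codebook sizes are placeholders.

    import torch

    def product_quantize(feats, codebooks):
        # feats:     (N, D) high-dimensional per-pixel features, with D divisible by len(codebooks)
        # codebooks: list of G tensors, each (K, D // G), one sub-codebook per feature group
        chunks = feats.chunk(len(codebooks), dim=1)
        quantized, codes = [], []
        for sub, cb in zip(chunks, codebooks):
            idx = torch.cdist(sub, cb).argmin(dim=1)    # nearest codeword per group
            quantized.append(cb[idx])
            codes.append(idx)
        return torch.cat(quantized, dim=1), torch.stack(codes, dim=1)

    # hypothetical usage: 512-d features, 8 groups of 64 dims, 32 codewords each
    feats = torch.randn(100, 512)
    books = [torch.randn(32, 64) for _ in range(8)]
    q_feats, q_codes = product_quantize(feats, books)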



Paperid:307
Authors:Seoha Kim, Jeongmin Bae, Youngsik Yun, Hahyun Lee, Gun Bang, Youngjung Uh
Yonsei University, Yonsei University, Yonsei University, Electronics and Telecommunications Research Institute, Electronics and Telecommunications Research Institute, Yonsei University
Abstract:
Recent advancements in 4D scene reconstruction using neural radiance fields (NeRF) have demonstrated the ability to represent dynamic scenes from multi-view videos. However, they fail to reconstruct dynamic scenes and struggle to fit even the training views in unsynchronized settings. This happens because they employ a single latent embedding for a frame, while the multi-view images at the same frame were actually captured at different moments. To address this limitation, we introduce time offsets for individual unsynchronized videos and jointly optimize the offsets with NeRF. By design, our method is applicable to various baselines and improves them by large margins. Furthermore, finding the offsets naturally synchronizes the videos without manual effort. Experiments are conducted on the common Plenoptic Video Dataset and a newly built Unsynchronized Dynamic Blender Dataset to verify the performance of our method. Project page: https://seoha-kim.github.io/sync-nerf
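
A minimal sketch of the per-video learnable time offsets described above follows, assuming a generic dynamic NeRF conditioned on a continuous time value; the NeRF itself, the ray inputs, and the loss in the usage comment are placeholders.

    import torch
    import torch.nn as nn

    class TimeOffsets(nn.Module):
        """One learnable temporal offset per unsynchronized camera/video."""
        def __init__(self, num_cameras):
            super().__init__()
            self.offsets = nn.Parameter(torch.zeros(num_cameras))

        def forward(self, cam_ids, frame_times):
            # shift each camera's timestamps; optimized jointly with the NeRF photometric loss
            return frame_times + self.offsets[cam_ids]

    # hypothetical training step:
    # t = time_offsets(cam_ids, frame_times)
    # rgb_pred = dynamic_nerf(rays, t)
    # loss = ((rgb_pred - rgb_gt) ** 2).mean()   # gradients reach both the NeRF and the offsets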



Paperid:308
Authors:Seongyeop Kim, Hyung-Il Kim, Yong Man Ro
Integrated Vision Language Lab., KAIST, South Korea, ETRI, South Korea, Integrated Vision Language Lab., KAIST, South Korea
Abstract:
Open Set Recognition (OSR) poses significant challenges in distinguishing known from unknown classes. In OSR, the overconfidence problem has become a persistent obstacle, where visual recognition models often misclassify unknown objects as known objects with high confidence. This issue stems from the fact that visual recognition models often lack the integration of common-sense knowledge, a feature that is naturally present in language-based models but lacking in visual recognition systems. In this paper, we propose a novel approach to enhance OSR performance by distilling common-sense knowledge into visual prompts. Utilizing text prompts that embody common-sense knowledge about known classes, the proposed visual prompt is learned by extracting semantic common-sense features and aligning them with image features from visual recognition models. The unique aspect of this work is the training of individual visual prompts for each class to encapsulate this common-sense knowledge. Our methodology is model-agnostic, capable of enhancing OSR across various visual recognition models, and computationally light as it focuses solely on training the visual prompts. This research introduces a method for addressing OSR, aiming at a more systematic integration of visual recognition systems with common-sense knowledge. The obtained results indicate an enhancement in recognition accuracy, suggesting the applicability of this approach in practical settings.



Paperid:309
Authors:Sunoh Kim, Jungchan Cho, Joonsang Yu, YoungJoon Yoo, Jin Young Choi
Seoul National University, Gachon University, NAVER CLOVA, NAVER CLOVA, Seoul National University
Abstract:
In the weakly supervised temporal video grounding study, previous methods use predetermined single-Gaussian proposals, which lack the ability to express diverse events described by the sentence query. To enhance the expression ability of a proposal, we propose a Gaussian mixture proposal (GMP) that can depict arbitrary shapes by learning the importance, centroid, and range of every Gaussian in the mixture. In learning GMP, each Gaussian is not trained in a feature space but is implemented over a temporal location. Thus, the conventional feature-based learning for Gaussian mixture models is not valid in our case. In our special setting, to learn a moderately coupled Gaussian mixture capturing diverse events, we newly propose a pull-push learning scheme using pulling and pushing losses, each of which plays an opposite role to the other. The effects of the components in our scheme are verified in depth with extensive ablation studies, and the overall scheme achieves state-of-the-art performance. Our code is available at https://github.com/sunoh-kim/pps.
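
The sketch below evaluates a Gaussian mixture proposal directly over temporal locations from predicted importance, centroid, and width parameters; the softmax normalization of the importances and the final rescaling are assumptions made for illustration.

    import torch

    def gaussian_mixture_proposal(importance, centers, widths, num_frames):
        # importance: (K,) mixing weights, centers: (K,) centroids in [0, 1], widths: (K,) positive stds
        t = torch.linspace(0.0, 1.0, num_frames).unsqueeze(0)              # (1, T) relative temporal grid
        w = torch.softmax(importance, dim=0).unsqueeze(1)                   # (K, 1)
        g = torch.exp(-0.5 * ((t - centers.unsqueeze(1)) / widths.unsqueeze(1)) ** 2)
        proposal = (w * g).sum(dim=0)                                        # (T,) arbitrary-shaped proposal curve
        return proposal / (proposal.max() + 1e-8)

    # hypothetical usage: three Gaussians over a 100-frame video
    p = gaussian_mixture_proposal(torch.randn(3), torch.tensor([0.2, 0.5, 0.8]),
                                  torch.tensor([0.05, 0.10, 0.05]), num_frames=100)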



Paperid:310
Authors:Florian Kluger, Bodo Rosenhahn
Leibniz University Hannover, Leibniz University Hannover
Abstract:
We present a real-time method for robust estimation of multiple instances of geometric models from noisy data. Geometric models such as vanishing points, planar homographies or fundamental matrices are essential for 3D scene analysis. Previous approaches discover distinct model instances in an iterative manner, thus limiting their potential for speedup via parallel computation. In contrast, our method detects all model instances independently and in parallel. A neural network segments the input data into clusters representing potential model instances by predicting multiple sets of sample and inlier weights. Using the predicted weights, we determine the model parameters for each potential instance separately in a RANSAC-like fashion. We train the neural network via task-specific loss functions, i.e., we do not require a ground-truth segmentation of the input data. As suitable training data for homography and fundamental matrix fitting is scarce, we additionally present two new synthetic datasets. We demonstrate state-of-the-art performance on these as well as multiple established datasets, with inference times as small as five milliseconds per image.



Paperid:311
Authors:Dimitrios Kollias, Viktoriia Sharmanska, Stefanos Zafeiriou
Queen Mary University of London, University of Sussex Imperial College London, Imperial College London
Abstract:
Multi-Task Learning (MTL) is a framework where multiple related tasks are learned jointly and benefit from a shared representation space, or parameter transfer. To provide sufficient learning support, modern MTL uses annotated data with full, or sufficiently large, overlap across tasks, i.e., each input sample is annotated for all, or most of, the tasks. However, collecting such annotations is prohibitive in many real applications and cannot benefit from datasets available for individual tasks. In this work, we challenge this setup and show that MTL can be successful with classification tasks with little, or non-overlapping, annotations, or when there is a big discrepancy in the size of labeled data per task. We explore task-relatedness for co-annotation and co-training, and propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching. To demonstrate the general applicability of our method, we conducted diverse case studies in the domains of affective computing, face recognition, species recognition, and shopping item classification using nine datasets. Our large-scale study of affective tasks for basic expression recognition and facial action unit detection illustrates that our approach is network agnostic and brings large performance improvements compared to the state-of-the-art in both tasks and across all studied databases. In all case studies, we show that co-training via task-relatedness is advantageous and prevents negative transfer (which occurs when the MTL model's performance is worse than that of at least one single-task model).



Paperid:312
Authors:Xiaoyu Kong, Yongyong Chen, Feng Zheng, Zhenyu He
Harbin Institute of Technology (Shenzhen), Harbin Institute of Technology (Shenzhen), Southern University of Science and Technology, Harbin Institute of Technology (Shenzhen)
Abstract:
Block image compressive sensing methods, which divide a single image into small blocks for efficient sampling and reconstruction, have achieved significant success. However, these methods process each block locally and thus disregard the global communication among different blocks in the reconstruction step. Existing methods have attempted to address this issue with local filters or by directly reconstructing the entire image, but they have only achieved insufficient communication among adjacent pixels or bypassed the problem. To directly confront the communication problem among blocks and effectively resolve it, we propose a novel approach called Block Reconstruction with Blocks' Communication Network (BRBCN). BRBCN focuses on both local and global information, while further taking their interactions into account. Specifically, BRBCN comprises dual CNN and Transformer architectures, in which the CNN is used to reconstruct each block for powerful local processing and the Transformer is used to calculate the global communication among all the blocks. Moreover, we propose a global-to-local module (G2L) and a local-to-global module (L2G) to effectively integrate the representations of the CNN and Transformer, with which our BRBCN network realizes the bidirectional interaction between local and global information. Extensive experiments show our BRBCN method outperforms existing state-of-the-art methods by a large margin. The code is available at https://github.com/kongxiuxiu/BRBCN



Paperid:313
Authors:Yogesh Kumar, Saswat Mallick, Anand Mishra, Sowmya Rasipuram, Anutosh Maitra, Roshni Ramnani
Indian Institute of Technology, Jodhpur, Indian Institute of Technology, Jodhpur, Indian Institute of Technology, Jodhpur, Accenture Technology Labs, Accenture Technology Labs, Accenture Technology Labs
Abstract:
In this work, we study the one-shot video object localization problem, which aims to localize instances of unseen objects in a target video using a single query image of the object. Toward addressing this challenging problem, we extend a popular and successful object detection method, namely DETR (Detection Transformer), and introduce a novel approach: a query-guided detection transformer for videos (QDETRv). A distinctive feature of QDETRv is its capacity to exploit information from the query image and the spatio-temporal context of the target video, which significantly aids in precisely pinpointing the desired object in the video. We incorporate cross-attention mechanisms that capture temporal relationships across adjacent frames to handle the dynamic context in videos effectively. Further, to ensure strong initialization for QDETRv, we also introduce a novel unsupervised pretraining technique tailored to videos. This involves training our model on synthetic object trajectories with an objective analogous to the query-guided localization task. During this pretraining phase, we incorporate recurrent object queries and loss functions that encourage accurate patch feature reconstruction. These additions enable better temporal understanding and robust representation learning. Our experiments show that the proposed model significantly outperforms the competitive baselines on two public benchmarks, VidOR and ImageNet-VidVRD, extended for one-shot open-set localization tasks.



Paperid:314
Authors:Nilakshan Kunananthaseelan, Jing Zhang, Mehrtash Harandi
Monash University, Australian National University, Monash University
Abstract:
We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks. By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder, eliminating the need to modify or add to the model's parameters. Due to this design choice, our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained. We empirically demonstrate that, compared to prior art, grounding visual prompts with language enhances both the accuracy and speed of adaptation. Moreover, our algorithm excels in base-to-novel class generalization, overcoming limitations of visual prompting and exhibiting the capacity to generalize beyond seen classes. We thoroughly assess and evaluate our method across a variety of image recognition datasets, such as EuroSAT, UCF101, DTD, and CLEVR, spanning different learning situations, including few-shot adaptation, base-to-novel class generalization, and transfer learning.



Paperid:315
Authors:Chengen Lai, Shengli Song, Shiqi Meng, Jingyang Li, Sitong Yan, Guangneng Hu
Xidian University, Xidian University, Xidian University, Xidian University, Xidian University, Xidian University
Abstract:
Natural language explanation in visual question answering (VQA-NLE) aims to explain the decision-making process of models by generating natural language sentences to increase users' trust in black-box systems. Existing post-hoc methods have achieved significant progress in obtaining plausible explanations. However, such post-hoc explanations are not always aligned with human logical inference, suffering from the following issues: 1) deductive unsatisfiability, where the generated explanations do not logically lead to the answer; 2) factual inconsistency, where the model falsifies its counterfactual explanation for answers without considering the facts in images; and 3) semantic perturbation insensitivity, where the model cannot recognize the semantic changes caused by small perturbations. These problems reduce the faithfulness of explanations generated by models. To address the above issues, we propose a novel self-supervised Multi-level Contrastive Learning based natural language Explanation model (MCLE) for VQA with semantic-level, image-level, and instance-level factual and counterfactual samples. MCLE extracts discriminative features and aligns the feature spaces from explanations with visual question and answer to generate more consistent explanations. We conduct extensive experiments, ablation analysis, and a case study to demonstrate the effectiveness of our method on two VQA-NLE benchmarks.



Paperid:316
Authors:Jinxiang Lai, Wenlong Wu, Bin-Bin Gao, Jun Liu, Jiawei Zhan, Congchong Nie, Yi Zeng, Chengjie Wang
Tencent, Tencent, Tencent, Tencent, Tencent, Tencent, Tencent, Tencent Shanghai Jiao Tong University
Abstract:
Image matching and object detection are two fundamental and challenging tasks, yet many related applications treat them as two individual tasks (i.e., task-individual). In this paper, a collaborative framework called MatchDet (i.e., task-collaborative) is proposed for image matching and object detection to obtain mutual improvements. To achieve collaborative learning of the two tasks, we propose three novel modules, including a Weighted Spatial Attention Module (WSAM) for the Detector, and a Weighted Attention Module (WAM) and Box Filter for the Matcher. Specifically, the WSAM highlights the foreground regions of the target image to benefit the subsequent detector, the WAM enhances the connection between the foreground regions of paired images to ensure high-quality matches, and the Box Filter mitigates the impact of false matches. We evaluate the approaches on a new benchmark with two datasets called Warp-COCO and miniScanNet. Experimental results show our approaches are effective and achieve competitive improvements.



Paperid:317
Authors:Danning Lao, Qi Liu, Jiazi Bu, Junchi Yan, Wei Shen
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
As computer vision continues to advance and finds widespread applications across various domains, the need for interpretability in deep learning models becomes paramount. Existing methods often resort to post-hoc techniques or prototypes to explain the decision-making process, which can be indirect and lack intrinsic illustration. In this research, we introduce ViTree, a novel approach for fine-grained visual categorization that combines the popular vision transformer as a feature extraction backbone with neural decision trees. By traversing the tree paths, ViTree effectively selects patches from transformer-processed features to highlight informative local regions, thereby refining representations in a step-wise manner. Unlike previous tree-based models that rely on soft distributions or ensembles of paths, ViTree selects a single tree path, offering a clearer and simpler decision-making process. This patch and path selectivity enhances the model interpretability of ViTree, enabling better insights into the model's inner workings. Remarkably, extensive experimentation validates that this streamlined approach surpasses various strong competitors and achieves state-of-the-art performance while maintaining exceptional interpretability, which is proved by multi-perspective methods. Code can be found at https://github.com/SJTU-DeepVisionLab/ViTree.



Paperid:318
Authors:Minh-Quan Le, Tam V. Nguyen, Trung-Nghia Le, Thanh-Toan Do, Minh N. Do, Minh-Triet Tran
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam Vietnam National University, Ho Chi Minh City, Vietnam Stony Brook University, United States, University of Dayton, United States, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam Vietnam National University, Ho Chi Minh City, Vietnam, Monash University, Australia, University of Illinois at Urbana-Champaign, United States, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam Vietnam National University, Ho Chi Minh City, Vietnam
Abstract:
Few-shot instance segmentation extends the few-shot learning paradigm to the instance segmentation task, which tries to segment instance objects from a query image with a few annotated examples of novel categories. Conventional approaches have attempted to address the task via prototype learning, known as point estimation. However, this mechanism depends on prototypes (e.g., the mean of K shots) for prediction, leading to performance instability. To overcome the disadvantage of the point estimation mechanism, we propose a novel approach, dubbed MaskDiff, which models the underlying conditional distribution of a binary mask, conditioned on an object region and K-shot information. Inspired by augmentation approaches that perturb data with Gaussian noise for populating low data density regions, we model the mask distribution with a diffusion probabilistic model. We also propose to utilize classifier-free guided mask sampling to integrate category information into the binary mask generation process. Without bells and whistles, our proposed method consistently outperforms state-of-the-art methods on both base and novel classes of the COCO dataset while simultaneously being more stable than existing methods. The source code is available at: https://github.com/minhquanlecs/MaskDiff.



Paperid:319
Authors:Chanho Lee, Jinsu Son, Hyounguk Shon, Yunho Jeon, Junmo Kim
KAIST, KAIST, KAIST, Hanbat University, KAIST
Abstract:
Rotation-equivariance is an essential yet challenging property in oriented object detection. While general object detectors naturally leverage robustness to spatial shifts due to the translation-equivariance of conventional CNNs, achieving rotation-equivariance remains an elusive goal. Current detectors deploy various alignment techniques to derive rotation-invariant features, but they still rely on high-capacity models and heavy data augmentation with all possible rotations. In this paper, we introduce a Fully Rotation-Equivariant Oriented Object Detector (FRED), whose entire process from the image to the bounding box prediction is strictly equivariant. Specifically, we decouple the invariant task (object classification) and the equivariant task (object localization) to achieve end-to-end equivariance. We represent the bounding box as a set of rotation-equivariant vectors to implement rotation-equivariant localization. Moreover, we utilize these rotation-equivariant vectors as offsets in the deformable convolution, thereby enhancing the existing advantages of spatial adaptation. Leveraging full rotation-equivariance, our FRED demonstrates higher robustness to image-level rotation compared to existing methods. Furthermore, we show that FRED is one step closer to non-axis-aligned learning through our experiments. Compared to state-of-the-art methods, our proposed method delivers comparable performance on DOTA-v1.0 and outperforms them by 1.5 mAP on DOTA-v1.5, all while significantly reducing the model parameters to 16%.



Paperid:320
Authors:Ingyun Lee, Wooju Lee, Hyun Myung
Korea Advanced Institute of Science and Technology (KAIST), Korea Advanced Institute of Science and Technology (KAIST), Korea Advanced Institute of Science and Technology (KAIST)
Abstract:
Deep neural networks have shown remarkable performance in image classification. However, their performance significantly deteriorates with corrupted input data. Domain generalization methods have been proposed to train robust models against out-of-distribution data. Data augmentation in the frequency domain is one such approach that enables a model to learn phase features to establish domain-invariant representations. This approach changes the amplitudes of the input data while preserving the phases. However, using fixed phases leads to susceptibility to phase fluctuations, because amplitude and phase fluctuations commonly occur in out-of-distribution data. In this study, to address this problem, we introduce an approach using finite variation of the phases of input data rather than maintaining fixed phases. Based on the assumption that the degree of domain-invariant features varies for each phase, we propose a method to distinguish phases based on this degree. In addition, we propose a method called vital phase augmentation (VIPAug) that applies the variation to the phases differently according to the degree of domain-invariant features of the given phases. The model relies more on the vital phases that contain more domain-invariant features to attain robustness to amplitude and phase fluctuations. We present experimental evaluations of our proposed approach, which exhibits improved performance for both clean and corrupted data. VIPAug achieved SOTA performance on the benchmark CIFAR-10 and CIFAR-100 datasets, as well as near-SOTA performance on the ImageNet-100 and ImageNet datasets. Our code is available at https://github.com/excitedkid/vipaug.
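
The following sketch shows the general flavor of frequency-domain augmentation that keeps amplitudes and varies phases; the boolean vital_mask and the two noise strengths are placeholders, since how VIPAug actually identifies vital phases and scales the variation is not reproduced here.

    import numpy as np

    def phase_perturb(img, vital_mask, sigma_vital=0.05, sigma_rest=0.5):
        # img:        (H, W) grayscale image (apply per channel for RGB)
        # vital_mask: (H, W) boolean mask of frequencies treated as vital
        spec = np.fft.fft2(img)
        amp, phase = np.abs(spec), np.angle(spec)
        noise = np.where(vital_mask,
                         np.random.normal(0.0, sigma_vital, phase.shape),   # vary vital phases only slightly
                         np.random.normal(0.0, sigma_rest, phase.shape))    # vary the remaining phases more
        aug = np.real(np.fft.ifft2(amp * np.exp(1j * (phase + noise))))
        return np.clip(aug, 0, 255)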



Paperid:321
Authors:Jae Young Lee, Woonghyun Ka, Jaehyun Choi, Junmo Kim
KAIST, Hyundai Motor Company, KAIST, KAIST
Abstract:
We propose a novel stereo-confidence measure that can be computed externally to various stereo-matching networks, offering an alternative input modality to the cost volume for learning-based approaches, especially in safety-critical systems. Grounded in the foundational concepts of the disparity definition and the disparity plane sweep, the proposed stereo-confidence method is built upon the idea that any shift in a stereo-image pair should produce a corresponding shift in the disparity map. Based on this idea, the proposed stereo-confidence method can be summarized in three steps. 1) Using the disparity plane sweep, multiple disparity maps can be obtained and treated as a 3D volume (the predicted disparity volume), analogous to how the cost volume is constructed. 2) One of these disparity maps serves as an anchor, allowing us to define a desirable (or ideal) disparity profile at every spatial point. 3) By comparing the desirable and predicted disparity profiles, we can quantify the level of matching ambiguity between the left and right images for confidence measurement. Extensive experimental results using various stereo-matching networks and datasets demonstrate that the proposed stereo-confidence method not only shows competitive performance on its own but also yields consistent performance improvements when it is used as an input modality for learning-based stereo-confidence methods.
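
A schematic sketch of the plane-sweep consistency check follows; stereo_net stands in for any stereo-matching network, the shift range is arbitrary, the wrap-around of np.roll is ignored, and the sign of the expected disparity change as well as the final exponential mapping to a confidence value are assumptions.

    import numpy as np

    def sweep_confidence(stereo_net, left, right, shifts=range(-3, 4)):
        anchor = stereo_net(left, right)                       # (H, W) disparity map at zero shift
        errors = []
        for s in shifts:
            shifted = np.roll(right, s, axis=1)                # horizontally shift the right image by s pixels
            pred = stereo_net(left, shifted)                   # one slice of the predicted disparity volume
            desirable = anchor - s                             # ideal profile; sign depends on the disparity convention
            errors.append(np.abs(pred - desirable))
        ambiguity = np.mean(np.stack(errors, axis=0), axis=0)  # (H, W) matching ambiguity per pixel
        return np.exp(-ambiguity)                              # larger value = more confident match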



Paperid:322
Authors:JongMin Lee, Yohann Cabon, Romain Brégier, Sungjoo Yoo, Jerome Revaud
Seoul National University, Naver Labs Europe, Naver Labs Europe, Seoul National University, Naver Labs Europe
Abstract:
Existing learning-based methods for object pose estimation in RGB images are mostly model-specific or category-based. They lack the capability to generalize to new object categories at test time, which severely hinders their practicability and scalability. Notably, recent attempts have been made to solve this issue, but they still require accurate 3D data of the object surface at both train and test time. In this paper, we introduce a novel approach that can estimate, in a single forward pass, the pose of objects never seen during training, given minimal input. In contrast to existing state-of-the-art approaches, which rely on task-specific modules, our proposed model is entirely based on a transformer architecture, which can benefit from recently proposed 3D-geometry general pretraining. We conduct extensive experiments and report state-of-the-art one-shot performance on the challenging LINEMOD benchmark. Finally, extensive ablations allow us to determine good practices with this relatively new type of architecture in the field.



Paperid:323
Authors:MinKyu Lee, Jae-Pil Heo
Sungkyunkwan University, Sungkyunkwan University
Abstract:
Recent deep-learning-based single image super-resolution (SISR) methods have shown impressive performance, where typical methods train their networks by minimizing the pixel-wise distance with respect to a given high-resolution (HR) image. However, despite this basic training scheme being the predominant choice, its use in the context of ill-posed inverse problems has not been thoroughly investigated. In this work, we aim to provide a better comprehension of the underlying constituents by decomposing target HR images into two subcomponents: (1) the optimal centroid, which is the expectation over multiple potential HR images, and (2) the inherent noise, defined as the residual between the HR image and the centroid. Our findings show that the current training scheme cannot capture the ill-posed nature of SISR and becomes vulnerable to the inherent noise term, especially during early training steps. To tackle this issue, we propose a novel optimization method that can effectively remove the inherent noise term in the early steps of vanilla training by estimating the optimal centroid and directly optimizing toward the estimate. Experimental results show that the proposed method can effectively enhance the stability of vanilla training, leading to an overall performance gain. Codes are available at github.com/2minkyulee/ECO.



Paperid:324
Authors:Seokjun Lee, Seung-Won Jung, Hyunseok Seo
Korea Institute of Science and Technology Korea University, Korea University, Korea Institute of Science and Technology
Abstract:
Image generation and synthesis have progressed remarkably with generative models. Despite photorealistic results, intrinsic discrepancies are still observed in the frequency domain. The spectral discrepancy appears not only in generative adversarial networks but also in diffusion models. In this study, we propose a framework to effectively mitigate the disparity in the frequency domain of generated images to improve the generative performance of both GANs and diffusion models. This is realized by spectrum translation for the refinement of image generation (STIG), based on contrastive learning. We build on the theoretical analysis of frequency components in various generative networks. The key idea here is to refine the spectrum of the generated image via the concept of image-to-image translation and contrastive learning in terms of digital signal processing. We evaluate our framework across eight fake-image datasets and various cutting-edge models to demonstrate the effectiveness of STIG. Our framework outperforms other cutting-edge methods, showing significant decreases in FID and in the log frequency distance of the spectrum. We further emphasize that STIG improves image quality by decreasing the spectral anomaly. Additionally, validation results show that a frequency-based deepfake detector is confused more when fake spectra are manipulated by STIG.
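
A bare-bones sketch of refining a generated image in the frequency domain is given below, assuming a generic translator network operating on log-magnitude spectra; the actual STIG architecture and its contrastive objective are not reproduced.

    import torch

    def refine_spectrum(img, translator):
        # img:        (B, C, H, W) generated images
        # translator: any network mapping log-magnitude spectra to refined spectra
        spec = torch.fft.fft2(img)
        log_mag = torch.log1p(spec.abs())                       # compress the dynamic range
        phase = torch.angle(spec)
        refined_mag = torch.expm1(translator(log_mag)).clamp(min=0)
        refined = torch.fft.ifft2(refined_mag * torch.exp(1j * phase)).real
        return refined                                          # image with a corrected spectrum, phase preserved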



Paperid:325
Authors:SeokYeong Lee, JunYong Choi, Seungryong Kim, Ig-Jae Kim, Junghyun Cho
Korea Institute of Science and Technology, Seoul Korea University, Seoul, Korea Institute of Science and Technology, Seoul Korea University, Seoul, Korea University, Seoul, Korea Institute of Science and Technology, Seoul AI-Robotics, KIST School, University of Science and Technology Yonsei-KIST Convergence Research Institute, Yonsei University, Korea Institute of Science and Technology, Seoul AI-Robotics, KIST School, University of Science and Technology Yonsei-KIST Convergence Research Institute, Yonsei University
Abstract:
In this paper, we introduce a new challenge for synthesizing novel-view images in practical environments with limited input multi-view images and varying lighting conditions. Neural radiance fields (NeRF), one of the pioneering works for this task, demand an extensive set of multi-view images taken under constrained illumination, which is often unattainable in real-world settings. While some previous works have managed to synthesize novel views given images with different illumination, their performance still relies on a substantial number of input multi-view images. To address this problem, we suggest ExtremeNeRF, which utilizes multi-view albedo consistency, supported by geometric alignment. Specifically, we extract intrinsic image components that should be illumination-invariant across different views, enabling direct appearance comparison between the input and novel views under unconstrained illumination. We offer thorough experimental results for task evaluation, employing the newly created NeRF Extreme benchmark, the first in-the-wild benchmark for novel view synthesis under multiple viewing directions and varying illuminations.



Paperid:326
Authors:Wooju Lee, Dasol Hong, Hyungtae Lim, Hyun Myung
Urban Robotics Lab, School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea, Urban Robotics Lab, School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea, Urban Robotics Lab, School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea, Urban Robotics Lab, School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea
Abstract:
Single-domain generalization (S-DG) aims to generalize a model to unseen environments with a single-source domain. However, most S-DG approaches have been conducted in the field of classification. When these approaches are applied to object detection, the semantic features of some objects can be damaged, which can lead to imprecise object localization and misclassification. To address these problems, we propose an object-aware domain generalization (OA-DG) method for single-domain generalization in object detection. Our method consists of a data augmentation and a training strategy, called OA-Mix and OA-Loss, respectively. OA-Mix generates multi-domain data with a multi-level transformation and an object-aware mixing strategy. OA-Loss enables models to learn domain-invariant representations for objects and backgrounds from the original and OA-Mixed images. Our proposed method outperforms state-of-the-art works on standard benchmarks. Our code is available at https://github.com/WoojuLee24/OA-DG.



Paperid:327
Authors:Saebom Leem, Hyunseok Seo
Korea Institute of Science and Technology Sogang University, Korea Institute of Science and Technology
Abstract:
The Vision Transformer (ViT) is one of the most widely used models in the computer vision field, with great performance on various tasks. In order to fully utilize ViT-based architectures in various applications, proper visualization methods with decent localization performance are necessary, but the methods employed in CNN-based models are not directly applicable to ViT due to its unique structure. In this work, we propose an attention-guided visualization method for ViT that provides a high-level semantic explanation for its decisions. Our method selectively aggregates the gradients directly propagated from the classification output to each self-attention, collecting the contribution of image features extracted from each location of the input image. These gradients are additionally guided by the normalized self-attention scores, which are the pairwise patch correlation scores. They are used to supplement the gradients with the patch-level context information efficiently detected by the self-attention mechanism. Our method thus provides elaborate high-level semantic explanations with great localization performance using only the class labels. As a result, our method outperforms the previous leading explainability methods for ViT in the weakly-supervised localization task and shows great capability in capturing the full instances of the target class object. Meanwhile, our method provides a visualization that faithfully explains the model, as demonstrated in the perturbation comparison test.
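
The sketch below illustrates the spirit of combining class-output gradients with normalized self-attention scores to obtain a patch-level relevance map; collecting the attention maps and their gradients (e.g., via forward/backward hooks) is assumed to be done beforehand, and the exact aggregation rule of the paper may differ.

    import torch

    def attention_guided_relevance(attn_maps, attn_grads, num_patches):
        # attn_maps:  list of (heads, N, N) self-attention matrices (N = tokens incl. CLS)
        # attn_grads: list of (heads, N, N) gradients of the class logit w.r.t. each attention map
        relevance = torch.zeros(attn_maps[0].shape[-1])
        for a, g in zip(attn_maps, attn_grads):
            guided = (g.clamp(min=0) * a).mean(dim=0)   # gradient contribution guided by attention scores
            relevance += guided[0]                       # CLS row: contribution of every token to the class
        patch_rel = relevance[1:1 + num_patches]         # drop the CLS token itself
        side = int(num_patches ** 0.5)                   # assumes a square patch grid
        return patch_rel.reshape(side, side)             # coarse localization map over patches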



Paperid:328
Authors:Johannes Lehner, Benedikt Alkin, Andreas Fürst, Elisabeth Rumetshofer, Lukas Miklautz, Sepp Hochreiter
ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning Johannes Kepler University, Linz, Austria, ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning Johannes Kepler University, Linz, Austria, ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning Johannes Kepler University, Linz, Austria, ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning Johannes Kepler University, Linz, Austria, Faculty of Computer Science, University of Vienna, Vienna, Austria UniVie Doctoral School Computer Science, University of Vienna, ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning Johannes Kepler University, Linz, Austria Institute of Advanced Research in Artificial Intelligence (IARAI)
Abstract:
Masked Image Modeling (MIM) methods, like Masked Autoencoders (MAE), efficiently learn a rich representation of the input. However, for adapting to downstream tasks, they require a sufficient amount of labeled data since their rich features code not only objects but also less relevant image background. In contrast, Instance Discrimination (ID) methods focus on objects. In this work, we study how to combine the efficiency and scalability of MIM with the ability of ID to perform downstream classification in the absence of large amounts of labeled data. To this end, we introduce Masked Autoencoder Contrastive Tuning (MAE-CT), a sequential approach that utilizes the implicit clustering of the Nearest Neighbor Contrastive Learning (NNCLR) objective to induce abstraction in the topmost layers of a pre-trained MAE. MAE-CT tunes the rich features such that they form semantic clusters of objects without using any labels. Notably, MAE-CT does not rely on hand-crafted augmentations and frequently achieves its best performances while using only minimal augmentations (crop & flip). Further, MAE-CT is compute efficient as it requires at most 10% overhead compared to MAE re-training. Applied to large and huge Vision Transformer (ViT) models, MAE-CT excels over previous self-supervised methods trained on ImageNet in linear probing, k-NN and low-shot classification accuracy as well as in unsupervised clustering accuracy. With ViT-H/16 MAE-CT achieves a new state-of-the-art in linear probing of 82.2%. Project page: github.com/ml-jku/MAE-CT.



Paperid:329
Authors:Qinqian Lei, Bo Wang, Robby T. Tan
National University of Singapore, CtrsVision, National University of Singapore
Abstract:
Detecting human-object interactions (HOI) in a few-shot setting remains a challenge. Existing meta-learning methods struggle to extract representative features for classification due to the limited data, while existing few-shot HOI models rely on HOI text labels for classification. Moreover, some query images may display visual similarity to those outside their class, such as similar backgrounds between different HOI classes. This makes learning more challenging, especially with limited samples. Bongard-HOI epitomizes this HOI few-shot problem, making it the benchmark we focus on in this paper. In our proposed method, we introduce novel label-uncertain query augmentation techniques to enhance the diversity of the query inputs, aiming to distinguish the positive HOI class from the negative ones. As these augmented inputs may or may not have the same class label as the original inputs, their class label is unknown. Those belonging to a different class become hard samples due to their visual similarity to the original ones. Additionally, we introduce a novel pseudo-label generation technique that enables a mean teacher model to learn from the augmented label-uncertain inputs. We propose to augment the negative support set for the student model to enrich the semantic information, fostering diversity that challenges and enhances the student’s learning. Experimental results demonstrate that our method sets a new state-of-the-art (SOTA) performance by achieving 68.74% accuracy on the Bongard-HOI benchmark, a significant improvement over the existing SOTA of 66.59%. In our evaluation on HICO-FS, a more general few-shot recognition dataset, our method achieves 73.27% accuracy, outperforming the previous SOTA of 71.20% in the 5-way 5-shot task.



Paperid:330
Authors:Yicheng Leng, Chaowei Fang, Gen Li, Yixiang Fang, Guanbin Li
School of Artificial Intelligence, Xidian University, Xi’an, China School of Data Science, The Chinese University of Hong Kong, Shenzhen, China, School of Artificial Intelligence, Xidian University, Xi’an, China, School of Artificial Intelligence, Xidian University, Xi’an, China Afirstsoft, Shenzhen, China, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China, School of Computer Science and Engineering, Research Institute of Sun Yat-sen University in Shenzhen, Sun Yat-sen University, Guangzhou, China GuangDong Province Key Laboratory of Information Security Technology
Abstract:
Visible watermarks, while instrumental in protecting image copyrights, frequently distort the underlying content, complicating tasks like scene interpretation and image editing. Visible watermark removal aims to eliminate the interference of watermarks and restore the background content. However, existing methods often implement watermark component removal and background restoration tasks within a singular branch, leading to residual watermarks in the predictions and ignoring cases where watermarks heavily obscure the background. To address these limitations, this study introduces the Removing Interference and Recovering Content Imaginatively (RIRCI) framework. RIRCI embodies a two-stage approach: the initial phase centers on discerning and segregating the watermark component, while the subsequent phase focuses on background content restoration. To achieve meticulous background restoration, our proposed model employs a dual-path network capable of fully exploring the intrinsic background information beneath semi-transparent watermarks and peripheral contextual information from unaffected regions. Moreover, a Global and Local Context Interaction module is built upon multi-layer perceptrons and bidirectional feature transformation for comprehensive representation modeling in the background restoration phase. The efficacy of our approach is empirically validated across two large-scale datasets, and our findings reveal a marked enhancement over existing watermark removal techniques.



Paperid:331
Authors:Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski
The Hebrew University of Jerusalem, Israel, OriginAI, Israel, OriginAI, Israel, The Hebrew University of Jerusalem, Israel
Abstract:
The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller compared to other vision and language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset which is ten times larger than existing ones. Pre-training on our LaSCo shows a noteworthy improvement in performance, even in zero-shot. Furthermore, we propose a new approach for analyzing CoIR datasets and methods, which detects modality redundancy or necessity, in queries. We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline allows for early fusion of modalities using a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks like FashionIQ and CIRR.



Paperid:332
Authors:Bao Li, Zhenyu Liu, Lizhi Shao, Bensheng Qiu, Hong Bu, Jie Tian
Center for Biomedical Imaging, University of Science and Technology of China, Hefei, China CAS Key Laboratory of Molecular Imaging, Beijing Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing, China, CAS Key Laboratory of Molecular Imaging, Beijing Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing, China, CAS Key Laboratory of Molecular Imaging, Beijing Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing, China, Center for Biomedical Imaging, University of Science and Technology of China, Hefei, China, Department of Pathology, West China Hospital, Sichuan University, Chengdu, China, Center for Biomedical Imaging, University of Science and Technology of China, Hefei, China CAS Key Laboratory of Molecular Imaging, Beijing Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing, China Key Laboratory of Big Data-Based Precision Medicine, Ministry of Industry and Information Technology, School of Engineering Medicine, Beihang University, Beijing, China
Abstract:
Directly predicting human epidermal growth factor receptor 2 (HER2) status from widely available hematoxylin and eosin (HE)-stained whole slide images (WSIs) can reduce technical costs and expedite treatment selection. Accurately predicting HER2 requires large collections of multi-site WSIs. Federated learning enables collaborative training on these WSIs without transporting gigabyte-size WSIs or raising data privacy concerns. However, federated learning encounters challenges in addressing label imbalance in multi-site WSIs from the real world. Moreover, existing WSI classification methods cannot simultaneously exploit local context information and long-range dependencies in the site-end feature representation of federated learning. To address these issues, we present a point transformer with federated learning for multi-site HER2 status prediction from HE-stained WSIs. Our approach incorporates two novel designs. We propose a dynamic label distribution strategy and an auxiliary classifier, which helps to establish a well-initialized model and mitigate label distribution variations across sites. Additionally, we propose a farthest cosine sampling based on cosine distance. It can sample the most distinctive features and capture the long-range dependencies. Extensive experiments and analysis show that our method achieves state-of-the-art performance at four sites with a total of 2687 WSIs. Furthermore, we demonstrate that our model can generalize to two unseen sites with 229 WSIs. Code is available at: https://github.com/boyden/PointTransformerFL



Paperid:333
Authors:Bin Li, Ye Shi, Qian Yu, Jingya Wang
ShanghaiTech University, ShanghaiTech University, Beihang University, ShanghaiTech University
Abstract:
Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images sharing the same category across diverse domains without relying on labeled data. Prior approaches have typically decomposed the UCIR problem into two distinct tasks: intra-domain representation learning and cross-domain feature alignment. However, these segregated strategies overlook the potential synergies between these tasks. This paper introduces ProtoOT, a novel Optimal Transport formulation explicitly tailored for UCIR, which integrates intra-domain feature representation learning and cross-domain alignment into a unified framework. ProtoOT leverages the strengths of the K-means clustering method to effectively manage distribution imbalances inherent in UCIR. By utilizing K-means for generating initial prototypes and approximating class marginal distributions, we modify the constraints in Optimal Transport accordingly, significantly enhancing its performance in UCIR scenarios. Furthermore, we incorporate contrastive learning into the ProtoOT framework to further improve representation learning. This encourages local semantic consistency among features with similar semantics, while also explicitly enforcing separation between features and unmatched prototypes, thereby enhancing global discriminativeness. ProtoOT surpasses existing state-of-the-art methods by a notable margin across benchmark datasets. Notably, on DomainNet, ProtoOT achieves an average P@200 enhancement of 24.44%, and on Office-Home, it demonstrates a P@15 improvement of 12.12%. Code is available at https://github.com/HCVLAB/ProtoOT.
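The following is a rough sketch of how K-means-derived marginals can be plugged into an entropic optimal-transport assignment, in the spirit of the prototype-based formulation described above; the cosine cost, regularization strength, and Sinkhorn update scheme are illustrative assumptions rather than the paper's exact ProtoOT algorithm.

```python
# Hypothetical sketch: assign features to prototypes via entropic optimal
# transport, with the prototype marginal estimated from K-means cluster sizes
# instead of being assumed uniform (to reflect class imbalance).
import numpy as np
from sklearn.cluster import KMeans

def proto_ot_assign(features, k=10, eps=0.05, n_iters=100):
    km = KMeans(n_clusters=k, n_init=10).fit(features)
    prototypes = km.cluster_centers_
    # Approximate class marginals from cluster sizes (handles imbalance).
    counts = np.bincount(km.labels_, minlength=k).astype(float)
    col_marginal = counts / counts.sum()
    row_marginal = np.full(len(features), 1.0 / len(features))
    # Cost: one minus cosine similarity between features and prototypes.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cost = 1.0 - f @ p.T
    # Sinkhorn iterations for the entropically regularized transport plan.
    K = np.exp(-cost / eps)
    u = np.ones(len(features))
    for _ in range(n_iters):
        v = col_marginal / (K.T @ u + 1e-12)
        u = row_marginal / (K @ v + 1e-12)
    plan = u[:, None] * K * v[None, :]
    return plan.argmax(axis=1), prototypes
```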



Paperid:334
Authors:Bohan Li, Xiao Xu, Xinghao Wang, Yutai Hou, Yunlong Feng, Feng Wang, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che
Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology
Abstract:
Existing image augmentation methods consist of two categories: perturbation-based methods and generative methods. Perturbation-based methods apply pre-defined perturbations to augment an original image, but only locally vary the image, thus lacking image diversity. In contrast, generative methods bring more image diversity in the augmented images but may not preserve semantic consistency, thus may incorrectly change the essential semantics of the original image. To balance image diversity and semantic consistency in augmented images, we propose SGID, a Semantic-guided Generative Image augmentation method with Diffusion models for image classification. Specifically, SGID employs diffusion models to generate augmented images with good image diversity. More importantly, SGID takes image labels and captions as guidance to maintain semantic consistency between the augmented and original images. Experimental results show that SGID outperforms the best augmentation baseline by 1.72% on ResNet-50 (from scratch), 0.33% on ViT (ImageNet-21k), and 0.14% on CLIP-ViT (LAION-2B). Moreover, SGID can be combined with other image augmentation baselines and further improves the overall performance. We demonstrate the semantic consistency and image diversity of SGID through quantitative human and automated evaluations, as well as qualitative case studies.



Paperid:335
Authors:Bohan Li, Yasheng Sun, Jingxin Dong, Zheng Zhu, Jinming Liu, Xin Jin, Wenjun Zeng
Shanghai Jiao Tong University, Shanghai, China Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China, Tokyo Institute of Technology, Tokyo, Japan, Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China, PhiGent Robotics, Beijing, China, Shanghai Jiao Tong University, Shanghai, China Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China, Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China, Shanghai Jiao Tong University, Shanghai, China Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Abstract:
Numerous studies have investigated the pivotal role of reliable 3D volume representation in scene perception tasks, such as multi-view stereo (MVS) and semantic scene completion (SSC). They typically construct 3D probability volumes directly with geometric correspondence, attempting to fully address the scene perception tasks in a single forward pass. However, such a single-step solution makes it hard to learn accurate and convincing volumetric probability, especially in challenging regions like unexpected occlusions and complicated light reflections. Therefore, this paper proposes to decompose the complicated 3D volume representation learning into a sequence of generative steps to facilitate fine and reliable scene perception. Considering the recent advances achieved by strong generative diffusion models, we introduce a multi-step learning framework, dubbed as VPD, dedicated to progressively refining the Volumetric Probability in a Diffusion process. Specifically, we first build a coarse probability volume from input images with the off-the-shelf scene perception baselines, which is then conditioned as the basic geometry prior before being fed into a 3D diffusion UNet, to progressively achieve accurate probability distribution modeling. To handle the corner cases in challenging areas, a Confidence-Aware Contextual Collaboration (CACC) module is developed to correct the uncertain regions for reliable volumetric learning based on multi-scale contextual contents. Moreover, an Online Filtering (OF) strategy is designed to maintain representation consistency for stable diffusion sampling. Extensive experiments are conducted on scene perception tasks including multi-view stereo (MVS) and semantic scene completion (SSC), to validate the efficacy of our method in learning reliable volumetric representations. Notably, for the SSC task, our work stands out as the first to surpass LiDAR-based methods on the SemanticKITTI dataset.



Paperid:336
Authors:Dongze Li, Kang Zhao, Wei Wang, Bo Peng, Yingya Zhang, Jing Dong, Tieniu Tan
School of Artificial Intelligence, University of Chinese Academy of Sciences CRIPAC & MAIS, Institute of Automation, Chinese Academy of Sciences, Alibaba Group, CRIPAC & MAIS, Institute of Automation, Chinese Academy of Sciences, CRIPAC & MAIS, Institute of Automation, Chinese Academy of Sciences, Alibaba Group, CRIPAC & MAIS, Institute of Automation, Chinese Academy of Sciences, CRIPAC & MAIS, Institute of Automation, Chinese Academy of Sciences Nanjing University
Abstract:
Audio-driven talking head synthesis is a promising topic with wide applications in digital human, filmmaking and virtual reality. Recent NeRF-based approaches have shown superiority in quality and fidelity compared to previous studies. However, when it comes to few-shot talking head generation, a practical scenario where only a few seconds of talking video is available for one identity, two limitations emerge: 1) they either have no base model, which serves as a facial prior for fast convergence, or ignore the importance of audio when building the prior; 2) most of them overlook the degree of correlation between different face regions and audio, e.g., mouth is audio related, while ear is audio independent. In this paper, we present Audio Enhanced Neural Radiance Field (AE-NeRF) to tackle the above issues, which can generate realistic portraits of a new speaker with few-shot dataset. Specifically, we introduce an Audio Aware Aggregation module into the feature fusion stage of the reference scheme, where the weight is determined by the similarity of audio between reference and target image. Then, an Audio-Aligned Face Generation strategy is proposed to model the audio related and audio independent regions respectively, with a dual-NeRF framework. Extensive experiments have shown AE-NeRF surpasses the state-of-the-art on image fidelity, audio-lip synchronization, and generalization ability, even in limited training set or training iterations.
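A minimal sketch of audio-similarity-weighted feature aggregation, loosely following the Audio Aware Aggregation idea above; the cosine-similarity weighting, softmax temperature, and tensor shapes are assumptions for illustration, not the AE-NeRF code.

```python
# Hypothetical sketch: fuse reference-frame features with weights given by a
# softmax over audio similarity between the target frame and each reference.
import torch
import torch.nn.functional as F

def audio_aware_aggregate(ref_feats, ref_audio, target_audio, temperature=0.1):
    """ref_feats: (R, D); ref_audio: (R, A); target_audio: (A,). Shapes assumed."""
    sim = F.cosine_similarity(ref_audio, target_audio.unsqueeze(0), dim=1)  # (R,)
    weights = F.softmax(sim / temperature, dim=0)
    # Weighted sum of reference features; references with similar audio dominate.
    return (weights.unsqueeze(1) * ref_feats).sum(dim=0)  # (D,)
```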



Paperid:337
Authors:Hanhui Li, Xiaojian Lin, Xuan Huang, Zejun Yang, Zhisheng Wang, Xiaodan Liang
Shenzhen Campus of Sun Yat-sen University, Shenzhen, China, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China, Tencent, Shenzhen, China, Tencent, Shenzhen, China, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China DarkMatter AI Research, Guangzhou, China
Abstract:
Current parametric models have made notable progress in 3D hand pose and shape estimation. However, due to the fixed hand topology and complex hand poses, current models struggle to generate meshes that are well aligned with the image. To tackle this issue, we introduce a dual noise estimation method in this paper. Given a single-view image as input, we first adopt a baseline parametric regressor to obtain the coarse hand meshes. We assume the mesh vertices and their image-plane projections are noisy, and can be associated in a unified probabilistic model. We then learn the distributions of noise to refine mesh vertices and their projections. The refined vertices are further utilized to refine camera parameters in a closed-form manner. Consequently, our method obtains well-aligned and high-quality 3D hand meshes. Extensive experiments on the large-scale Interhand2.6M dataset demonstrate that the proposed method not only improves the performance of its baseline by more than 10% but also achieves state-of-the-art performance. Project page: https://github.com/hanhuili/DNE4Hand.



Paperid:338
Authors:Hanxuan Li, Bin Fu, Ruiping Wang, Xilin Chen
Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences
Abstract:
Recognition in open-world scenarios is an important and challenging field, where Vision-Language Pre-training paradigms have greatly impacted the 2D domain. This inspires a growing interest in introducing 2D pre-trained models, such as CLIP, into the 3D domain to enhance the ability of point cloud understanding. Considering the difference between discrete 3D point clouds and real-world 2D images, reducing the domain gap is crucial. Some recent works project point clouds onto a 2D plane to enable 3D zero-shot capabilities without training. However, this simplistic approach leads to an unclear or even distorted geometric structure, limiting the potential of 2D pre-trained models in 3D. To address the domain gap, we propose Point2Real, a training-free framework based on the realistic rendering technique to automate the transformation of the 3D point cloud domain into the Vision-Language domain. Specifically, Point2Real leverages a shape recovery module that devises an iterative ball-pivoting algorithm to convert point clouds into meshes, narrowing the gap in shape at first. To simulate photo-realistic images, a set of refined textures as candidates is applied for rendering, where the CLIP confidence is utilized to select the suitable one. Moreover, to tackle the viewpoint challenge, a heuristic multi-view adapter is implemented for feature aggregation, which exploits the depth surface as an effective indicator of view-specific discriminability for recognition. We conduct experiments on ModelNet10, ModelNet40, and ScanObjectNN datasets, and the results demonstrate that Point2Real outperforms other approaches in zero-shot and few-shot tasks by a large margin.



Paperid:339
Authors:Hao Li, Mengqi Huang, Lei Zhang, Bo Hu, Yi Liu, Zhendong Mao
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, State Key Laboratory of Communication Content Cognition, University of Science and Technology of China
Abstract:
GAN-based image attribute editing firstly leverages GAN Inversion to project real images into the latent space of GAN and then manipulates corresponding latent codes. Recent inversion methods mainly utilize additional high-bit features to improve image details preservation, as low-bit codes cannot faithfully reconstruct source images, leading to the loss of details. However, during editing, existing works fail to accurately complement the lost details and suffer from poor editability. The main reason is that they inject all the lost details indiscriminately at one time, which inherently induces the position and quantity of details to overfit source images, resulting in inconsistent content and artifacts in edited images. This work argues that details should be gradually injected into both the reconstruction and editing process in a multi-stage coarse-to-fine manner for better detail preservation and high editability. Therefore, a novel dual-stream framework is proposed to accurately complement details at each stage. The Reconstruction Stream is employed to embed coarse-to-fine lost details into residual features and then adaptively add them to the GAN generator. In the Editing Stream, residual features are accurately aligned by our Selective Attention mechanism and then injected into the editing process in a multi-stage manner. Extensive experiments have shown the superiority of our framework in both reconstruction accuracy and editing quality compared with existing methods.



Paperid:340
Authors:Haolong Li, Chenghao Du, Ziheng Jiang, Yifan Zhang, Jiawei Ma, Chen Ye
Tongji University, Tongji University, Tongji University, Tongji University, Tongji University, Tongji University
Abstract:
Automated Chinese ancient character restoration (ACACR) remains a challenging task due to its historical significance and aesthetic complexity. Existing methods are constrained by non-professional masks and even overfitting when training on small-scale datasets, which hinder their interdisciplinary application to traditional fields. In this paper, we are proud to introduce the Chinese Ancient Rubbing and Manuscript Character Dataset (ARMCD), which consists of 15,553 real-world ancient single-character images with 42 rubbings and manuscripts, covering the works of over 200 calligraphy artists spanning from 200 to 1,800 AD. We are also dedicated to providing professional synthetic masks by extracting localized erosion from real eroded images. Moreover, we propose DiffACR (Diffusion model for automated Chinese Ancient Character Restoration), a diffusion-based method for the ACACR task. Specifically, we regard the synthesis of eroded images as a special form of cold diffusion on uneroded ones and extract the prior mask directly from the eroded images. Our experiments demonstrate that our method comprehensively outperforms most existing methods on the proposed ARMCD. Dataset and code are available at https://github.com/lhl322001/DiffACR.



Paperid:341
Authors:Hongjie Li, Yao Guo, Xianwei Zheng, Hanjiang Xiong
The State Key Lab. LIESMARS, Wuhan University, The State Key Lab. LIESMARS, Wuhan University, The State Key Lab. LIESMARS, Wuhan University, The State Key Lab. LIESMARS, Wuhan University
Abstract:
This paper introduces a learnable Deformable Hypothesis Sampler (DeformSampler) to address the challenging issue of noisy depth estimation in faithful PatchMatch multi-view stereo (MVS). We observe that the heuristic depth hypothesis sampling modes employed by PatchMatch MVS solvers are insensitive to (i) the piece-wise smooth distribution of depths across the object surface and (ii) the implicit multi-modal distribution of depth prediction probabilities along the ray direction on the surface points. Accordingly, we develop DeformSampler to learn distribution-sensitive sample spaces to (i) propagate depths consistent with the scene's geometry across the object surface and (ii) fit a Laplace Mixture model that approaches the point-wise probability distribution of the actual depths along the ray direction. We integrate DeformSampler into a learnable PatchMatch MVS system to enhance depth estimation in challenging areas, such as piece-wise discontinuous surface boundaries and weakly-textured regions. Experimental results on DTU and Tanks & Temples datasets demonstrate its superior performance and generalization capabilities compared to state-of-the-art competitors. Code is available at https://github.com/Geo-Tell/DS-PMNet.



Paperid:342
Authors:Huafeng Li, Qingsong Hu, Zhanxuan Hu
Kunming University of Science and Technology, Kunming University of Science and Technology, School of Information Science and Technology, Yunnan Normal University
Abstract:
Clustering-based methods are emerging as a ubiquitous technology in unsupervised object Re-Identification (ReID), which alternate between pseudo-label generation and representation learning. Recent advances in this field mainly fall into two groups: pseudo-label correction and robust representation learning. Differently, in this work, we improve unsupervised object ReID from feature calibration, a completely different but complementary insight from the current approaches. Specifically, we propose to insert a conceptually simple yet empirically powerful Feature Calibration Module (FCM) before pseudo-label generation. In practice, FCM calibrates the features using a nonparametric graph attention network, enforcing similar instances to move together in the feature space while allowing dissimilar instances to separate. As a result, we can generate more reliable pseudo-labels using the calibrated features and further improve subsequent representation learning. FCM is simple, effective, parameter-free, training-free, plug-and-play, and can be considered as a catalyst, increasing the 'chemical reaction' between pseudo-label generation and representation learning. Moreover, it maintains the efficiency of testing time with negligible impact on training time. In this paper, we insert FCM into a simple baseline. Experiments across different scenarios and benchmarks show that FCM consistently improves the baseline (e.g., 8.2% mAP gain on MSMT17), and achieves the new state-of-the-art results. Code is available at: https://github.com/lhf12278/FCM-ReID.
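To illustrate what a parameter-free, similarity-driven calibration step could look like before pseudo-label generation, here is a small PyTorch sketch; the top-k neighborhood, temperature, and blending coefficient are assumed choices and may differ from the paper's FCM.

```python
# Hypothetical sketch: calibrate features with a non-parametric, attention-like
# aggregation over nearest neighbors so that similar instances move together.
import torch
import torch.nn.functional as F

def calibrate_features(feats, k=20, temperature=0.05, alpha=0.5):
    """feats: (N, D) tensor; returns calibrated, re-normalized features."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()                      # pairwise cosine similarity
    topk_sim, topk_idx = sim.topk(k + 1, dim=1)  # each point's neighbors (incl. self)
    attn = F.softmax(topk_sim / temperature, dim=1)
    neighbors = feats[topk_idx]                  # (N, k+1, D)
    aggregated = (attn.unsqueeze(-1) * neighbors).sum(dim=1)
    # Blend original and aggregated features before clustering.
    return F.normalize(alpha * feats + (1 - alpha) * aggregated, dim=1)
```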



Paperid:343
Authors:Jiafeng Li, Zelin Li, Ying Wen
East China Normal University, New York University Shanghai, East China Normal University
Abstract:
Deep neural networks (DNNs) have achieved remarkable success in various fields, and two powerful techniques, feature normalization and attention mechanisms, have been widely used to enhance model performance. However, they are usually considered as two separate approaches or combined in a simplistic manner. In this paper, we investigate the intrinsic relationship between feature normalization and attention mechanisms and propose an Efficient Attention module guided by Normalization, dubbed EAN. Instead of using costly fully-connected layers for attention learning, EAN leverages the strengths of feature normalization and incorporates an Attention Generation (AG) unit to re-calibrate features. The proposed AG unit exploits the normalization component as a measure of the importance of distinct features and generates an attention mask using GroupNorm, L2 Norm, and Adaptation operations. By combining the grouping, AG unit, and aggregation strategies, EAN is established, offering a unified module that harnesses the advantages of both normalization and attention, while maintaining minimal computational overhead. Furthermore, EAN serves as a plug-and-play module that can be seamlessly integrated with classic backbone architectures. Extensive quantitative evaluations on various visual tasks demonstrate that EAN achieves highly competitive performance compared to the current state-of-the-art attention methods while sustaining lower model complexity.
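A hedged sketch of a normalization-guided attention gate in the spirit of the AG unit described above; the exact composition of GroupNorm, L2 norm, and the adaptation step in EAN may differ, and the module below is only one plausible reading.

```python
# Hypothetical sketch: read per-channel importance off a GroupNorm-ed feature
# via an L2 norm, adapt it with a lightweight per-channel affine, and use a
# sigmoid gate to re-calibrate the input feature map.
import torch
import torch.nn as nn

class NormGuidedAttention(nn.Module):
    def __init__(self, channels, groups=8):
        super().__init__()
        # channels must be divisible by groups for GroupNorm.
        self.gn = nn.GroupNorm(groups, channels)
        # "Adaptation": per-channel scale and shift on the importance scores.
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        normed = self.gn(x)
        # L2 norm over spatial positions as a per-channel importance measure.
        importance = normed.pow(2).mean(dim=(2, 3), keepdim=True).sqrt()
        mask = torch.sigmoid(self.gamma * importance + self.beta)
        return x * mask
```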



Paperid:344
Authors:Jianwu Li, Kaiyue Shi, Guo-Sen Xie, Xiaofeng Liu, Jian Zhang, Tianfei Zhou
Beijing Institute of Technology, Beijing Institute of Technology, Nanjing University of Science and Technology, Hohai University, University of Technology Sydney, Beijing Institute of Technology
Abstract:
The goal of this paper is to alleviate the training cost for few-shot semantic segmentation (FSS) models. Although FSS by nature improves model generalization to new concepts using only a handful of test exemplars, it relies on strong supervision from a considerable amount of labeled training data for base classes. However, collecting pixel-level annotations is notoriously expensive and time-consuming, and small-scale training datasets convey low information density that limits test-time generalization. To resolve the issue, we take a pioneering step towards label-efficient training of FSS models from fully unlabeled training data, or additionally a few labeled samples to enhance the performance. This motivates an approach based on a novel unsupervised meta-training paradigm. In particular, the approach first distills pre-trained unsupervised pixel embedding into compact semantic clusters from which a massive number of pseudo meta-tasks is constructed. To mitigate the noise in the pseudo meta-tasks, we further advocate a robust Transformer-based FSS model with a novel prototype-based cross-attention design. Extensive experiments have been conducted on two standard benchmarks, i.e., PASCAL-5i and COCO-20i, and the results show that our method produces impressive performance without any annotations, and is comparable to fully supervised competitors even using only 20% of the annotations. Our code is available at: https://github.com/SSSKYue/UMTFSS.



Paperid:345
Authors:Jichang Li, Guanbin Li, Hui Cheng, Zicheng Liao, Yizhou Yu
School of Computer Science and Engineering, Research Institute of Sun Yat-sen University in Shenzhen, Sun Yat-sen University, Guangzhou, China Department of Computer Science, The University of Hong Kong, Hong Kong, School of Computer Science and Engineering, Research Institute of Sun Yat-sen University in Shenzhen, Sun Yat-sen University, Guangzhou, China Guangdong Province Key Laboratory of Information Security Technology, UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China, Department of Computer Science, The University of Hong Kong, Hong Kong Zhejiang University, Department of Computer Science, The University of Hong Kong, Hong Kong
Abstract:
Federated Learning with Noisy Labels (F-LNL) aims at seeking an optimal server model via collaborative distributed learning by aggregating multiple client models trained with local noisy or clean samples. On the basis of a federated learning framework, recent advances primarily adopt label noise filtering to separate clean samples from noisy ones on each client, thereby mitigating the negative impact of label noise. However, these prior methods do not learn noise filters by exploiting knowledge across all clients, leading to sub-optimal and inferior noise filtering performance and thus damaging training stability. In this paper, we present FedDiv to tackle the challenges of F-LNL. Specifically, we propose a global noise filter called Federated Noise Filter for effectively identifying samples with noisy labels on every client, thereby raising stability during local training sessions. Without sacrificing data privacy, this is achieved by modeling the global distribution of label noise across all clients. Then, in an effort to make the global model achieve higher performance, we introduce a Predictive Consistency based Sampler to identify more credible local data for local model training, thus preventing noise memorization and further boosting the training stability. Extensive experiments on CIFAR-10, CIFAR-100, and Clothing1M demonstrate that FedDiv achieves superior performance over state-of-the-art F-LNL methods under different label noise settings for both IID and non-IID data partitions. Source code is publicly available at https://github.com/lijichang/FLNL-FedDiv.



Paperid:346
Authors:Jing Li, Junsong Fan, Yuran Yang, Shuqi Mei, Jun Xiao, Zhaoxiang Zhang
University of Chinese Academy of Sciences (UCAS) Institute of Automation, Chinese Academy of Sciences (CASIA) State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Centre for Artificial Intelligence and Robotics, HKISI CAS, Tencent Maps, Tencent, Tencent Maps, Tencent, University of Chinese Academy of Sciences (UCAS), University of Chinese Academy of Sciences (UCAS) Institute of Automation, Chinese Academy of Sciences (CASIA) Centre for Artificial Intelligence and Robotics, HKISI CAS State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS)
Abstract:
The core of pointly-supervised panoptic segmentation is estimating accurate dense pseudo labels from sparse point labels to train the panoptic head. Previous works generate pseudo labels mainly based on hand-crafted rules, such as connecting multiple points into polygon masks, or assigning the label information of labeled pixels to unlabeled pixels based on the artificially defined traversing distance. The accuracy of pseudo labels is limited by the quality of the hand-crafted rules (polygon masks are rough at object contour regions, and the traversing distance error will result in wrong pseudo labels). To overcome the limitation of hand-crafted rules, we estimate pseudo labels with a fully data-driven pseudo label branch, which is optimized by point labels end-to-end and predicts more accurate pseudo labels than previous methods. We also train an auxiliary semantic branch with point labels, which assists the training of the pseudo label branch by transferring semantic segmentation knowledge through shared parameters. Experiments on Pascal VOC and MS COCO demonstrate that our approach is effective and shows state-of-the-art performance compared with related works. Codes are available at https://github.com/BraveGroup/FDD.



Paperid:347
Authors:Kailin Li, Lixin Yang, Zenan Lin, Jian Xu, Xinyu Zhan, Yifei Zhao, Pengxiang Zhu, Wenxiong Kang, Kejian Wu, Cewu Lu
Shanghai Jiao Tong University, Shanghai Jiao Tong University, South China University of Technology, XREAL, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, South China University of Technology, XREAL, Shanghai Jiao Tong University
Abstract:
Rearrangement operations form the crux of interactions between humans and their environment. The ability to generate natural, fluid sequences of this operation is of essential value in AR/VR and CG. Bridging a gap in the field, our study introduces FAVOR: a novel dataset for Full-body AR-driven Virtual Object Rearrangement that uniquely employs motion capture systems and AR eyeglasses. Comprising 3k diverse motion rearrangement sequences and 7.17 million interaction data frames, this dataset breaks new ground in research data. We also present a pipeline FAVORITE for producing digital human rearrangement motion sequences guided by instructions. Experimental results, both qualitative and quantitative, suggest that this dataset and pipeline deliver high-quality motion sequences. Our dataset, code, and appendix are available at https://kailinli.github.io/FAVOR.



Paperid:348
Authors:Li Li, Wei Ji, Yiming Wu, Mengze Li, You Qin, Lina Wei, Roger Zimmermann
National University of Singapore, National University of Singapore, The University of Sydney, Zhejiang University, National University of Singapore, Hangzhou City University, National University of Singapore
Abstract:
Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships (predicate) to connect human language and visual scenes. However, different language preferences of annotators and semantic overlaps between predicates lead to biased predicate annotations in the dataset, i.e., different predicates for the same object pairs. Biased predicate annotations make PSG models struggle in constructing a clear decision plane among predicates, which greatly hinders the real application of PSG models. To address the intrinsic bias above, we propose a novel framework named ADTrans to adaptively transfer biased predicate annotations to informative and unified ones. To ensure consistency and accuracy during the transfer process, we propose to observe the invariance degree of representations in each predicate class, and learn unbiased prototypes of predicates with different intensities. Meanwhile, we continuously measure the distribution changes between each representation and its prototype, and constantly screen potentially biased data. Finally, with the unbiased predicate-prototype representation embedding space, biased annotations are easily identified. Experiments show that ADTrans significantly improves the performance of benchmark models, achieving a new state-of-the-art performance, and shows great generalization and effectiveness on multiple datasets. Our code is released at https://github.com/lili0415/PSG-biased-annotation.



Paperid:349
Authors:Ru Li, Jia Liu, Guanghui Liu, Shengping Zhang, Bing Zeng, Shuaicheng Liu
Harbin Institute of Technology, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, Harbin Institute of Technology, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
In this paper, we propose SpectralNeRF, an end-to-end Neural Radiance Field (NeRF)-based architecture for high-quality physically based rendering from a novel spectral perspective. We modify the classical spectral rendering into two main steps, 1) the generation of a series of spectrum maps spanning different wavelengths, 2) the combination of these spectrum maps for the RGB output. Our SpectralNeRF follows these two steps through the proposed multi-layer perceptron (MLP)-based architecture (SpectralMLP) and Spectrum Attention UNet (SAUNet). Given the ray origin and the ray direction, the SpectralMLP constructs the spectral radiance field to obtain spectrum maps of novel views, which are then sent to the SAUNet to produce RGB images of white-light illumination. Applying NeRF to build up the spectral rendering is a more physically-based way from the perspective of ray-tracing. Further, the spectral radiance fields decompose difficult scenes and improve the performance of NeRF-based methods. Comprehensive experimental results demonstrate the proposed SpectralNeRF is superior to recent NeRF-based methods when synthesizing new views on synthetic and real datasets. The codes and datasets are available at https://github.com/liru0126/SpectralNeRF.



Paperid:350
Authors:Shengtao Li, Ge Gao, Yudong Liu, Yu-Shen Liu, Ming Gu
Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China, School of Software, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China School of Software, Tsinghua University, Beijing, China
Abstract:
Implicit neural networks have emerged as a crucial technology in 3D surface reconstruction. To reconstruct continuous surfaces from discrete point clouds, encoding the input points into regular grid features (plane or volume) has been commonly employed in existing approaches. However, these methods typically use the grid as an index for uniformly scattering point features. Compared with the irregular point features, the regular grid features may sacrifice some reconstruction details but improve efficiency. To take full advantage of these two types of features, we introduce a novel and high-efficiency attention mechanism between the grid and point features named Point-Grid Transformer (GridFormer). This mechanism treats the grid as a transfer point connecting the space and point cloud. Our method maximizes the spatial expressiveness of grid features and maintains computational efficiency. Furthermore, optimizing predictions over the entire space could potentially result in blurred boundaries. To address this issue, we further propose a boundary optimization strategy incorporating margin binary cross-entropy loss and boundary sampling. This approach enables us to achieve a more precise representation of the object structure. Our experiments validate that our method is effective and outperforms the state-of-the-art approaches under widely used benchmarks by producing more precise geometry reconstructions. The code is available at https://github.com/list17/GridFormer.
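As an illustration of a margin binary cross-entropy and boundary sampling for implicit surface supervision, a small sketch follows; the margin formulation and the band width are assumptions on my part, not necessarily GridFormer's exact losses.

```python
# Hypothetical sketch: (1) a margin BCE that shifts logits against the label
# so occupancy predictions near the surface must be confidently correct, and
# (2) boundary sampling that keeps only queries within a thin band of the surface.
import torch
import torch.nn.functional as F

def margin_bce(logits, labels, margin=0.1):
    """labels: float tensor in {0, 1}; logits must exceed +/- margin to be 'safe'."""
    shifted = logits - margin * (2 * labels - 1)
    return F.binary_cross_entropy_with_logits(shifted, labels)

def sample_boundary_points(points, distances, band=0.01):
    """Keep query points whose |distance to surface| lies inside an assumed band."""
    mask = distances.abs() < band
    return points[mask]
```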



Paperid:351
Authors:Shenshen Li, Chen He, Xing Xu, Fumin Shen, Yang Yang, Heng Tao Shen
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Text-based person retrieval aims at retrieving a specific pedestrian image from a gallery based on textual descriptions. The primary challenge is how to overcome the inherent heterogeneous modality gap in the situation of significant intra-class variation and minimal inter-class variation. Existing approaches commonly employ vision-language pre-training or attention mechanisms to learn appropriate cross-modal alignments from noise inputs. Despite commendable progress, current methods inevitably suffer from two defects: 1) Matching ambiguity, which mainly derives from unreliable matching pairs; 2) One-sided cross-modal alignments, stemming from the absence of exploring one-to-many correspondence, i.e., coarse-grained semantic alignment. These critical issues significantly deteriorate retrieval performance. To this end, we propose a novel framework termed Adaptive Uncertainty-based Learning (AUL) for text-based person retrieval from the uncertainty perspective. Specifically, our AUL framework consists of three key components: 1) Uncertainty-aware Matching Filtration that leverages Subjective Logic to effectively mitigate the disturbance of unreliable matching pairs and select high-confidence cross-modal matches for training; 2) Uncertainty-based Alignment Refinement, which not only simulates coarse-grained alignments by constructing uncertainty representations but also performs progressive learning to incorporate coarse- and fine-grained alignments properly; 3) Cross-modal Masked Modeling that aims at exploring more comprehensive relations between vision and language. Extensive experiments demonstrate that our AUL method consistently achieves state-of-the-art performance on three benchmark datasets in supervised, weakly supervised, and domain generalization settings. Our code is available at https://github.com/CFM-MSG/Code-AUL.



Paperid:352
Authors:Shujuan Li, Junsheng Zhou, Baorui Ma, Yu-Shen Liu, Zhizhong Han
Tsinghua University, Tsinghua University, Tsinghua University Beijing Academy of Artificial Intelligence, Tsinghua University, Wayne State University
Abstract:
Point cloud upsampling aims to generate dense and uniformly distributed point sets from a sparse point cloud, which plays a critical role in 3D computer vision. Previous methods typically split a sparse point cloud into several local patches, upsample patch points, and merge all upsampled patches. However, these methods often produce holes, outliers or non-uniformity due to the splitting and merging process which does not maintain consistency among local patches. To address these issues, we propose a novel approach that learns an unsigned distance field guided by local priors for point cloud upsampling. Specifically, we train a local distance indicator (LDI) that predicts the unsigned distance from a query point to a local implicit surface. Utilizing the learned LDI, we learn an unsigned distance field to represent the sparse point cloud with patch consistency. At inference time, we randomly sample queries around the sparse point cloud, and project these query points onto the zero-level set of the learned implicit field to generate a dense point cloud. We justify that the implicit field is naturally continuous, which inherently enables the application of arbitrary-scale upsampling without necessarily retraining for various scales. We conduct comprehensive experiments on both synthetic data and real scans, and report state-of-the-art results under widely used benchmarks. Project page: https://lisj575.github.io/APU-LDI
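Projecting query points onto the zero-level set of a learned unsigned distance field is typically done with a gradient-based step; the sketch below shows that generic step, with `udf_net` as an assumed callable (mapping (N, 3) points to (N,) distances) rather than the paper's trained model.

```python
# Generic sketch: move each query along the negative gradient of the unsigned
# distance field by its predicted distance, iterating a few times to land on
# (or very near) the zero-level set.
import torch

def project_to_surface(udf_net, queries, steps=5):
    points = queries.clone()
    for _ in range(steps):
        points.requires_grad_(True)
        dist = udf_net(points)                                  # (N,)
        grad = torch.autograd.grad(dist.sum(), points)[0]       # (N, 3)
        direction = grad / (grad.norm(dim=1, keepdim=True) + 1e-8)
        # Step toward the surface by the predicted distance, then detach.
        points = (points - dist.unsqueeze(1) * direction).detach()
    return points
```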



Paperid:353
Authors:Weiqi Li, Fan Lyu, Fanhua Shang, Liang Wan, Wei Feng
College of Intelligence and Computing, Tianjin University, College of Intelligence and Computing, Tianjin University CRIPAC, MAIS, CASIA, College of Intelligence and Computing, Tianjin University, College of Intelligence and Computing, Tianjin University, College of Intelligence and Computing, Tianjin University
Abstract:
Real-world data is extremely imbalanced and presents a long-tailed distribution, resulting in models biased towards classes with sufficient samples and performing poorly on rare classes. Recent methods propose to rebalance classes but they face the seesaw dilemma (improving performance on tail classes may decrease that of head classes, and vice versa). In this paper, we argue that the seesaw dilemma is derived from the gradient imbalance of different classes, in which gradients of inappropriate classes are set to important for updating, thus prone to overcompensation or undercompensation on tail classes. To achieve ideal compensation, we formulate long-tailed recognition as a multi-objective optimization problem, which fairly respects the contributions of head and tail classes simultaneously. For efficiency, we propose a Gradient-Balancing Grouping (GBG) strategy to gather the classes with similar gradient directions, thus approximately making every update under a Pareto descent direction. Our GBG method drives classes with similar gradient directions to form a more representative gradient and provides ideal compensation to the tail classes. Moreover, we conduct extensive experiments on commonly used benchmarks in long-tailed learning and demonstrate the superiority of our method over existing SOTA methods. Our code is released at https://github.com/WickyLee1998/GBG_v1.
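One plausible reading of gradient-balancing grouping is a greedy clustering of classes by the cosine similarity of their per-class gradients, as sketched below; the similarity threshold and the greedy scheme are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch: group classes whose per-class gradients point in
# similar directions so each group contributes one representative gradient.
import torch
import torch.nn.functional as F

def group_by_gradient_direction(class_grads, sim_threshold=0.5):
    """class_grads: (C, D) tensor, one flattened gradient vector per class."""
    grads = F.normalize(class_grads, dim=1)
    sim = grads @ grads.t()
    groups, assigned = [], set()
    for c in range(grads.shape[0]):
        if c in assigned:
            continue
        # Start a new group with every unassigned class similar enough to class c.
        members = [j for j in range(grads.shape[0])
                   if j not in assigned and sim[c, j] >= sim_threshold]
        assigned.update(members)
        groups.append(members)
    return groups
```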



Paperid:354
Authors:Xi Li, Songhe Wang, Ruiquan Huang, Mahanth Gowda, George Kesidis
The Pennsylvania State University, The Pennsylvania State University, The Pennsylvania State University, The Pennsylvania State University, The Pennsylvania State University
Abstract:
Deep neural networks (DNNs) have achieved tremendous success in various applications including video action recognition, yet remain vulnerable to backdoor attacks (Trojans). The backdoor-compromised model will mis-classify to the target class chosen by the attacker when a test instance (from a non-target class) is embedded with a specific trigger, while maintaining high accuracy on attack-free instances. Although there are extensive studies on backdoor attacks against image data, the susceptibility of video-based systems under backdoor attacks remains largely unexplored. Current studies are direct extensions of approaches proposed for image data, e.g., the triggers are independently embedded within the frames, which tend to be detectable by existing defenses. In this paper, we introduce a simple yet effective backdoor attack against video data. Our proposed attack, adding perturbations in a transformed domain, plants an imperceptible, temporally distributed trigger across the video frames, and is shown to be resilient to existing defensive strategies. The effectiveness of the proposed attack is demonstrated by extensive experiments with various well-known models on two video recognition benchmarks, UCF101 and HMDB51, and a sign language recognition benchmark, Greek Sign Language (GSL) dataset. We delve into the impact of several influential factors on our proposed attack and identify an intriguing effect termed "collateral damage" through extensive studies.



Paperid:355
Authors:Xiang Li, Junbo Yin, Wei Li, Chengzhong Xu, Ruigang Yang, Jianbing Shen
Beijing Institute of Technology, Beijing Institute of Technology, Inceptio, University of Macau, Inceptio, University of Macau
Abstract:
Vehicle-to-Everything (V2X) collaborative perception has recently gained significant attention due to its capability to enhance scene understanding by integrating information from various agents, e.g., vehicles, and infrastructure. However, current works often treat the information from each agent equally, ignoring the inherent domain gap caused by the utilization of different LiDAR sensors of each agent, thus leading to suboptimal performance. In this paper, we propose DI-V2X, which aims to learn Domain-Invariant representations through a new distillation framework to mitigate the domain discrepancy in the context of V2X 3D object detection. DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module. Specifically, DMA builds a domain-mixing 3D instance bank for the teacher and student models during training, resulting in aligned data representation. Next, PDD encourages the student models from different domains to gradually learn a domain-invariant feature representation towards the teacher, where the overlapping regions between agents are employed as guidance to facilitate the distillation process. Furthermore, DAF closes the domain gap between the students by incorporating calibration-aware domain-adaptive attention. Extensive experiments on the challenging DAIR-V2X and V2XSet benchmark datasets demonstrate DI-V2X achieves remarkable performance, outperforming all the previous V2X models. Code is available at https://github.com/Serenos/DI-V2X.



Paperid:356
Authors:Xiawei Li, Qingyuan Xu, Jing Zhang, Tianyi Zhang, Qian Yu, Lu Sheng, Dong Xu
Beihang University, Beihang University, Beihang University, Zhejiang University, Beihang University, Beihang University, The University of Hong Kong
Abstract:
3D point cloud semantic segmentation has a wide range of applications. Recently, weakly supervised point cloud segmentation methods have been proposed, aiming to alleviate the expensive and laborious manual annotation process by leveraging scene-level labels. However, these methods have not effectively exploited the rich geometric information (such as shape and scale) and appearance information (such as color and texture) present in RGB-D scans. Furthermore, current approaches fail to fully leverage the point affinity that can be inferred from the feature extraction network, which is crucial for learning from weak scene-level labels. Additionally, previous work overlooks the detrimental effects of the long-tailed distribution of point cloud data in weakly supervised 3D semantic segmentation. To this end, this paper proposes a simple yet effective scene-level weakly supervised point cloud segmentation method with a newly introduced multi-modality point affinity inference module. The point affinity proposed in this paper is characterized by features from multiple modalities (e.g., point cloud and RGB), and is further refined by normalizing the classifier weights to alleviate the detrimental effects of long-tailed distribution without the need of the prior of category distribution. Extensive experiments on the ScanNet and S3DIS benchmarks verify the effectiveness of our proposed method, which outperforms the state-of-the-art by ~4% to ~6% mIoU. Codes are released at https://github.com/Sunny599/AAAI24-3DWSSG-MMA.
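Classifier-weight normalization for long-tailed data is commonly realized as a cosine-style classifier; the sketch below shows that generic form, and whether the paper uses exactly this variant (including the scale value) is an assumption here.

```python
# Generic sketch: compute class logits with L2-normalized classifier weights so
# head classes cannot dominate simply through larger weight norms.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedClassifier(nn.Module):
    def __init__(self, feat_dim, num_classes, scale=16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.scale = scale  # assumed temperature-like scaling of the logits

    def forward(self, feats):
        w = F.normalize(self.weight, dim=1)  # unit-norm class weight vectors
        return self.scale * feats @ w.t()
```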



Paperid:357
Authors:Ximeng Li, Chen Zhang, Wanjuan Su, Wenbing Tao
National Key Laboratory of Science and Technology on Multispectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China, National Key Laboratory of Science and Technology on Multispectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China Tuke Research, National Key Laboratory of Science and Technology on Multispectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China, National Key Laboratory of Science and Technology on Multispectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China
Abstract:
Recently, there has been a growing interest in 3D CNN-based stereo matching methods due to their remarkable accuracy. However, the high complexity of 3D convolution makes it challenging to strike a balance between accuracy and speed. Notably, explicit 3D volumes contain considerable redundancy. In this study, we delve into a more compact 2D implicit network to eliminate redundancy and boost real-time performance. However, simply replacing explicit 3D networks with 2D implicit networks causes issues that can lead to performance degradation, including the loss of structural information, the quality decline of inter-image information, as well as the inaccurate regression caused by low-level features. To address these issues, we first integrate intra-image information to fuse with inter-image information, facilitating propagation guided by structural cues. Subsequently, we introduce the Fast Multi-scale Score Volume (FMSV) and Confidence Based Filtering (CBF) to efficiently acquire accurate multi-scale, noise-free inter-image information. Furthermore, combined with the Residual Context-aware Upsampler (RCU), our Intra-Inter Fusing network is meticulously designed to enhance information transmission on both feature-level and disparity-level, thereby enabling accurate and robust regression. Experimental results affirm the superiority of our network in terms of both speed and accuracy compared to all other fast methods.



Paperid:358
Authors:Xiutian Li, Siqi Sun, Rui Feng
School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433 Fudan Zhangjiang Institute, Shanghai, 200120 Shanghai Collaborative Innovation Center of Intelligent Visual Computing
Abstract:
Existing causal representation learning methods are based on the causal graph they build. However, due to the omission of bias within the causal graph, they essentially encourage models to learn biased causal effects in latent space. In this paper, we propose a novel causal disentangling framework that aims to learn unbiased causal effects. We first introduce inductive and dataset biases into the traditional causal graph for the physical concepts of interest. Then, we eliminate the negative effects of these two biases by counterfactual intervention with a reweighted loss function for learning unbiased causal effects. Finally, we incorporate the causal effects into the VAE to endow the latent representations with causality. In particular, we highlight that removing biases in this paper is regarded as part of the learning process for unbiased causal effects, which is crucial for improving causal disentanglement performance. Through extensive experiments on real-world and synthetic datasets, we show that our method outperforms different baselines and obtains state-of-the-art results for causal representation learning.



Paperid:359
Authors:Yanjing Li, Sheng Xu, Mingbao Lin, Xianbin Cao, Chuanjian Liu, Xiao Sun, Baochang Zhang
Beihang University, Beihang University, Tencent, Beihang University, China, Huawei Noah's Ark Lab, Shanghai Artificial Intelligence Laboratory, Zhongguancun Laboratory Hangzhou Research Institute, Beihang University Nanchang Institute of Technology
Abstract:
Quantization of vision transformers (ViTs) offers a promising prospect to facilitate deploying large pretrained networks on resource-limited devices. Fully binarized ViTs (Bi-ViT), which push the quantization of ViTs to its limit, remain largely unexplored and pose a very challenging task due to their unacceptable performance. Through extensive empirical analyses, we identify that the severe drop in ViT binarization is caused by attention distortion in self-attention, which technically stems from gradient vanishing and ranking disorder. To address these issues, we first introduce a learnable scaling factor to reactivate the vanished gradients and illustrate its effectiveness through theoretical and experimental analyses. We then propose a ranking-aware distillation method to rectify the disordered ranking in a teacher-student framework. Bi-ViT achieves significant improvements over popular DeiT and Swin backbones in terms of Top-1 accuracy and FLOPs. For example, with DeiT-Tiny and Swin-Tiny, our method significantly outperforms baselines by 22.1% and 21.4% respectively, while achieving 61.5x and 56.1x theoretical acceleration in terms of FLOPs compared with real-valued counterparts on ImageNet. Our code and models are available at https://github.com/YanjingLi0202/Bi-ViT/ .
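As a rough illustration of a learnable scaling factor in a binarized projection, the sketch below uses a straight-through estimator so that binary weights are used in the forward pass while gradients flow to the latent full-precision weights; the layer shape, initialization, and per-row scale are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class BinaryLinearWithScale(nn.Module):
    # Binarized projection with a learnable per-output scale. The sign() forward
    # pass uses a straight-through estimator so gradients reach the latent
    # full-precision weights; this is a generic sketch, not the Bi-ViT layer.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)
        self.alpha = nn.Parameter(torch.ones(out_dim, 1))  # learnable scaling factor

    def forward(self, x):
        w_bin = torch.sign(self.weight)
        # straight-through estimator: binary values forward, dense gradients backward
        w = w_bin.detach() + self.weight - self.weight.detach()
        return x @ (self.alpha * w).t()
```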



Paperid:360
Authors:Yanxi Li, Chengbin Du, Chang Xu
University of Sydney, University of Sydney, University of Sydney
Abstract:
Deep Neural Networks (DNNs) have demonstrated remarkable accuracy in vision classification tasks. However, they exhibit vulnerability to added noise known as adversarial attacks. Previous studies hypothesize that this vulnerability might stem from the fact that high-accuracy DNNs heavily rely on irrelevant and non-robust features, such as textures and the background. In this work, we reveal that edge information extracted from images can provide relevant and robust features related to shapes and the foreground. These features assist pretrained DNNs in achieving improved adversarial robustness without compromising their accuracy on clean images. A lightweight and plug-and-play EdgeNet is proposed, which can be seamlessly integrated into existing pretrained DNNs, including Vision Transformers, a recent family of state-of-the-art models for vision classification. Our EdgeNet can process edges derived from either clean natural images or noisy adversarial images, yielding robust features which can be injected into the intermediate layers of the frozen backbone DNNs. The cost of obtaining such edges using conventional edge detection algorithms (e.g., the Canny edge detector) is marginal, and the cost of training the EdgeNet is equivalent to that of fine-tuning the backbone network with techniques such as Adapter.
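A minimal sketch of how edge features could be injected into a frozen transformer backbone: Canny edges are extracted with OpenCV, projected into token space by a small convolution, and added to intermediate tokens. The patch size, projection layer, additive injection, and the assumption that the token sequence excludes the class token are all illustrative choices, not the EdgeNet architecture itself.

```python
import cv2
import torch
import torch.nn as nn

class EdgeBranch(nn.Module):
    # Lightweight edge branch: Canny edges -> patch projection -> added to the
    # frozen backbone's intermediate tokens. Layer sizes, the additive injection,
    # and the token layout (no class token) are illustrative assumptions.
    def __init__(self, feat_dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(1, feat_dim, kernel_size=patch, stride=patch)

    def forward(self, image_bgr, backbone_tokens):
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)                        # (H, W) uint8 edge map
        e = torch.from_numpy(edges).float()[None, None] / 255.0  # (1, 1, H, W)
        edge_tokens = self.proj(e).flatten(2).transpose(1, 2)    # (1, N, feat_dim)
        return backbone_tokens + edge_tokens                     # inject into frozen features
```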



Paperid:361
Authors:Yiming Li, Peng Zhou, Jun Sun, Yi Xu
Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China Mobile (Suzhou) Software Technology Co., Ltd, China, Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University, Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Abstract:
Text-guided image manipulation has attracted significant attention recently. Prevailing techniques concentrate on image attribute editing for individual objects; however, they encounter challenges when it comes to multi-object editing. The main reason is the lack of consistency constraints on the spatial layout. This work presents a multi-region guided image manipulation framework, enabling manipulation through region-level textual prompts. With MultiDiffusion as a baseline, we are dedicated to the automatic generation of a rational multi-object spatial distribution, where disparate regions are fused as a unified entity. To mitigate interference from regional fusion, we employ an off-the-shelf model (CLIP) to impose region-aware spatial guidance on multi-object manipulation. Moreover, when applied to Stable Diffusion, the presence of quality-related yet object-agnostic lengthy words hampers the manipulation. To ensure focus on meaningful object-specific words for efficient guidance and generation, we introduce a keyword selection method. Furthermore, we demonstrate a downstream application of our method for multi-region inversion, which is tailored for manipulating multiple objects in real images. Our approach, compatible with variants of Stable Diffusion models, is readily applicable for manipulating diverse objects in extensive images with high-quality generation, showing superb image control capabilities. Code is available at https://github.com/liyiming09/multi-region-guided-diffusion.



Paperid:362
Authors:Yuelong Li, Tengfei Xiao, Lei Geng, Jianming Wang
School of Artificial Intelligence, Tiangong University Tianjin Key Laboratory of Autonomous Intelligence Technology and Systems, Tiangong University, School of Software, Tiangong University, School of Life Sciences, Tiangong University, Tianjin Key Laboratory of Autonomous Intelligence Technology and Systems, Tiangong University
Abstract:
Pose diversity is an inherent representative characteristic of 2D images. Due to the 3D-to-2D projection mechanism, there is evident content discrepancy among distinct pose images. This is the main obstacle hindering pose-transformation research. To deal with this challenge, we propose a fine-grained incremental evolution centered pose generation framework, rather than the traditional direct one-to-one mapping. Since the proposed approach bypasses the theoretical difficulty of directly modeling dramatic non-linear variation, the incurred content distortion and blurring can be effectively constrained, while at the same time the various individual pose details, especially clothes texture, can be precisely maintained. In order to systematically guide the evolution course, both global and incremental evolution constraints are elaborately designed and merged into the overall framework. A novel triple-path knowledge fusion structure is also designed to take full advantage of all available valuable knowledge to conduct high-quality pose synthesis. In addition, our framework can generate a series of valuable by-products, namely the various intermediate poses. Extensive experiments have been conducted to verify the effectiveness of the proposed approach. Code is available at https://github.com/Xiaofei-CN/Incremental-Evolution-Pose-Generation.



Paperid:363
Authors:Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou, Bingbing Ni
Shanghai Jiao Tong University, Huawei, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
While text-3D editing has made significant strides in leveraging score distillation sampling, emerging approaches still fall short in delivering separable, precise and consistent outcomes that are vital to content creation. In response, we introduce FocalDreamer, a framework that merges base shape with editable parts according to text prompts for fine-grained editing within desired regions. Specifically, equipped with geometry union and dual-path rendering, FocalDreamer assembles independent 3D parts into a complete object, tailored for convenient instance reuse and part-wise control. We propose geometric focal loss and style consistency regularization, which encourage focal fusion and congruent overall appearance. Furthermore, FocalDreamer generates high-fidelity geometry and PBR textures which are compatible with widely-used graphics engines. Extensive experiments have highlighted the superior editing capabilities of FocalDreamer in both quantitative and qualitative evaluations.



Paperid:364
Authors:Zekun Li, Hongying Liu, Fanhua Shang, Yuanyuan Liu, Liang Wan, Wei Feng
Xidian University, Tianjin University, Tianjin University, Xidian University, Tianjin University, Tianjin University
Abstract:
Deep learning-based video super-resolution (VSR) networks have gained significant performance improvements in recent years. However, existing VSR networks can only support a fixed integer scale super-resolution task, and when we want to perform VSR at multiple scales, we need to train several models. This implementation certainly increases the consumption of computational and storage resources, which limits the application scenarios of VSR techniques. In this paper, we propose a novel Scale-adaptive Arbitrary-scale Video Super-Resolution network (SAVSR), which is the first work focusing on spatial VSR at arbitrary scales including both non-integer and asymmetric scales. We also present an omni-dimensional scale-attention convolution, which dynamically adapts according to the scale of the input to extract inter-frame features with stronger representational power. Moreover, the proposed spatio-temporal adaptive arbitrary-scale upsampling performs VSR tasks using both temporal features and scale information. In addition, we design an iterative bi-directional architecture for implicit feature alignment. Experiments at various scales on the benchmark datasets show that the proposed SAVSR outperforms state-of-the-art (SOTA) methods at non-integer and asymmetric scales. The source code is available at https://github.com/Weepingchestnut/SAVSR.



Paperid:365
Authors:Zepeng Li, Dongxiang Zhang, Sai Wu, Mingli Song, Gang Chen
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
Multi-Object Tracking (MOT) is a cornerstone operator for video surveillance applications. To enable real-time processing of large-scale live video streams, we study an interesting scenario called down-sampled MOT, which performs object tracking only on a small subset of video frames. The problem is challenging for state-of-the-art MOT methods, which exhibit significant performance degradation under high frame reduction ratios. In this paper, we devise a sampling-resilient tracker with a novel sparse-observation Kalman filter (SOKF). It integrates an LSTM network to capture non-linear and dynamic motion patterns caused by sparse observations. Since the LSTM-based state transition is not compatible with the original noise estimation mechanism, we propose new estimation strategies based on Bayesian neural networks and derive the optimal Kalman gain for SOKF. To associate the detected bounding boxes robustly, we also propose a comprehensive similarity metric that systematically integrates multiple spatial matching signals. Experiments on three benchmark datasets show that our proposed tracker achieves the best trade-off between efficiency and accuracy. With the same tracking accuracy, we reduce the total processing time of ByteTrack by 2× in MOT17 and 3× in DanceTrack.
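To sketch the idea of replacing the linear state transition of a Kalman filter with an LSTM under sparse observations, the toy module below predicts the next state with an LSTM cell and applies a standard Kalman-gain correction; the dimensions, the fixed observation matrix, and the omitted covariance update are simplifying assumptions, not the SOKF formulation.

```python
import torch
import torch.nn as nn

class LSTMKalmanSketch(nn.Module):
    # Toy filter: an LSTM cell replaces the linear state transition, followed by
    # a standard Kalman-gain correction. The covariance update is omitted and all
    # dimensions are illustrative assumptions, not the paper's derivation.
    def __init__(self, state_dim=8, obs_dim=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTMCell(state_dim, hidden)
        self.to_state = nn.Linear(hidden, state_dim)
        self.register_buffer("H", torch.eye(obs_dim, state_dim))  # fixed observation model

    def forward(self, x, hc, z, P, R):
        hc = self.lstm(x, hc)                                # non-linear state transition
        x_pred = self.to_state(hc[0])                        # (1, state_dim) predicted state
        S = self.H @ P @ self.H.t() + R                      # innovation covariance
        K = P @ self.H.t() @ torch.linalg.inv(S)             # Kalman gain
        innovation = z - x_pred @ self.H.t()                 # (1, obs_dim)
        x_new = x_pred + (K @ innovation.t()).t()            # corrected state
        return x_new, hc
```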



Paperid:366
Authors:Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang
School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology Institute of Artificial Intelligence, Hefei Comprehensive National Science Center Anhui Zhonghuitong Technology Co., Ltd, School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract:
This paper focuses on the Audio-Visual Question Answering (AVQA) task that aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and explore the multi-modal relations (i.e., the object, audio, and question) in terms of feature interaction and model optimization. For the former, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate audio/visual modalities on respective keywords of the question and designs a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the highly semantic-matched multi-modal pair as positivity. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positivity-selecting process is adaptive as the positivity pairs selected in each video frame may be different. These two object-aware objectives help the model understand which objects are exactly relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance. The code is available at https://github.com/zhangbin-ai/APL.



Paperid:367
Authors:Sen Liang, Kai Zhu, Wei Zhai, Zhiheng Liu, Yang Cao
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract:
Video class-incremental learning aims to recognize new actions while restricting the catastrophic forgetting of old ones, whose representative samples can only be saved in limited memory. Semantically variable sub-actions are susceptible to class confusion due to data imbalance. While existing methods address the problem by estimating and distilling the spatio-temporal knowledge, we further observe that the refinement of hierarchical correlations is crucial for the alignment of spatio-temporal features. To enhance the adaptability to evolved actions, we propose a hierarchical aggregation strategy, in which hierarchical matching matrices are combined and jointly optimized to selectively store and retrieve relevant features from previous tasks. Meanwhile, a correlation refinement mechanism is presented to reinforce the bias on informative exemplars according to the online hypercorrelation distribution. Experimental results demonstrate the effectiveness of the proposed method on three standard video class-incremental learning benchmarks, outperforming state-of-the-art methods. Code is available at: https://github.com/Lsen991031/HCE



Paperid:368
Authors:Yaoyuan Liang, Xiao Liang, Yansong Tang, Zhao Yang, Ziran Li, Jingang Wang, Wenbo Ding, Shao-Lun Huang
Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University, University of Oxford, Meituan, Meituan, Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University
Abstract:
This paper studies the spatio-temporal video grounding task, which aims to localize a spatio-temporal tube in an untrimmed video based on the given text description of an event. Existing one-stage approaches suffer from insufficient space-time interaction in two aspects: i) less precise prediction of event temporal boundaries, and ii) inconsistency in object prediction for the same event across adjacent frames. To address these issues, we propose a framework of Comprehensive Space-Time entAnglement (CoSTA) to densely entangle space-time multi-modal features for spatio-temporal localization. Specifically, we propose a space-time collaborative encoder to extract comprehensive video features and leverage Transformer to perform spatio-temporal multi-modal understanding. Our entangled decoder couples temporal boundary prediction and spatial localization via an entangled query, boasting an enhanced ability to capture object-event relationships. We conduct extensive experiments on the challenging benchmarks of HC-STVG and VidSTG, where CoSTA outperforms existing state-of-the-art methods, demonstrating its effectiveness for this task.



Paperid:369
Authors:Zhaohuai Liang, Changhe Li
School of Automation, China University of Geosciences, Wuhan 430074, China, School of Artificial Intelligence, Anhui University of Science & Technology, Hefei 232001, China
Abstract:
Due to unaffordable computational costs, the regularized disparity in iterative stereo matching is typically maintained at a lower resolution than the input. To regress the full resolution disparity, most stereo methods resort to convolutions to decode a fixed-scale output. However, they are inadequate for recovering vital high-frequency information lost during downsampling, limiting their performance on full-resolution prediction. In this paper, we introduce AnyStereo, an accurate and efficient disparity upsampling module with implicit neural representation for the iterative stereo pipeline. By modeling the disparity as a continuous representation over 2D spatial coordinates, subtle details can emerge from the latent space at arbitrary resolution. To further complement the missing information and details in the latent code, we propose two strategies: intra-scale similarity unfolding and cross-scale feature alignment. The former unfolds the neighbor relationships, while the latter introduces the context in high-resolution feature maps. The proposed AnyStereo can seamlessly replace the upsampling module in most iterative stereo models, improving their ability to capture fine details and generate arbitrary-scale disparities even with fewer parameters. With our method, the iterative stereo pipeline establishes a new state-of-the-art performance. The code is available at https://github.com/Zhaohuai-L/Any-Stereo.
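The implicit upsampling idea can be pictured as querying an MLP at arbitrary coordinates with a latent code sampled from the low-resolution feature map, as in the sketch below; the bilinear latent sampling, coordinate encoding, and MLP width are assumptions rather than the AnyStereo head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitDisparityHead(nn.Module):
    # Implicit upsampling sketch: for each full-resolution pixel, sample a latent
    # code from the low-resolution feature map and decode disparity from
    # (latent, normalized coordinate). All sizes are illustrative assumptions.
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, low_res_feat, coords):
        # coords: (B, H, W, 2) target coordinates normalized to [-1, 1]
        latent = F.grid_sample(low_res_feat, coords, align_corners=True)  # (B, C, H, W)
        latent = latent.permute(0, 2, 3, 1)                               # (B, H, W, C)
        return self.mlp(torch.cat([latent, coords], dim=-1)).squeeze(-1)  # (B, H, W)
```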



Paperid:370
Authors:Dongping Liao, Xitong Gao, Chengzhong Xu
State Key Lab of IoTSC, Department of Computer and Information Science, University of Macau, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, University of Macau
Abstract:
Data-Free Knowledge Distillation (DFKD) enables knowledge transfer from a pretrained teacher to a lightweight student without the original training data. Existing works are limited by a strong assumption that the samples used to pretrain the teacher model are balanced, which is, however, unrealistic for many real-world tasks. In this work, we investigate a pragmatic yet under-explored problem: how to perform DFKD from a teacher model pretrained on imbalanced data. We observe a seemingly counter-intuitive phenomenon, i.e., adversarial DFKD algorithms favour minority classes, while causing a disastrous impact on majority classes. We theoretically prove that a biased teacher can cause severe disparity among different groups of synthetic data in adversarial distillation, which further exacerbates the mode collapse of the generator and consequently degrades the overall accuracy of the distilled student model. To tackle this problem, we propose a class-adaptive regularization method, aiming to encourage impartial representation learning of the generator among different classes under a constrained learning formulation. We devise a primal-dual algorithm to solve the target optimization problem. Through extensive experiments, we show that our method mitigates the biased learning of majority classes in DFKD and improves the overall performance compared with baselines. Code will be available at https://github.com/ldpbuaa/ipad.



Paperid:371
Authors:Guibiao Liao, Jiankun Li, Xiaoqing Ye
School of Electronic and Computer Engineering, Peking University Peng Cheng Laboratory, Baidu Inc., Baidu Inc.
Abstract:
Vision and language foundation models (VLMs) have showcased impressive capabilities in 2D scene understanding. However, their latent potential for elevating the understanding of 3D autonomous driving scenes remains untapped. In this paper, we propose VLM2Scene, which exploits the potential of VLMs to enhance 3D self-supervised representation learning through our proposed image-text-LiDAR contrastive learning strategy. Specifically, in the realm of autonomous driving scenes, the inherent sparsity of LiDAR point clouds poses a notable challenge for point-level contrastive learning methods. Such methods often grapple with limitations tied to a restricted receptive field and the presence of noisy points. To tackle this challenge, our approach emphasizes region-level learning, leveraging regional masks without semantics derived from the vision foundation model. This approach capitalizes on valuable contextual information to enhance the learning of point cloud representations. First, we introduce Region Caption Prompts to generate fine-grained language descriptions for the corresponding regions, utilizing the language foundation model. These region prompts then facilitate the establishment of positive and negative text-point pairs within the contrastive loss framework. Second, we propose a Region Semantic Concordance Regularization, which involves a semantic-filtered region learning and a region semantic assignment strategy. The former aims to filter the false negative samples based on the semantic distance, and the latter mitigates potential inaccuracies in pixel semantics, thereby enhancing overall semantic consistency. Extensive experiments on representative autonomous driving datasets demonstrate that our self-supervised method significantly outperforms other counterparts. Codes are available at https://github.com/gbliao/VLM2Scene.



Paperid:372
Authors:Jiayi Liao, Xu Chen, Qiang Fu, Lun Du, Xiangnan He, Xiang Wang, Shi Han, Dongmei Zhang
University of Science and Technology of China, Microsoft, Microsoft, Microsoft, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Microsoft, Microsoft
Abstract:
Recent years have witnessed the substantial progress of large-scale models across various domains, such as natural language processing and computer vision, facilitating the expression of concrete concepts. Unlike concrete concepts that are usually directly associated with physical objects, expressing abstract concepts through natural language requires considerable effort since they are characterized by intricate semantics and connotations. An alternative approach is to leverage images to convey rich visual information as a supplement. Nevertheless, existing Text-to-Image (T2I) models are primarily trained on concrete physical objects and often struggle to visualize abstract concepts. Inspired by the three-layer artwork theory that identifies critical factors, intent, object and form during artistic creation, we propose a framework of Text-to-Image generation for Abstract Concepts (TIAC). The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity. LLMs then transform it into semantic-related physical objects, and the concept-dependent form is retrieved from an LLM-extracted form pattern set. Information from these three aspects will be integrated to generate prompts for T2I models via LLM. Evaluation results from human assessments and our newly designed metric concept score demonstrate the effectiveness of our framework in creating images that can sufficiently express abstract concepts.



Paperid:373
Authors:Tangfei Liao, Xiaoqin Zhang, Li Zhao, Tao Wang, Guobao Xiao
Wenzhou University, Wenzhou University, Wenzhou University, Nanjing University, Tongji University
Abstract:
Correspondence pruning aims to find correct matches (inliers) from an initial set of putative correspondences, which is a fundamental task for many applications. This finding process is challenging, given the varying inlier ratios between scenes/image pairs due to significant visual differences. However, the performance of existing methods is usually limited by the lack of visual cues (e.g., texture, illumination, structure) of scenes. In this paper, we propose a Visual-Spatial Fusion Transformer (VSFormer) to identify inliers and recover camera poses accurately. Firstly, we obtain highly abstract visual cues of a scene with cross attention between the local features of two-view images. Then, we model these visual cues and correspondences by a joint visual-spatial fusion module, simultaneously embedding visual cues into correspondences for pruning. Additionally, to mine the consistency of correspondences, we also design a novel module that combines a KNN-based graph and the transformer, effectively capturing both local and global contexts. Extensive experiments have demonstrated that the proposed VSFormer outperforms state-of-the-art methods on outdoor and indoor benchmarks. Our code is provided at the following repository: https://github.com/sugar-fly/VSFormer.



Paperid:374
Authors:Beibei Lin, Yeying Jin, Wending Yan, Wei Ye, Yuan Yuan, Shunli Zhang, Robby T. Tan
National University of Singapore, National University of Singapore, Huawei International Pte Ltd, Huawei International Pte Ltd, Huawei International Pte Ltd, Beijing Jiaotong University, National University of Singapore
Abstract:
Existing deep-learning-based methods for nighttime video deraining rely on synthetic data due to the absence of real-world paired data. However, the intricacies of the real world, particularly with the presence of light effects and low-light regions affected by noise, create significant domain gaps, hampering synthetic-trained models in removing rain streaks properly and leading to over-saturation and color shifts. Motivated by this, we introduce NightRain, a novel nighttime video deraining method with adaptive-rain-removal and adaptive-correction. Our adaptive-rain-removal uses unlabeled rain videos to enable our model to derain real-world rain videos, particularly in regions affected by complex light effects. The idea is to allow our model to obtain rain-free regions based on the confidence scores. Once rain-free regions and the corresponding regions from our input are obtained, we can have region-based paired real data. These paired data are used to train our model using a teacher-student framework, allowing the model to iteratively learn from less challenging regions to more challenging regions. Our adaptive-correction aims to rectify errors in our model's predictions, such as over-saturation and color shifts. The idea is to learn from clear night input training videos based on the differences or distance between those input videos and their corresponding predictions. Our model learns from these differences, compelling our model to correct the errors. From extensive experiments, our method demonstrates state-of-the-art performance. It achieves a PSNR of 26.73 dB, surpassing existing nighttime video deraining methods by a substantial margin of 13.7%.



Paperid:375
Authors:Huangxing Lin, Yuhang Dong, Xinghao Ding, Tianpeng Liu, Yongxiang Liu
National University of Defense Technology, Xiamen University, Xiamen University, National University of Defense Technology, National University of Defense Technology
Abstract:
Pan-sharpening is a task that aims to super-resolve the low-resolution multispectral (LRMS) image with the guidance of a corresponding high-resolution panchromatic (PAN) image. The key challenge in pan-sharpening is to accurately model the relationship between the MS and PAN images. While supervised deep learning methods are commonly employed to address this task, the unavailability of ground truth severely limits their effectiveness. In this paper, we propose a mutually guided detail restoration method for unsupervised pan-sharpening. Specifically, we treat pan-sharpening as a blind image deblurring task, in which the blur kernel can be estimated by a CNN. Constrained by the blur kernel, the pan-sharpened image retains spectral information consistent with the LRMS image. Once the pan-sharpened image is obtained, the PAN image is blurred using a pre-defined blur operator. The pan-sharpened image, in turn, is used to guide the detail restoration of the blurred PAN image. By leveraging the mutual guidance between the MS and PAN images, the pan-sharpening network can implicitly learn the spatial relationship between the two modalities. Extensive experiments show that the proposed method significantly outperforms existing unsupervised pan-sharpening methods.
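One way to picture the spectral-consistency constraint is the loss sketched below: the pan-sharpened image is blurred with the estimated kernel, downsampled, and compared against the LRMS input. The depthwise convolution, bilinear downsampling, and L1 distance are assumptions made for illustration, not the paper's exact objective.

```python
import torch.nn.functional as F

def spectral_consistency_loss(pansharpened, lrms, kernel, scale=4):
    # Blur the pan-sharpened result with the estimated kernel (assumed shape
    # (1, 1, k, k)), downsample to the LRMS resolution, and penalize the gap.
    c = pansharpened.shape[1]
    blurred = F.conv2d(pansharpened, kernel.expand(c, 1, -1, -1),
                       padding=kernel.shape[-1] // 2, groups=c)   # depthwise blur
    down = F.interpolate(blurred, scale_factor=1 / scale,
                         mode="bilinear", align_corners=False)
    return F.l1_loss(down, lrms)
```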



Paperid:376
Authors:Hui Lin, Zhiheng Ma, Xiaopeng Hong, Qinnan Shangguan, Deyu Meng
Xi’an Jiaotong University, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Harbin Institute of Technology Peng Cheng Laboratory, Harbin Institute of Technology, Xi'an Jiaotong University
Abstract:
Transformer has been popular in recent crowd counting work since it breaks the limited receptive field of traditional CNNs. However, since crowd images always contain a large number of similar patches, the self-attention mechanism in Transformer tends to find a homogenized solution where the attention maps of almost all patches are identical. In this paper, we address this problem by proposing Gramformer: a graph-modulated transformer that enhances the network by adjusting the attention and input node features, respectively, on the basis of two different types of graphs. Firstly, an attention graph is proposed to diversify attention maps so that they attend to complementary information. The graph is built upon the dissimilarities between patches, modulating the attention in an anti-similarity fashion. Secondly, a feature-based centrality encoding is proposed to discover the centrality positions or importance of nodes. We encode them with a proposed centrality indices scheme to modulate the node features and similarity relationships. Extensive experiments on four challenging crowd counting datasets have validated the competitiveness of the proposed method. Code is available at https://github.com/LoraLinH/Gramformer.
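A toy version of attention modulated by a dissimilarity graph is sketched below: patch-wise (1 - cosine similarity) is added as a bias to the attention scores so that patches attend to complementary content. The additive bias form and the single-head, unbatched shapes are assumptions, not Gramformer's exact modulation.

```python
import torch
import torch.nn.functional as F

def anti_similarity_attention(q, k, v, patch_feats, beta=1.0):
    # q, k, v: (N, d) single-head, unbatched projections; patch_feats: (N, D).
    # Bias pre-softmax scores toward dissimilar patches (an assumed bias form).
    attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5   # (N, N) raw scores
    f = F.normalize(patch_feats, dim=-1)
    dissim = 1.0 - f @ f.t()                                # (N, N) 1 - cosine similarity
    attn = torch.softmax(attn + beta * dissim, dim=-1)
    return attn @ v
```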



Paperid:377
Authors:Jianghang Lin, Yunhang Shen, Bingquan Wang, Shaohui Lin, Ke Li, Liujuan Cao
Xiamen University, Tencent, Xiamen University, East China Normal University, Tencent, Xiamen University
Abstract:
Despite weakly supervised object detection (WSOD) being a promising step toward evading strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is leveraged into dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, a customized location-oriented weakly supervised region proposal network is proposed to utilize high-level semantic layouts from the category-agnostic segment anything model to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art compared with previous WSOD methods in both close-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning to achieve on-par or even better performance than well-established fully-supervised open-vocabulary object detection (FSOVOD).



Paperid:378
Authors:Jieru Lin, Danqing Huang, Tiejun Zhao, Dechen Zhan, Chin-Yew Lin
Harbin Institute of Technology, Microsoft Research, Harbin Institute of Technology, Harbin Institute of Technology, Microsoft Research
Abstract:
Layout generation is a critical step in graphic design to achieve meaningful compositions of elements. Most previous works view it as a sequence generation problem by concatenating element attribute tokens (i.e., category, size, position). So far the autoregressive approach (AR) has achieved promising results, but is still limited in global context modeling and suffers from error propagation since it can only attend to the previously generated tokens. Recent non-autoregressive attempts (NAR) have shown competitive results, which provides a wider context range and the flexibility to refine with iterative decoding. However, current works only use simple heuristics to recognize erroneous tokens for refinement, which is inaccurate. This paper first conducts an in-depth analysis to better understand the difference between the AR and NAR framework. Furthermore, based on our observation that pixel space is more sensitive in capturing spatial patterns of graphic layouts (e.g., overlap, alignment), we propose a learning-based locator to detect erroneous tokens which takes the wireframe image rendered from the generated layout sequence as input. We show that it serves as a complementary modality to the element sequence in object space and contributes greatly to the overall performance. Experiments on two public datasets show that our approach outperforms both AR and NAR baselines. Extensive studies further prove the effectiveness of different modules with interesting findings. Our code will be available at https://github.com/ffffatgoose/SpotError.



Paperid:379
Authors:Jinhao Lin, Ziheng Wu, Weifeng Lin, Jun Huang, RongHua Luo
South China University of Technology Alibaba Group, Alibaba Group, South China University of Technology Alibaba Group, Alibaba Group, South China University of Technology
Abstract:
Few-shot class-incremental learning (FSCIL) is a challenging task in machine learning that aims to recognize new classes from a limited number of instances while preserving the ability to classify previously learned classes without retraining the entire model. This presents challenges in updating the model with new classes using limited training data, particularly in balancing the acquisition of new knowledge with the retention of the old. We propose a novel method named Multiple Mixing Self-Distillation (M2SD), applied during the training phase, to address these issues. Specifically, we propose a dual-branch structure that facilitates the expansion of the entire feature space to accommodate new classes. Furthermore, we introduce a feature enhancement component that can pass additional enhanced information back to the base network by self-distillation, resulting in improved classification performance upon adding new classes. After training, we discard both structures, leaving only the primary network to classify new class instances. Extensive experiments demonstrate that our approach achieves superior performance over previous state-of-the-art methods.



Paperid:380
Authors:Longzhong Lin, Xuewu Lin, Tianwei Lin, Lichao Huang, Rong Xiong, Yue Wang
Zhejiang University Horizon Robotics, Horizon Robotics, Horizon Robotics, Horizon Robotics, Zhejiang University, Zhejiang University
Abstract:
Motion prediction is a crucial task in autonomous driving, and one of its major challenges lies in the multimodality of future behaviors. Many successful works have utilized mixture models, which require identification of positive mixture components, and correspondingly fall into two main lines: prediction-based and anchor-based matching. The prediction clustering phenomenon in prediction-based matching makes it difficult to pick representative trajectories for downstream tasks, while anchor-based matching suffers from a limited regression capability. In this paper, we introduce a novel paradigm, named Evolving and Distinct Anchors (EDA), to define the positive and negative components for multimodal motion prediction based on mixture models. We enable anchors to evolve and redistribute themselves under specific scenes for an enlarged regression capacity. Furthermore, we select distinct anchors before matching them with the ground truth, which results in impressive scoring performance. Our approach enhances all metrics compared to the baseline MTR, particularly with a notable relative reduction of 13.5% in Miss Rate, resulting in state-of-the-art performance on the Waymo Open Motion Dataset. The appendix and code are available at https://github.com/Longzhong-Lin/EDA.



Paperid:381
Authors:Luoyang Lin, Zutao Jiang, Xiaodan Liang, Liqian Ma, Michael C. Kampffmeyer, Xiaochun Cao
Shenzhen Campus of Sun Yat-sen University, Mohamed bin Zayed University of Artificial Intelligence, Shenzhen Campus of Sun Yat-sen University Mohamed bin Zayed University of Artificial Intelligence DarkMatter AI Research, Guangzhou, China, ZMO AI Inc., UiT The Arctic University of Norway, Shenzhen Campus of Sun Yat-sen University
Abstract:
Talking upper-body synthesis is a promising task due to its versatile potential for video creation; it consists of animating the body and face from a source image with the motion from a given driving video. However, prior synthesis approaches fall short in addressing this task and have either been limited to animating only the head of a target person, or have animated the upper body but neglected the synthesis of precise facial details. To tackle this task, we propose a Photo-realistic Talking Upper-body Synthesis method via 3D-aware motion decomposition warping, named PTUS, to both precisely synthesize the upper body and recover facial details such as blinking and lip synchronization. In particular, the motion decomposition mechanism consists of a face-body motion decomposition, which decouples the 3D motion estimation of the face and body, and a local-global motion decomposition, which decomposes the 3D face motion into global and local motions, resulting in the transfer of facial expressions. The 3D-aware warping module transfers the large-scale and subtle 3D motions to the extracted 3D depth-aware features in a coarse-to-fine manner. Moreover, we present a new dataset, Talking-UB, which includes upper-body images with high-resolution faces, addressing the limitations of prior datasets that consist of either only facial images or upper-body images with blurry faces. Experimental results demonstrate that our proposed method can synthesize high-quality videos that preserve facial details, and achieves superior results compared to state-of-the-art cross-person motion transfer approaches. The code and collected dataset are released at https://github.com/cooluoluo/PTUS.



Paperid:382
Authors:Matthieu Lin, Jenny Sheng, Yubin Hu, Yangguang Li, Lu Qi, Andrew Zhao, Gao Huang, Yong-Jin Liu
BNRist, Department of Computer Science and Technology, Tsinghua University, BNRist, Department of Computer Science and Technology, Tsinghua University, BNRist, Department of Computer Science and Technology, Tsinghua University, SenseTime Group Limited, The University of California, Merced, BNRist, Department of Automation, Tsinghua University, BNRist, Department of Automation, Tsinghua University, BNRist, Department of Computer Science and Technology, Tsinghua University
Abstract:
This paper tackles the problem of efficient and stable video semantic segmentation. While stability has been underexplored, prevalent work in efficient video semantic segmentation uses the keyframe paradigm. They efficiently process videos by only recomputing the low-level features and reusing high-level features computed at selected keyframes. In addition, the reused features stabilize the predictions across frames, thereby improving video consistency. However, dynamic scenes in the video can easily lead to misalignments between reused and recomputed features, which hampers performance. Moreover, relying on feature reuse to improve prediction consistency is brittle; an erroneous alignment of the features can easily lead to unstable predictions. Therefore, the keyframe paradigm exhibits a dilemma between stability and performance. We address this efficiency and stability challenge using a novel yet simple Temporal Feature Correlation (TFC) module. It uses the cosine similarity between two frames’ low-level features to inform the semantic label’s consistency across frames. Specifically, we selectively reuse label-consistent features across frames through linear interpolation and update others through sparse multi-scale deformable attention. As a result, we no longer directly reuse features to improve stability and thus effectively solve feature misalignment. This work provides a significant step towards efficient and stable video semantic segmentation. On the VSPW dataset, our method significantly improves the prediction consistency of image-based methods while being as fast and accurate.
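The TFC-style selection can be pictured as follows: where the cosine similarity of two frames' low-level features exceeds a threshold, the high-level features are linearly interpolated across frames, and elsewhere the recomputed features are kept. The threshold, fixed interpolation weight, and aligned feature shapes are assumptions for illustration only.

```python
import torch.nn.functional as F

def temporal_feature_correlation(prev_low, curr_low, prev_high, curr_high,
                                 tau=0.9, alpha=0.5):
    # prev_low/curr_low: (B, C, H, W) low-level features of adjacent frames;
    # prev_high/curr_high: (B, C', H, W) high-level features at the same resolution
    # (an assumption). Reuse (interpolate) only where frames agree.
    sim = F.cosine_similarity(prev_low, curr_low, dim=1, eps=1e-6)   # (B, H, W)
    consistent = (sim > tau).unsqueeze(1).float()                    # (B, 1, H, W)
    fused = alpha * prev_high + (1 - alpha) * curr_high              # linear interpolation
    return consistent * fused + (1 - consistent) * curr_high
```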



Paperid:383
Authors:Qinliang Lin, Cheng Luo, Zenghao Niu, Xilin He, Weicheng Xie, Yuanbo Hou, Linlin Shen, Siyang Song
Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University, Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University, Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University, Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University, Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University Shenzhen Institute of Artificial Intelligence and Robotics for Society Guangdong Key Laboratory of Intelligent Information Processing, WAVES Research Group, Ghent University, Belgium, Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University Shenzhen Institute of Artificial Intelligence and Robotics for Society Guangdong Key Laboratory of Intelligent Information Processing, University of Leicester, UK
Abstract:
Adversarial examples generated by a surrogate model typically exhibit limited transferability to unknown target systems. To address this problem, many transferability enhancement approaches (e.g., input transformation and model augmentation) have been proposed. However, they show poor performance in attacking systems whose model genus differs from that of the surrogate model. In this paper, we propose a novel and generic attacking strategy, called Deformation-Constrained Warping Attack (DeCoWA), that can be effectively applied to cross-model-genus attacks. Specifically, DeCoWA first augments input examples via an elastic deformation, namely Deformation-Constrained Warping (DeCoW), to obtain rich local details of the augmented input. To avoid severe distortion of global semantics caused by random deformation, DeCoW further constrains the strength and direction of the warping transformation via a novel adaptive control strategy. Extensive experiments demonstrate that the transferable examples crafted by our DeCoWA on CNN surrogates can significantly hinder the performance of Transformers (and vice versa) on various tasks, including image classification, video action recognition, and audio recognition. Code is made available at https://github.com/LinQinLiang/DeCoWA.
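As a rough picture of a constrained warping augmentation, the sketch below perturbs the identity sampling grid with a bounded random offset field and resamples the image; the uniform offset and the simple magnitude bound stand in for DeCoW's elastic deformation and adaptive control, and are assumptions rather than the actual transform.

```python
import torch
import torch.nn.functional as F

def constrained_warp(x, strength=0.05):
    # x: (B, C, H, W). Build an identity sampling grid, add a small bounded
    # random offset field, and resample; the bound loosely mimics a constraint
    # on warping strength (an illustrative assumption).
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                            torch.linspace(-1, 1, w, device=x.device),
                            indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
    offset = (torch.rand_like(grid) - 0.5) * 2 * strength   # bounded random warp field
    return F.grid_sample(x, grid + offset, align_corners=True)
```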



Paperid:384
Authors:Wei Lin, Antoni B. Chan
City University of Hong Kong, City University of Hong Kong, Hong Kong
Abstract:
Existing class-agnostic counting models typically rely on a single type of prompt, e.g., box annotations. This paper aims to establish a comprehensive prompt-based counting framework capable of generating density maps for concerned objects indicated by various prompt types, such as box, point, and text. To achieve this goal, we begin by converting prompts from different modalities into prompt masks without requiring training. These masks are then integrated into a class-agnostic counting methodology for predicting density maps. Furthermore, we introduce a fixed-point inference along with an associated loss function to improve counting accuracy, all without introducing new parameters. The effectiveness of this method is substantiated both theoretically and experimentally. Additionally, a contrastive training scheme is implemented to mitigate dataset bias inherent in current class-agnostic counting datasets, a strategy whose effectiveness is confirmed by our ablation study. Our model excels in prominent class-agnostic datasets and exhibits superior performance in cross-dataset adaptation tasks.
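Fixed-point inference, abstractly, can be organized as iterating a prediction function on its own output until the density map stabilizes, as in the sketch below; the update form, iteration count, and stopping rule are assumptions, since the paper's parameter-free formulation is not reproduced here.

```python
import torch

def fixed_point_inference(f, x, d0, iters=10, tol=1e-4):
    # f: a callable taking (input features, current density map) and returning a
    # refined density map; d0: initial density estimate. Iterate until the
    # update falls below a tolerance (both are illustrative assumptions).
    d = d0
    for _ in range(iters):
        d_next = f(x, d)
        if torch.norm(d_next - d) < tol:
            break
        d = d_next
    return d
```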



Paperid:385
Authors:Weiping Lin, Zhenfeng Zhuang, Lequan Yu, Liansheng Wang
Xiamen University, Xiamen University, The University of Hong Kong, Xiamen University
Abstract:
Multiple instance learning is an effective paradigm for whole slide image (WSI) classification, where labels are only provided at the bag level. However, instance-level prediction is also crucial, as it offers insights into fine-grained regions of interest. Existing multiple instance learning methods either solely focus on training a bag classifier or have insufficient capability to explore instance prediction. In this work, we propose a novel model-agnostic framework to boost existing multiple instance learning models and improve WSI classification performance at both the bag and instance levels. Specifically, we propose a counterfactual inference-based sub-bag assessment method and a hierarchical instance searching strategy to help search for reliable instances and obtain their accurate pseudo labels. Furthermore, an instance classifier is well-trained to produce accurate predictions. The instance embedding it generates is treated as a prompt to refine the instance feature for bag prediction. This framework is model-agnostic, capable of adapting to existing multiple instance learning models, including those without specific mechanisms like attention. Extensive experiments on three datasets demonstrate the competitive performance of our method. Code will be available at https://github.com/centurion-crawler/CIMIL.



Paperid:386
Authors:Wenbin Lin, Chengwei Zheng, Jun-Hai Yong, Feng Xu
School of Software and BNRist, Tsinghua University, School of Software and BNRist, Tsinghua University, School of Software and BNRist, Tsinghua University, School of Software and BNRist, Tsinghua University
Abstract:
Lightweight creation of 3D digital avatars is a highly desirable but challenging task. With only sparse videos of a person under unknown illumination, we propose a method to create relightable and animatable neural avatars, which can be used to synthesize photorealistic images of humans under novel viewpoints, body poses, and lighting. The key challenge here is to disentangle the geometry, material of the clothed body, and lighting, which becomes more difficult due to the complex geometry and shadow changes caused by body motions. To solve this ill-posed problem, we propose novel techniques to better model the geometry and shadow changes. For geometry change modeling, we propose an invertible deformation field, which helps to solve the inverse skinning problem and leads to better geometry quality. To model the spatial and temporal varying shading cues, we propose a pose-aware part-wise light visibility network to estimate light occlusion. Extensive experiments on synthetic and real datasets show that our approach reconstructs high-quality geometry and generates realistic shadows under different body poses. Code and data are available at https://wenbin-lin.github.io/RelightableAvatar-page.



Paperid:387
Authors:Xin Lin, Chong Shi, Yibing Zhan, Zuopeng Yang, Yaqi Wu, Dacheng Tao
Guangzhou University, Guangzhou University, JD Explore Academy, Guangzhou University, Guangzhou University, The University of Sydney
Abstract:
Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from several issues, including 1) Contextual noise, as some frames might contain occluded and blurred objects. 2) Label bias, primarily due to the high imbalance between a few positive relationship samples and numerous negative ones. Additionally, the distribution of relationships exhibits a long-tailed pattern. To address the above problems, in this paper, we introduce a network named TD2-Net that aims at denoising and debiasing for dynamic SGG. Specifically, we first propose a denoising spatio-temporal transformer module that enhances object representation with robust contextual information. This is achieved by designing a differentiable Top-K object selector that utilizes the gumbel-softmax sampling strategy to select the relevant neighborhood for each object. Second, we introduce an asymmetrical reweighting loss to relieve the issue of label bias. This loss function integrates asymmetry focusing factors and the volume of samples to adjust the weights assigned to individual samples. Systematic experimental results demonstrate the superiority of our proposed TD2-Net over existing state-of-the-art approaches on Action Genome databases. In more detail, TD2-Net outperforms the second-best competitors by 12.7% on mean-Recall@10 for predicate classification.
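A common way to build a differentiable Top-K selector from Gumbel-Softmax is repeated sampling without replacement, sketched below; this masking scheme is a generic approximation and is only assumed to resemble the paper's selector.

```python
import torch
import torch.nn.functional as F

def gumbel_topk_select(scores, k, tau=1.0):
    # scores: (N,) relevance logits over candidate neighbours. Draw k one-hot
    # selections without replacement by masking each chosen entry with -inf;
    # hard=True keeps the forward pass discrete while gradients flow through
    # the soft samples (straight-through).
    scores = scores.clone()
    picks = []
    for _ in range(k):
        p = F.gumbel_softmax(scores, tau=tau, hard=True)          # one-hot pick
        picks.append(p)
        scores = scores.masked_fill(p.bool(), float("-inf"))      # exclude chosen entry
    return torch.stack(picks, dim=0)                              # (k, N) selection matrix
```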



Paperid:388
Authors:Youtian Lin
Nanjing University Harbin Institute of Technology
Abstract:
Rendering photorealistic dynamic scenes has been a focus of recent research, with applications in virtual and augmented reality. While the Neural Radiance Field (NeRF) has shown remarkable rendering quality for static scenes, achieving real-time rendering of dynamic scenes remains challenging due to the expensive computation over the time dimension. The incorporation of explicit representations, specifically voxel grids, has been proposed to accelerate the training and rendering of neural radiance fields with a hybrid representation. However, employing a hybrid representation for dynamic scenes results in overfitting due to fast convergence, which can produce artifacts (e.g., floaters, noisy geometry) on novel views. To address this, we propose a compact and efficient method for dynamic neural radiance fields, namely Ced-NeRF, which requires only a small number of additional parameters to construct a hybrid representation of a dynamic NeRF. Evaluations on dynamic scene datasets show that our Ced-NeRF achieves fast rendering speeds while maintaining high-quality rendering results. Our method outperforms the current state-of-the-art methods in terms of quality, training speed, and rendering speed.



Paperid:389
Authors:Yuqi Lin, Minghao Chen, Kaipeng Zhang, Hengjia Li, Mingming Li, Zheng Yang, Dongqin Lv, Binbin Lin, Haifeng Liu, Deng Cai
Zhejiang University Shanghai AI Laboratory, Hangzhou Dianzi University, Shanghai AI Laboratory, Zhejiang University, Zhejiang University, FABU Inc., Nantong Port Group, Zhejiang University, Zhejiang University, Zhejiang University FABU Inc.
Abstract:
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained to capture the global features that distinguish different text descriptions under a contrastive loss, making it highly effective for single-label classification. However, it shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class, and the contrastive nature of the softmax operation aggravates this. In this study, we observe that multi-label classification results heavily rely on discriminative local features that are overlooked by CLIP. As a result, we dissect the preservation of patch-wise spatial information in CLIP and propose a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) a dual-masking attention refinement (DMAR) module to refine the coarse scores; (3) a class-wise reidentification (CWR) module to remedy predictions from a global perspective. This framework is based solely on frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. Besides, to comprehensively assess the quality and practicality of the generated tags, we extend their application to a downstream task, i.e., weakly supervised semantic segmentation (WSSS) with generated tags as image-level pseudo labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods and validates the effectiveness of the generated tags. Our code is available at https://github.com/linyq2117/TagCLIP.
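The patch-level classification step can be pictured as scoring every spatial token of a frozen CLIP image encoder against the class text embeddings and pooling per class, as in the sketch below; the max pooling and the fixed logit scale are assumptions, and the DMAR/CWR refinement stages are not reproduced.

```python
import torch
import torch.nn.functional as F

def patch_level_scores(patch_tokens, text_embeds, logit_scale=100.0):
    # patch_tokens: (N, D) spatial tokens from a frozen image encoder;
    # text_embeds: (C, D) class text embeddings. Score each patch against each
    # class, then take the per-class maximum over patches as coarse scores.
    p = F.normalize(patch_tokens, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = logit_scale * p @ t.t()         # (N, C) per-patch class scores
    return logits.max(dim=0).values          # (C,) coarse image-level scores
```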



Paperid:390
Authors:Zhiwei Lin, Yongtao Wang, Shengxiang Qi, Nan Dong, Ming-Hsuan Yang
Peking University, Peking University, Chongqing Changan Automobile Co., Ltd, Chongqing Changan Automobile Co., Ltd, University of California at Merced
Abstract:
Existing LiDAR-based 3D object detection methods for autonomous driving scenarios mainly adopt the training-from-scratch paradigm. Unfortunately, this paradigm heavily relies on large-scale labeled data, whose collection can be expensive and time-consuming. Self-supervised pre-training is an effective and desirable way to alleviate this dependence on extensive annotated data. In this work, we present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving. Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder to learn feature representations in a BEV perspective and avoid complex decoder design during pre-training. Furthermore, we introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder with fine-tuning for masked point cloud inputs. Based on the property of outdoor point clouds in autonomous driving scenarios, i.e., that the point clouds of distant objects are sparser, we propose point density prediction to enable the 3D encoder to learn location information, which is essential for object detection. Experimental results show that BEV-MAE surpasses prior state-of-the-art self-supervised methods and achieves favorable pre-training efficiency. Furthermore, based on TransFusion-L, BEV-MAE achieves new state-of-the-art LiDAR-based 3D object detection results, with 73.6 NDS and 69.6 mAP on the nuScenes benchmark. The source code will be released at https://github.com/VDIGPKU/BEV-MAE.
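The BEV-guided masking strategy can be illustrated by assigning points to bird's-eye-view cells and dropping whole cells rather than individual points, as sketched below; the grid size, range, and masking ratio are placeholder assumptions, not the released configuration.

```python
import torch

def bev_guided_mask(points, mask_ratio=0.7, grid=0.5, pc_range=(-50.0, 50.0)):
    # points: (N, >=3) LiDAR points with x, y in pc_range. Assign points to BEV
    # cells, drop a random subset of whole cells, and return visible/masked splits.
    xy = ((points[:, :2] - pc_range[0]) / grid).long()       # (N, 2) cell indices
    n_cells = int((pc_range[1] - pc_range[0]) / grid)
    cell_id = xy[:, 0] * n_cells + xy[:, 1]                   # flat cell index per point
    unique_cells = torch.unique(cell_id)
    n_keep = int(len(unique_cells) * (1 - mask_ratio))
    keep_cells = unique_cells[torch.randperm(len(unique_cells))[:n_keep]]
    keep = torch.isin(cell_id, keep_cells)
    return points[keep], points[~keep]                        # visible, masked
```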



Paperid:391
Authors:Bo Liu, Bin Hu, Xiuli Bi, Weisheng Li, Bin Xiao
Chongqing University of Posts and Telecommunications, Chongqing University of Posts and Telecommunications, Chongqing University of Posts and Telecommunications, Chongqing University of Posts and Telecommunications, Chongqing University of Posts and Telecommunications
Abstract:
Focus stacking is a technique in computational photography that synthesizes a single all-in-focus image from images taken at different focal planes. It is difficult for previous works to produce a high-quality all-in-focus image that meets two goals: high fidelity to its source images and good visual effects without defects or abnormalities. This paper proposes a novel method based on optical imaging process analysis and modeling. Based on a foreground segmentation - diffusion elimination architecture, the foreground segmentation lets most areas of the all-in-focus image inherit information from the source images to achieve high fidelity; diffusion elimination models the physical imaging process and is specifically used to solve the transition region (TR) problem, a long-neglected issue that degrades the visual effects of synthesized images. Extensive experiments on a simulated dataset, an existing realistic dataset, and our proposed BetaFusion dataset show that our method can generate high-quality all-in-focus images by achieving both goals simultaneously; in particular, it successfully solves the TR problem and eliminates the visual degradation of synthesized images caused by it.



Paperid:392
Authors:Chao Liu, Ting Zhao, Nenggan Zheng
Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, Zhejiang, China College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China, Independent Researcher, Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, Zhejiang, China College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, Zhejiang, China CCAI by MOE and Zhejiang Provincial Government (ZJU), Hangzhou, Zhejiang, China
Abstract:
Curvilinear structures, which include line-like continuous objects, are fundamental geometrical elements in image-based applications. Reconstructing these structures from images constitutes a pivotal research area in computer vision. However, the complex topology and ambiguous image evidence render this process a challenging task. In this paper, we introduce DeepBranchTracer, a novel method that learns both external image features and internal geometric characteristics to reconstruct curvilinear structures. Firstly, we formulate curvilinear structure extraction as a geometric attribute estimation problem. Then, a curvilinear structure feature learning network is designed to extract essential branch attributes, including the image features of centerline and boundary and the geometric features of direction and radius. Finally, utilizing a multi-feature fusion tracing strategy, our model iteratively traces the entire branch by integrating the extracted image and geometric features. We extensively evaluated our model on both 2D and 3D datasets, demonstrating its superior performance over existing segmentation and reconstruction methods in terms of accuracy and continuity.



Paperid:393
Authors:Chengxu Liu, Xuan Wang, Yuanting Fan, Shuai Li, Xueming Qian
Xi'an Jiaotong University Shaanxi Yulan Jiuzhou Intelligent Optoelectronic Technology Co., Ltd, MEGVII Technology, Xi'an Jiaotong University, MEGVII Technology, Xi'an Jiaotong University Shaanxi Yulan Jiuzhou Intelligent Optoelectronic Technology Co., Ltd
Abstract:
Under-display camera (UDC) systems are the foundation of full-screen display devices, in which the lens is mounted under the display. The pixel array of light-emitting diodes used for display diffracts and attenuates incident light, causing various degradations as the light intensity changes. Unlike general video restoration, which recovers video by treating different degradation factors equally, video restoration for UDC systems is more challenging in that it concerns removing diverse degradation over time while preserving temporal consistency. In this paper, we introduce a novel video restoration network, called D2RNet, specifically designed for UDC systems. It employs a set of Decoupling Attention Modules (DAM) that effectively separate the various video degradation factors. More specifically, a soft mask generation function is proposed to decompose each frame into flare and haze based on the diffraction arising from incident light of different intensities, followed by the proposed flare and haze removal components that leverage long- and short-term feature learning to handle the respective degradations. Such a design offers a targeted and effective solution to eliminating various types of degradation in UDC systems. We further extend our design to a multi-scale version to overcome the scale changes of degradation that often occur in long-range videos. To demonstrate the superiority of D2RNet, we propose a large-scale UDC video benchmark by gathering HDR videos and generating realistically degraded videos using the point spread function measured by a commercial UDC system. Extensive quantitative and qualitative evaluations demonstrate the superiority of D2RNet compared to other state-of-the-art video restoration and UDC image restoration methods.



Paperid:394
Authors:Daizong Liu, Xiang Fang, Xiaoye Qu, Jianfeng Dong, He Yan, Yang Yang, Pan Zhou, Yu Cheng
Wangxuan Institute of Computer Technology, Peking University, Nanyang Technological University, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science of Technology, College of Computer Science and Technology, Zhejiang Gongshang University, Protagolabs Inc., Meta Platforms Inc., Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science of Technology, Department of Computer Science and Engineering, The Chinese University of Hong Kong
Abstract:
Temporal sentence localization (TSL) aims to localize a target segment in a video according to a given sentence query. Though respectable works have made decent achievements in this task, they severely rely on abundant yet expensive manual annotations for training. Moreover, these trained data-dependent models usually cannot generalize well to unseen scenarios because of the inherent domain shift. To address this issue, in this paper, we target another more practical but challenging setting: unsupervised domain adaptive temporal sentence localization (UDA-TSL), which explores whether localization knowledge can be transferred from a fully-annotated data domain (source domain) to a new unannotated data domain (target domain). Particularly, we propose an effective and novel baseline for UDA-TSL to bridge the multi-modal gap across different domains and learn the potential correspondence between video-query pairs in the target domain. We first develop separate modality-specific domain adaptation modules to smoothly balance the minimization of the domain shifts in cross-dataset video and query domains. Then, to fully exploit the semantic correspondence of both modalities in the target domain for unsupervised localization, we devise a mutual information learning module to adaptively align the video-query pairs that are more likely to be relevant in the target domain, leading to more truly aligned target pairs and ensuring the discriminability of target features. In this way, our model can learn domain-invariant and semantic-aligned cross-modal representations. Three sets of migration experiments show that our model achieves competitive performance compared to existing methods.



Paperid:395
Authors:Daizong Liu, Wei Hu
Wangxuan Institute of Computer Technology, Peking University, Wangxuan Institute of Computer Technology, Peking University
Abstract:
Deep learning models for point clouds have been shown to be vulnerable to adversarial attacks, which have received increasing attention in various safety-critical applications such as autonomous driving, robotics, and surveillance. Existing 3D attack methods generally employ global distance losses to implicitly constrain the point-wise perturbations for optimization. However, these simple losses are quite difficult to accurately measure and restrict the proper 3D geometry, as point clouds are highly structured. Although a few recent works try to exploit additional shape-aware surface knowledge to globally constrain point positions, they still fail to preserve the detailed point-to-point geometric dependency in different local regions. To this end, in this paper, we propose a novel Multi-grained Geometry-aware Attack (MGA), which explicitly captures the local topology characteristics in different 3D regions for adversarial constraint. Specifically, we first develop multi-scale spectral local filter banks that adapt to different 3D object shapes to explore potential geometric structures in local regions. Considering that objects may contain complex geometries, we then extend each filter bank into multi-layer ones to gradually capture the topology contexts of the same region in a coarse-to-fine manner. Hence, the focused local geometric structures will be highlighted in the coefficients calculated by the filtering process. At last, by restricting these coefficients between benign and adversarial samples, our MGA is able to properly measure and preserve the detailed geometry contexts in the whole 3D object with trivial perturbations. Extensive experiments demonstrate that our attack achieves superior performance on various 3D classification models, with satisfying adversarial imperceptibility and strong resistance to different defense methods.



Paperid:396
Authors:Decheng Liu, Xijun Wang, Chunlei Peng, Nannan Wang, Ruimin Hu, Xinbo Gao
School of Cyber Engineering, Xidian University, Xi’an, China Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai, China, School of Artificial Intelligence, Xidian University, Xi’an, China, School of Cyber Engineering, Xidian University, Xi’an, China Key Laboratory of Artificial Intelligence, Ministry of Education, Shanghai, China, School of Telecommunications Engineering, Xidian University, Xi’an, China, Hangzhou Institute of Technology, Xidian University, Xi’an, China, Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China
Abstract:
Adversarial attacks involve adding perturbations to the source image to cause misclassification by the target model, which demonstrates the potential of attacking face recognition models. Existing adversarial face image generation methods still cannot achieve satisfactory performance because of low transferability and high detectability. In this paper, we propose a unified framework, Adv-Diffusion, that can generate imperceptible adversarial identity perturbations in the latent space rather than the raw pixel space, which utilizes the strong inpainting capabilities of the latent diffusion model to generate realistic adversarial images. Specifically, we propose an identity-sensitive conditioned diffusion generative model to generate semantic perturbations in the surroundings. The designed adaptive strength-based adversarial perturbation algorithm can ensure both attack transferability and stealthiness. Extensive qualitative and quantitative experiments on the public FFHQ and CelebA-HQ datasets show that the proposed method achieves superior performance compared with state-of-the-art methods without an extra generative model training process. The source code is available at https://github.com/kopper-xdu/Adv-Diffusion.



Paperid:397
Authors:Fang Liu, Yuhao Liu, Jiaying Lin, Ke Xu, Rynson W.H. Lau
City University of Hong Kong, City University of Hong Kong, City University of Hong Kong, City University of Hong Kong, City University of Hong Kong
Abstract:
Recent research has shown significant interest in image-based glass surface detection (GSD). However, detecting glass surfaces in dynamic scenes remains largely unexplored due to the lack of a high-quality dataset and an effective video glass surface detection (VGSD) method. In this paper, we propose the first VGSD approach. Our key observation is that reflections frequently appear on glass surfaces, but they change dynamically as the camera moves. Based on this observation, we propose to offset the excessive dependence on a single uncertain reflection via joint modeling of temporal and spatial reflection cues. To this end, we propose VGSD-Net with two novel modules: a Location-aware Reflection Extraction (LRE) module and a Context-enhanced Reflection Integration (CRI) module, for position-aware reflection feature extraction and spatial-temporal reflection cue integration, respectively. We have also created the first large-scale video glass surface dataset (VGSD-D), consisting of 19,166 image frames with accurately-annotated glass masks extracted from 297 videos. Extensive experiments demonstrate that VGSD-Net outperforms state-of-the-art approaches adapted from related fields. Code and dataset will be available at https://github.com/fawnliu/VGSD.



Paperid:398
Authors:Hao Liu, Xin Li, Mingming Gong, Bing Liu, Yunfei Wu, Deqiang Jiang, Yinsong Liu, Xing Sun
Tencent YouTu Lab, Tencent YouTu Lab, Tencent YouTu Lab, Tencent YouTu Lab, Tencent YouTu Lab, Tencent YouTu Lab, Tencent YouTu Lab, Tencent YouTu Lab
Abstract:
Recently, the Table Structure Recognition (TSR) task, which aims at identifying table structure into machine-readable formats, has received increasing interest in the community. Despite impressive success, most single-table-component-based methods cannot perform well on unregularized table cases distracted not only by complicated inner structure but also by exterior capture distortion. In this paper, we raise this as the Complex TSR problem, where the performance degeneration of existing methods is attributable to their inefficient component usage and redundant post-processing. To mitigate it, we shift our perspective from table component extraction towards the efficient leverage of multiple components, which awaits further exploration in the field. Specifically, we propose a seminal method, termed GrabTab, equipped with a newly proposed Component Deliberator, to handle various types of tables in a unified framework. Thanks to its progressive deliberation mechanism, our GrabTab can flexibly accommodate most complex tables with reasonable components selected but without complicated post-processing involved. Quantitative experimental results on public benchmarks demonstrate that our method significantly outperforms the state-of-the-art, especially under more challenging scenes.



Paperid:399
Authors:Haoran Liu, Ying Ma, Ming Yan, Yingke Chen, Dezhong Peng, Xu Wang
College of Computer Science, Sichuan University, Chengdu, China National Innovation Center for UHD Video Technology, Chengdu, China, Faculty of Computing, Harbin Institute of Technology, Harbin, China, Centre for Frontier AI Research (CFAR), A*STAR, Singapore, Department of Computer and Information Sciences, Northumbria University, UK, College of Computer Science, Sichuan University, Chengdu, China National Innovation Center for UHD Video Technology, Chengdu, China, College of Computer Science, Sichuan University, Chengdu, China
Abstract:
Driven by generative AI and the Internet, there is an increasing availability of a wide variety of images, leading to the significant and popular task of cross-domain image retrieval. To reduce annotation costs and increase performance, this paper focuses on an untouched but challenging problem, i.e., cross-domain image retrieval with partial labels (PCIR). Specifically, PCIR faces great challenges due to the ambiguous supervision signal and the domain gap. To address these challenges, we propose a novel method called disambiguated domain alignment (DiDA) for cross-domain retrieval with partial labels. In detail, DiDA elaborates a novel prototype-score unitization learning mechanism (PSUL) to extract common discriminative representations by simultaneously disambiguating the partial labels and narrowing the domain gap. Additionally, DiDA proposes a prototype-based domain alignment mechanism (PBDA) to further bridge the inherent cross-domain discrepancy. Attributed to PSUL and PBDA, our DiDA effectively excavates domain-invariant discrimination for cross-domain image retrieval. We demonstrate the effectiveness of DiDA through comprehensive experiments on three benchmarks, comparing it to existing state-of-the-art methods. Code available: https://github.com/lhrrrrrr/DiDA.



Paperid:400
Authors:Huan Liu, Julia Qi, Zhenhao Li, Mohammad Hassanpour, Yang Wang, Konstantinos N. Plataniotis, Yuanhao Yu
Huawei Noah's Ark Laboratory, Huawei Noah's Ark Laboratory University of Waterloo, Huawei Noah's Ark Laboratory, Huawei Noah's Ark Laboratory, Concordia University, University of Toronto, Huawei Noah's Ark Laboratory
Abstract:
Despite the recent remarkable achievements in gaze estimation, efficient and accurate personalization of gaze estimation without labels is a practical problem that is rarely touched on in the literature. To achieve efficient personalization, we take inspiration from recent advances in Natural Language Processing (NLP) by updating a negligible number of parameters, "prompts", at test time. Specifically, the prompt is attached without perturbing the original network and can contain less than 1% of a ResNet18's parameters. Our experiments show the high efficiency of the prompt tuning approach: it can be 10 times faster in terms of adaptation speed than the compared methods. However, it is non-trivial to update the prompt for personalized gaze estimation without labels. At test time, it is essential to ensure that minimizing a particular unsupervised loss leads to the goal of minimizing the gaze estimation error. To address this difficulty, we propose to meta-learn the prompt to ensure that its updates align with this goal. Our experiments show that the meta-learned prompt can be effectively adapted even with a simple symmetry loss. In addition, we experiment on four cross-dataset validations to show the remarkable advantages of the proposed method.
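As a rough, hypothetical sketch of test-time prompt tuning with a frozen backbone (the feature dimension, toy backbone, and symmetry loss below are illustrative assumptions, not the paper's architecture), only a tiny prompt vector is optimized while all backbone weights stay fixed.

import torch
import torch.nn as nn

class PromptedGazeModel(nn.Module):
    # A frozen feature extractor plus a tiny learnable prompt added to its features.
    def __init__(self, backbone, feat_dim=512):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)                    # backbone stays frozen
        self.prompt = nn.Parameter(torch.zeros(feat_dim))  # far below 1% of a ResNet18
        self.head = nn.Linear(feat_dim, 2)             # (yaw, pitch)

    def forward(self, x):
        return self.head(self.backbone(x) + self.prompt)

# Toy backbone standing in for a pre-trained gaze feature extractor (assumption).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
model = PromptedGazeModel(backbone)
optimizer = torch.optim.Adam([model.prompt], lr=1e-3)  # only the prompt is updated

images = torch.rand(8, 3, 32, 32)
flipped = torch.flip(images, dims=[-1])
gaze, gaze_flip = model(images), model(flipped)
# Illustrative unsupervised symmetry loss: yaw should negate under horizontal flips.
loss = (gaze[:, 0] + gaze_flip[:, 0]).abs().mean()
loss.backward()
optimizer.step()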



Paperid:401
Authors:Jiaming Liu, Yue Wu, Maoguo Gong, Qiguang Miao, Wenping Ma, Cai Xu, Can Qin
Xidian University, Xidian University, Xidian University, Xidian University, Xidian University, Xidian University, Northeastern University
Abstract:
3D Single Object Tracking (SOT) is a forefront task in computer vision, proving essential for applications like autonomous driving. Sparse and occluded data in scene point clouds introduce variations in the appearance of tracked objects, adding complexity to the task. In this research, we unveil M3SOT, a novel 3D SOT framework, which synergizes multiple input frames (template sets), multiple receptive fields (continuous contexts), and multiple solution spaces (distinct tasks) in ONE model. Remarkably, M3SOT pioneers modeling temporality, contexts, and tasks directly from point clouds, revisiting a perspective on the key factors influencing SOT. To this end, we design a transformer-based network centered on point cloud targets in the search area, aggregating diverse contextual representations and propagating target cues by employing historical frames. As M3SOT spans varied processing perspectives, we have streamlined the network, trimming its depth and optimizing its structure, to ensure a lightweight and efficient deployment for SOT applications. We posit that, backed by practical construction, M3SOT sidesteps the need for complex frameworks and auxiliary components to deliver sterling results. Extensive experiments on benchmarks such as KITTI, nuScenes, and Waymo Open Dataset demonstrate that M3SOT achieves state-of-the-art performance at 38 FPS. Our code and models are available at https://github.com/ywu0912/TeamCode.git.



Paperid:402
Authors:Jiaqi Liu, Kai Wu, Qiang Nie, Ying Chen, Bin-Bin Gao, Yong Liu, Jinbao Wang, Chengjie Wang, Feng Zheng
Southern University of Science and Technology, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Southern University of Science and Technology, Tencent Youtu Lab Shanghai Jiao Tong University, Southern University of Science and Technology
Abstract:
Unsupervised Anomaly Detection (UAD) with incremental training is crucial in industrial manufacturing, as unpredictable defects make obtaining sufficient labeled data infeasible. However, continual learning methods primarily rely on supervised annotations, and their application in UAD is limited due to the absence of supervision. Current UAD methods train separate models for different classes sequentially, leading to catastrophic forgetting and a heavy computational burden. To address this issue, we introduce a novel Unsupervised Continual Anomaly Detection framework called UCAD, which equips UAD with continual learning capability through contrastively learned prompts. In the proposed UCAD, we design a Continual Prompting Module (CPM) by utilizing a concise key-prompt-knowledge memory bank to guide task-invariant 'anomaly' model predictions using task-specific 'normal' knowledge. Moreover, Structure-based Contrastive Learning (SCL) is designed with the Segment Anything Model (SAM) to improve prompt learning and anomaly segmentation results. Specifically, by treating SAM's masks as structure, we draw features within the same mask closer and push others apart for general feature representations. We conduct comprehensive experiments and set the benchmark on unsupervised continual anomaly detection and segmentation, demonstrating that our method is significantly better than existing anomaly detection methods, even with rehearsal training. The code will be available at https://github.com/shirowalker/UCAD.



Paperid:403
Authors:Jin Liu, Huiyuan Fu, Chuanming Wang, Huadong Ma
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Exposure correction aims to enhance images suffering from improper exposure to achieve satisfactory visual effects. Despite recent progress, existing methods generally mitigate either overexposure or underexposure in input images, and they still struggle to handle images with mixed exposure, i.e., one image incorporating both overexposed and underexposed regions. The mixed exposure distribution is non-uniform and leads to varying representations, which makes it challenging to address in a unified process. In this paper, we introduce an effective Region-aware Exposure Correction Network (RECNet) that can handle mixed exposure by adaptively learning and bridging different regional exposure representations. Specifically, to address the challenge posed by mixed exposure disparities, we develop a region-aware de-exposure module that effectively translates regional features of mixed exposure scenarios into an exposure-invariant feature space. Simultaneously, as the de-exposure operation inevitably reduces discriminative information, we introduce a mixed-scale restoration unit that integrates exposure-invariant features and unprocessed features to recover local information. To further achieve a uniform exposure distribution in the global image, we propose an exposure contrastive regularization strategy under the constraints of intra-regional exposure consistency and inter-regional exposure continuity. Extensive experiments are conducted on various datasets, and the experimental results demonstrate the superiority and generalization of our proposed method. The code is released at: https://github.com/kravrolens/RECNet.



Paperid:404
Authors:Jinxiu Liu, Qi Liu
South China University of Technology, South China University of Technology
Abstract:
Image generation tasks have achieved remarkable performance using large-scale diffusion models. However, these models are limited in capturing the abstract relations (viz., interactions excluding positional relations) among multiple entities of complex scene graphs. Two main problems exist: 1) they fail to depict more concise and accurate interactions via abstract relations; 2) they fail to generate complete entities. To address this, we propose a novel Relation-aware Compositional Contrastive Control Diffusion method, dubbed R3CD, that leverages large-scale diffusion models to learn abstract interactions from scene graphs. Herein, a scene graph transformer based on node and edge encoding is first designed to perceive both local and global information from input scene graphs, whose embeddings are initialized by a T5 model. Then a joint contrastive loss based on attention maps and denoising steps is developed to control the diffusion model to understand and further generate images whose spatial structures and interaction features are consistent with a priori relations. Extensive experiments are conducted on two datasets, Visual Genome and COCO-Stuff, and demonstrate that the proposed method outperforms existing models in both quantitative and qualitative metrics, generating more realistic and diverse images according to different scene graph specifications.



Paperid:405
Authors:Jun Liu, Jiantao Zhou, Jiandian Zeng, Jinyu Tian
State Key Laboratory of Internet of Things for Smart City,Department of Computer and Information Science,University of Macau, State Key Laboratory of Internet of Things for Smart City,Department of Computer and Information Science,University of Macau, Institute of Artificial Intelligence and Future Networks, Beijing Normal University, School of Computer Science and Engineering, Macau University of Science and Technology
Abstract:
This work investigates efficient score-based black-box adversarial attacks with a high Attack Success Rate (ASR) and good generalizability. We design a novel attack method based on a Disentangled Feature space, called DifAttack, which differs significantly from existing ones operating over the entire feature space. Specifically, DifAttack first disentangles an image's latent feature into an adversarial feature and a visual feature, where the former dominates the adversarial capability of an image, while the latter largely determines its visual appearance. We train an autoencoder for the disentanglement by using pairs of clean images and their Adversarial Examples (AEs) generated from available surrogate models via white-box attack methods. Eventually, DifAttack iteratively optimizes the adversarial feature according to the query feedback from the victim model until a successful AE is generated, while keeping the visual feature unaltered. In addition, by avoiding the use of surrogate models' gradient information when optimizing AEs for black-box models, our proposed DifAttack inherently possesses better attack capability in the open-set scenario, where the training dataset of the victim model is unknown. Extensive experimental results demonstrate that our method achieves significant improvements in ASR and query efficiency simultaneously, especially in the targeted attack and open-set scenarios. The code is available at https://github.com/csjunjun/DifAttack.git.



Paperid:406
Authors:Lijun Liu, Rui Wang, Yuan Wang, Lihua Jing, Chuan Wang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Department of Electronic Engineering,Tsinghua University, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Abstract:
Open-Set Recognition (OSR) aims to accurately identify known classes while effectively rejecting unknown classes to guarantee reliability. Most existing OSR methods focus on learning in the spatial domain, where subtle texture and global structure are potentially intertwined. Empirical studies have shown that DNNs trained in the original spatial domain are inclined to over-perceive subtle texture. This biased semantic perception could lead to catastrophic over-confidence when predicting both known and unknown classes. To this end, we propose an innovative approach that decomposes the spatial domain into the frequency domain to separately consider global (low-frequency) and subtle (high-frequency) information, named Frequency Shuffling and Enhancement (FreSH). To alleviate the overfitting of subtle texture, we introduce the High-Frequency Shuffling (HFS) strategy that generates diverse high-frequency information and promotes the capture of low-frequency invariance. Moreover, to enhance the perception of global structure, we propose the Low-Frequency Residual (LFR) learning procedure that constructs a composite feature space, integrating low-frequency and original spatial features. Experiments on various benchmarks demonstrate that the proposed FreSH consistently outperforms the state-of-the-art by a considerable margin.
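A generic illustration of high-frequency shuffling as a data augmentation follows (a toy sketch under an assumed radius mask and batch mixing, not the exact HFS module): each image keeps its own low-frequency spectrum while borrowing high frequencies from another image in the batch.

import torch

def high_frequency_shuffle(images, radius_ratio=0.1):
    """Shuffle high-frequency content across a batch, keeping low frequencies.

    images: [B, C, H, W] tensor. radius_ratio controls the low-frequency band.
    """
    B, C, H, W = images.shape
    freq = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    # Circular low-frequency mask centered in the shifted spectrum.
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dist = (((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float()).sqrt()
    low_mask = (dist <= radius_ratio * min(H, W)).to(freq.dtype)
    # Keep each image's low frequencies, borrow high frequencies from a random image.
    perm = torch.randperm(B)
    mixed = freq * low_mask + freq[perm] * (1 - low_mask)
    return torch.fft.ifft2(torch.fft.ifftshift(mixed, dim=(-2, -1))).real

augmented = high_frequency_shuffle(torch.rand(8, 3, 64, 64))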



Paperid:407
Authors:Liu Liu, Anran Huang, Qi Wu, Dan Guo, Xun Yang, Meng Wang
Hefei University of Technology, Hefei University of Technology, Shanghai Jiao Tong University, Hefei University of Technology, University of Science and Technology of China, Hefei University of Technology
Abstract:
Our life is populated with articulated objects. Current category-level articulation estimation works largely focus on predicting part-level 6D poses on static point cloud observations. In this paper, we tackle the problem of category-level online robust and real-time 6D pose tracking of articulated objects, where we propose KPA-Tracker, a novel 3D KeyPoint based Articulated object pose Tracker. Given an RGB-D image or a partial point cloud at the current frame as well as the estimated per-part 6D poses from the last frame, our KPA-Tracker can effectively update the poses with learned 3D keypoints between the adjacent frames. Specifically, we first canonicalize the input point cloud and formulate the pose tracking as an inter-frame pose increment estimation task. To learn consistent and separate 3D keypoints for every rigid part, we build KPA-Gen that outputs the high-quality ordered 3D keypoints in an unsupervised manner. During pose tracking on the whole video, we further propose a keypoint-based articulation tracking algorithm that mines keyframes as references for accurate pose updating. We provide extensive experiments on validating our KPA-Tracker on various datasets ranging from synthetic point cloud observation to real-world scenarios, which demonstrates the superior performance and robustness of the KPA-Tracker. We believe that our work has the potential to be applied in many fields including robotics, embodied intelligence and augmented reality. All the datasets and codes are available at https://github.com/hhhhhar/KPA-Tracker.



Paperid:408
Authors:Ruicong Liu, Feng Lu
Beihang University, Beihang University
Abstract:
Gaze estimation has become a subject of growing interest in recent research. Most current methods rely on single-view facial images as input. Yet, it is hard for these approaches to handle large head angles, leading to potential inaccuracies in the estimation. To address this issue, adding a second-view camera can help better capture eye appearance. However, existing multi-view methods have two limitations. 1) They require multi-view annotations for training, which are expensive. 2) More importantly, during testing, the exact positions of the multiple cameras must be known and match those used in training, which limits the application scenario. To address these challenges, we propose a novel 1-view-to-2-views (1-to-2 views) adaptation solution in this paper, the Unsupervised 1-to-2 Views Adaptation framework for Gaze estimation (UVAGaze). Our method adapts a traditional single-view gaze estimator for flexibly placed dual cameras. Here, "flexibly" means that we place the dual cameras in arbitrary places regardless of the training data, without knowing their extrinsic parameters. Specifically, UVAGaze builds a dual-view mutual supervision adaptation strategy, which takes advantage of the intrinsic consistency of gaze directions between both views. In this way, our method can not only benefit from common single-view pre-training, but also achieve more advanced dual-view gaze estimation. The experimental results show that a single-view estimator, when adapted for dual views, can achieve much higher accuracy, especially in cross-dataset settings, with a substantial improvement of 47.0%. Project page: https://github.com/MickeyLLG/UVAGaze.



Paperid:409
Authors:Ruixin Liu, Zejian Yuan
Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
High-definition (HD) map construction requires a comprehensive understanding of traffic environments, encompassing centimeter-level localization and rich semantic information. Previous works face challenges in redundant point representation or high-complexity curve modeling. In this paper, we present a flexible yet effective map element detector that synthesizes hierarchical information with a compact Douglas-Peucker (DP) point representation in a transformer architecture for robust and reliable predictions. Specifically, our proposed representation approximates class-agnostic map elements with DP points, which are sparsely located at crucial positions of structures and can get rid of redundancy and complexity. Besides, we design a position constraint with uncertainty to avoid potential ambiguities. Moreover, pairwise-point shape matching constraints are proposed to balance local structural information of different scales. Experiments on the public nuScenes dataset demonstrate that our method outperforms current state-of-the-art methods. Extensive ablation studies validate each component of our method. Codes will be released at https://github.com/sweety121/DPFormer.
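The Douglas-Peucker representation mentioned above refers to the classical polyline simplification algorithm; a standard reference implementation is sketched below (the epsilon value and toy polyline are illustrative, and the detector itself predicts such sparse points rather than running this routine).

import numpy as np

def douglas_peucker(points, epsilon):
    # Simplify an ordered polyline, keeping only structurally crucial vertices.
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    if norm == 0:
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # Perpendicular distance of every vertex to the chord start-end.
        rel = points - start
        dists = np.abs(chord[0] * rel[:, 1] - chord[1] * rel[:, 0]) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        left = douglas_peucker(points[: idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

simplified = douglas_peucker([[0, 0], [1, 0.1], [2, -0.1], [3, 5], [4, 6]], epsilon=0.5)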



Paperid:410
Authors:Siqi Liu, Yong-Lu Li, Zhou Fang, Xinpeng Liu, Yang You, Cewu Lu
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiaotong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Embedding Human and Articulated Object Interaction (HAOI) in 3D is an important direction for a deeper human activity understanding. Different from previous works that use parametric and CAD models to represent humans and objects, in this work, we propose a novel 3D geometric primitive-based language to encode both humans and objects. Given our new paradigm, humans and objects are all compositions of primitives instead of heterogeneous entities. Thus, mutual information learning may be achieved between the limited 3D data of humans and different object categories. Moreover, considering the simplicity of the expression and the richness of the information it contains, we choose the superquadric as the primitive representation. To explore an effective embedding of HAOI for the machine, we build a new benchmark on 3D HAOI consisting of primitives together with their images and propose a task requiring machines to recover 3D HAOI using primitives from images. Moreover, we propose a baseline of single-view 3D reconstruction on HAOI. We believe this primitive-based 3D HAOI representation would pave the way for 3D HAOI studies. Our code and data are available at https://mvig-rhos.com/p3haoi.



Paperid:411
Authors:Wang Liu, Wei Gao, Xingming Mu
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University Peng Cheng Laboratory, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
Abstract:
Recent years have witnessed the success of deep learning methods in the quality enhancement of compressed point clouds. However, existing methods focus on geometry and attribute enhancement of single-frame point clouds. This paper proposes a novel compressed quality enhancement method for dynamic point clouds (DAE-MP). Specifically, we propose a fast inter-frame motion prediction module (IFMP) to explicitly estimate motion displacement and achieve inter-frame feature alignment. To maintain motion continuity between consecutive frames, we propose a motion consistency loss for supervised learning. Furthermore, a frequency component separation and fusion module is designed to adaptively extract rich frequency features. To the best of our knowledge, the proposed method is the first deep learning-based work to enhance the quality of compressed dynamic point clouds. Experimental results show that the proposed method can greatly improve the quality of compressed dynamic point clouds and provide a fast and efficient motion prediction plug-in for large-scale point clouds. For dynamic point cloud attributes with severe compression artifacts, our proposed DAE-MP method achieves up to a 0.52 dB (PSNR) performance gain. Moreover, the proposed IFMP module has a certain real-time processing ability for calculating the motion offset between dynamic point cloud frames.



Paperid:412
Authors:Wensi Liu, Xiao-Yu Tang, Chong Yang, Chunjie Yang
College of Control Science and Engineering, Zhejiang University, College of Control Science and Engineering, Zhejiang University, College of Control Science and Engineering, Zhejiang University, College of Control Science and Engineering, Zhejiang University
Abstract:
Semantic segmentation is one of the key tasks in the field of computer vision. However, capturing large numbers of pixel-level annotations is expensive. Semi-supervised learning can utilize both labeled and unlabeled data, providing new ideas for solving the problem of insufficient labeled data. In this work, we propose a data-reliability weighted multi-phase learning method for semi-supervised segmentation (RWMS). Under the framework of self-training, we train two different teacher models to evaluate the reliability of pseudo labels. By selecting reliable data at the image level and reweighting pseudo labels at the pixel level, multi-phase training is guided to focus on more reliable knowledge. Besides, we also inject strong data augmentations on unlabeled images during training. Through extensive experiments, we demonstrate that our method performs remarkably well compared to baseline methods and substantially outperforms them, by more than 3% on VOC and Cityscapes.
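A minimal sketch of reliability-weighted pseudo-label training with two teachers follows; the agreement rule, confidence threshold, and tensor shapes are assumptions for illustration rather than the exact RWMS weighting scheme.

import torch
import torch.nn.functional as F

def reliability_weighted_loss(student_logits, t1_logits, t2_logits, conf_thresh=0.7):
    """Pixel-wise pseudo-label loss weighted by the agreement of two teachers.

    All logits: [B, K, H, W]. Pseudo labels come from teacher 1; a pixel gets full
    weight only if both teachers agree and teacher 1 is confident, otherwise a
    down-weighted contribution proportional to teacher 1's confidence.
    """
    p1, p2 = t1_logits.softmax(1), t2_logits.softmax(1)
    conf1, pseudo = p1.max(1)                          # [B, H, W] confidence and labels
    pred2 = p2.argmax(1)
    agree = (pseudo == pred2) & (conf1 > conf_thresh)
    weight = torch.where(agree, torch.ones_like(conf1), conf1)
    loss = F.cross_entropy(student_logits, pseudo, reduction="none")   # [B, H, W]
    return (weight * loss).mean()

loss = reliability_weighted_loss(torch.randn(2, 21, 64, 64),
                                 torch.randn(2, 21, 64, 64),
                                 torch.randn(2, 21, 64, 64))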



Paperid:413
Authors:Xiaohui Liu, Zhilu Zhang, Xiaohe Wu, Chaoyu Feng, Xiaotao Wang, Lei Lei, Wangmeng Zuo
Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Independent Researcher, Independent Researcher, Independent Researcher, Harbin Institute of Technology, China
Abstract:
Real-world image de-weathering aims at removing various undesirable weather-related artifacts. Owing to the impossibility of capturing image pairs concurrently, existing real-world de-weathering datasets often exhibit inconsistent illumination, position, and textures between the ground-truth images and the input degraded images, resulting in imperfect supervision. Such non-ideal supervision negatively affects the training process of learning-based de-weathering methods. In this work, we attempt to address the problem with a unified solution for various inconsistencies. Specifically, inspired by information bottleneck theory, we first develop a Consistent Label Constructor (CLC) to generate a pseudo-label as consistent as possible with the input degraded image while removing most weather-related degradation. In particular, multiple adjacent frames of the current input are also fed into CLC to enhance the pseudo-label. Then we combine the original imperfect labels and pseudo-labels to jointly supervise the de-weathering model by the proposed Information Allocation Strategy (IAS). During testing, only the de-weathering model is used for inference. Experiments on two real-world de-weathering datasets show that our method helps existing de-weathering models achieve better performance. Code is available at https://github.com/1180300419/imperfect-deweathering.



Paperid:414
Authors:Xingyu Liu, Xu Cheng, Haoyu Chen, Hao Yu, Guoying Zhao
Nanjing University of Information Science and Technology, Nanjing University of Information Science and Technology, University of Oulu, Nanjing University of Information Science and Technology, University of Oulu
Abstract:
Sketch re-identification (Re-ID) seeks to match pedestrians' photos from surveillance videos with corresponding sketches. However, we observe that existing works still have two critical limitations: (i) cross- and intra-modality discrepancies hinder the extraction of modality-shared features, and (ii) the standard triplet loss fails to constrain the latent feature distribution in each modality with inadequate samples. To overcome the above issues, we propose a differentiable auxiliary learning network (DALNet) to explore a robust auxiliary modality for Sketch Re-ID. Specifically, for (i), we construct an auxiliary modality by using a dynamic auxiliary generator (DAG) to bridge the gap between sketch and photo modalities. The auxiliary modality highlights the described person in photos to mitigate background clutter and learns sketch style through style refinement. Moreover, a modality interactive attention module (MIA) is presented to align the features and learn the invariant patterns of the two modalities via the auxiliary modality. To address (ii), we propose a multi-modality collaborative learning scheme (MMCL) to align the latent distribution of the three modalities. An intra-modality circle loss in MMCL brings learned global and modality-shared features of the same identity closer in the case of insufficient samples within each modality. Extensive experiments verify the superior performance of our DALNet over the state-of-the-art methods for Sketch Re-ID, and its generalization in sketch-based image retrieval and sketch-photo face recognition tasks.



Paperid:415
Authors:Xingyu Liu, Pengfei Ren, Yuanyuan Gao, Jingyu Wang, Haifeng Sun, Qi Qi, Zirui Zhuang, Jianxin Liao
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Abstract:
Previous 3D hand pose estimation methods primarily rely on a single modality, either RGB or depth, and the comprehensive utilization of the dual modalities has not been extensively explored. RGB and depth data provide complementary information and thus can be fused to enhance the robustness of 3D hand pose estimation. However, there are two problems in applying existing fusion methods to 3D hand pose estimation: redundancy of dense feature fusion and ambiguity of visual features. First, pixel-wise feature interactions introduce high computational costs and ineffective calculations on invalid pixels. Second, visual features suffer from ambiguity due to color and texture similarities, as well as depth holes and noise caused by frequent hand movements, which interferes with modeling cross-modal correlations. In this paper, we propose Keypoint-Fusion for RGB-D based 3D hand pose estimation, which leverages the unique advantages of the dual modalities to mutually eliminate feature ambiguity, and performs cross-modal feature fusion in a more efficient way. Specifically, we focus cross-modal fusion on sparse yet informative spatial regions (i.e., keypoints). Meanwhile, by explicitly extracting relatively more reliable information as disambiguation evidence, the depth modality provides 3D geometric information for RGB feature pixels, and the RGB modality complements the precise edge information lost due to depth noise. Keypoint-Fusion achieves state-of-the-art performance on two challenging hand datasets, significantly decreasing the error compared with previous single-modal methods.



Paperid:416
Authors:Xiulong Liu, Sudipta Paul, Moitreya Chatterjee, Anoop Cherian
University of Washington, Seattle, WA, Samsung Research America, Mountain View, CA, Mitsubishi Electric Research Labs, Cambridge, MA, Mitsubishi Electric Research Labs, Cambridge, MA
Abstract:
Audio-visual navigation of an agent towards locating an audio goal is a challenging task, especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpreting free-form, potentially noisy responses from the oracle based on the audio-visual context. To enable such a capability, CAVEN is equipped with: (i) a trajectory forecasting network that is grounded in audio-visual cues to produce a potential trajectory to the estimated goal, and (ii) a natural language based question generation and reasoning network to pose an interactive question to the oracle or interpret the oracle's response to produce navigation instructions. To train the interactive modules, we present a large-scale dataset, AVN-Instruct, based on the Landmark-RxR dataset. To substantiate the usefulness of conversations, we present experiments on the benchmark audio-goal task using the SoundSpaces simulator under various noisy settings. Our results reveal that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate, especially in localizing new sound sources and against methods that use only uni-directional interaction.



Paperid:417
Authors:Yitian Liu, Zhouhui Lian
Wangxuan Institute of Computer Technology, Peking University, Beijing, P.R. China, Wangxuan Institute of Computer Technology, Peking University, Beijing, P.R. China
Abstract:
Few-shot font generation, especially for Chinese calligraphy fonts, is a challenging and ongoing problem. With the help of prior knowledge that is mainly based on glyph consistency assumptions, some recently proposed methods can synthesize high-quality Chinese glyph images. However, glyphs in calligraphy font styles often do not meet these assumptions. To address this problem, we propose a novel model, DeepCalliFont, for few-shot Chinese calligraphy font synthesis by integrating dual-modality generative models. Specifically, the proposed model consists of image synthesis and sequence generation branches, generating consistent results via a dual-modality representation learning strategy. The two modalities (i.e., glyph images and writing sequences) are properly integrated using a feature recombination module and a rasterization loss function. Furthermore, a new pre-training strategy is adopted to improve the performance by exploiting large amounts of uni-modality data. Both qualitative and quantitative experiments have been conducted to demonstrate the superiority of our method to other state-of-the-art approaches in the task of few-shot Chinese calligraphy font synthesis. The source code can be found at https://github.com/lsflyt-pku/DeepCalliFont.



Paperid:418
Authors:Yixin Liu, Kaidi Xu, Xun Chen, Lichao Sun
Lehigh University, Drexel University, Samsung Research America, Lehigh University
Abstract:
The open sourcing of large amounts of image data promotes the development of deep learning techniques. Along with this comes the privacy risk of these image datasets being exploited by unauthorized third parties to train deep learning models for commercial or illegal purposes. To avoid the abuse of data, a poisoning-based technique, the "unlearnable example", has been proposed to significantly degrade the generalization performance of models by adding imperceptible noise to the data. To further enhance its robustness against adversarial training, existing works leverage iterative adversarial training on both the defensive noise and the surrogate model. However, it still remains unknown whether the robustness of unlearnable examples primarily comes from the enhancement of the surrogate model or of the defensive noise. Observing that simply removing the adversarial perturbation from the training process of the defensive noise can improve the performance of robust unlearnable examples, we identify that solely the surrogate model's robustness contributes to the performance. Furthermore, we find that a negative correlation exists between the robustness of defensive noise and the protection performance, indicating defensive noise's instability issue. Motivated by this, to further boost robust unlearnable examples, we introduce Stable Error-Minimizing noise (SEM), which trains the defensive noise against random perturbation instead of the time-consuming adversarial perturbation to improve the stability of defensive noise. Through comprehensive experiments, we demonstrate that SEM achieves a new state-of-the-art performance on CIFAR-10, CIFAR-100, and the ImageNet Subset in terms of both effectiveness and efficiency.
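To make the contrast with adversarially trained noise concrete, here is a hypothetical sketch of one refinement round for error-minimizing noise trained against random perturbations; the radii, step size, and toy surrogate are placeholders rather than the released SEM configuration.

import torch
import torch.nn.functional as F

def update_defensive_noise(model, images, labels, noise, eps=8 / 255,
                           pert_radius=4 / 255, lr=1 / 255, steps=5):
    # Optimize the noise so the surrogate's loss stays low under random perturbations,
    # instead of crafting adversarial perturbations on the surrogate at every step.
    noise = noise.clone().detach()
    for _ in range(steps):
        noise.requires_grad_(True)
        delta = torch.empty_like(images).uniform_(-pert_radius, pert_radius)  # random, not adversarial
        loss = F.cross_entropy(model((images + noise + delta).clamp(0, 1)), labels)
        grad, = torch.autograd.grad(loss, noise)
        # Error-minimizing step on the noise, then projection onto the eps ball.
        noise = (noise.detach() - lr * grad.sign()).clamp(-eps, eps)
    return noise

# Toy surrogate and data standing in for a real setup (assumption).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
imgs, lbls = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
noise = update_defensive_noise(model, imgs, lbls, torch.zeros_like(imgs))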



Paperid:419
Authors:Yongxu Liu, Yinghui Quan, Guoyao Xiao, Aobo Li, Jinjian Wu
Xidian University, Xidian University, Xidian University, Xidian University, Xidian University
Abstract:
Quality assessment of images and videos emphasizes both local details and global semantics, whereas general data sampling methods (e.g., resizing, cropping, or grid-based fragments) fail to capture them simultaneously. To address this deficiency, current approaches have to adopt multi-branch models and take multi-resolution data as input, which increases model complexity. In this work, instead of stacking up models, a more elegant data sampling method (named SAMA, for scaling and masking) is explored, which compacts both local and global content into a regular input size. The basic idea is to scale the data into a pyramid first, and then reduce the pyramid to a regular data dimension with a masking strategy. Benefiting from the spatial and temporal redundancy in images and videos, the processed data maintains multi-scale characteristics within a regular input size, and thus can be processed by a single-branch model. We verify the sampling method in image and video quality assessment. Experiments show that our sampling method can significantly improve the performance of current single-branch models and achieves competitive performance against multi-branch models without extra model complexity. The source code will be available at https://github.com/Sissuire/SAMA.
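A toy sketch of the scaling-and-masking idea follows (the alternating cell assignment, patch size, and output resolution are illustrative assumptions): fragments from the original scale and patches from a globally resized copy are packed into one regular-sized input.

import torch
import torch.nn.functional as F

def scale_and_mask(image, out_size=224, patch=32, rng=None):
    """Pack multi-scale content into one regular-sized input via masking.

    Half of the output cells hold fragments cropped from the original scale
    (local detail); the other half come from a globally resized copy (semantics).
    """
    rng = torch.Generator().manual_seed(0) if rng is None else rng
    C, H, W = image.shape
    grid = out_size // patch
    small = F.interpolate(image[None], size=(out_size, out_size),
                          mode="bilinear", align_corners=False)[0]
    out = torch.zeros(C, out_size, out_size)
    for gy in range(grid):
        for gx in range(grid):
            use_local = (gy * grid + gx) % 2 == 0      # alternate local/global cells
            if use_local:
                y = torch.randint(0, H - patch + 1, (1,), generator=rng).item()
                x = torch.randint(0, W - patch + 1, (1,), generator=rng).item()
                frag = image[:, y:y + patch, x:x + patch]
            else:
                frag = small[:, gy * patch:(gy + 1) * patch, gx * patch:(gx + 1) * patch]
            out[:, gy * patch:(gy + 1) * patch, gx * patch:(gx + 1) * patch] = frag
    return out

packed = scale_and_mask(torch.rand(3, 720, 1280))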



Paperid:420
Authors:Yuchun Liu, Benjamin Planche, Meng Zheng, Zhongpai Gao, Pierre Sibut-Bourde, Fan Yang, Terrence Chen, Ziyan Wu
United Imaging Intelligence, United Imaging Intelligence, United Imaging Intelligence, United Imaging Intelligence, United Imaging Intelligence, United Imaging Intelligence, United Imaging Intelligence, United Imaging Intelligence
Abstract:
Deep implicit functions (DIFs) have emerged as a potent and articulate means of representing 3D shapes. However, methods modeling object categories or non-rigid entities have mainly focused on single-object scenarios. In this work, we propose MODIF, a multi-object deep implicit function that jointly learns the deformation fields and instance-specific latent codes for multiple objects at once. Our emphasis is on non-rigid, non-interpenetrating entities such as organs. To effectively capture the interrelation between these entities and ensure precise, collision-free representations, our approach facilitates signaling between category-specific fields to adequately rectify shapes. We also introduce novel inter-object supervision: an attraction-repulsion loss is formulated to refine contact regions between objects. Our approach is demonstrated on various medical benchmarks, involving modeling different groups of intricate anatomical entities. Experimental results illustrate that our model can proficiently learn the shape representation of each organ and their relations to others, to the point that shapes missing from unseen instances can be consistently recovered by our method. Finally, MODIF can also propagate semantic information throughout the population via accurate point correspondences.
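The attraction-repulsion idea can be illustrated with a simple point-sample loss (a hypothetical sketch; the distance thresholds and nearest-neighbor formulation are assumptions, not the MODIF formulation): points that interpenetrate are pushed apart while points in the contact band are pulled together.

import torch

def attraction_repulsion_loss(pts_a, pts_b, contact_dist=0.02, margin=0.005):
    """Toy contact-region loss between two non-interpenetrating shapes.

    pts_a, pts_b: [N, 3] and [M, 3] surface samples of two objects. Points closer
    than `margin` are pushed apart (repulsion); points within the contact band
    [margin, contact_dist] are gently pulled together (attraction).
    """
    d = torch.cdist(pts_a, pts_b)                     # [N, M] pairwise distances
    nn_ab = d.min(dim=1).values                       # nearest distance from A to B
    repulsion = torch.relu(margin - nn_ab).mean()     # too close: push apart
    in_contact = (nn_ab > margin) & (nn_ab < contact_dist)
    if in_contact.any():
        attraction = (nn_ab[in_contact] - margin).mean()
    else:
        attraction = nn_ab.new_tensor(0.0)
    return repulsion + attraction

loss = attraction_repulsion_loss(torch.rand(256, 3), torch.rand(256, 3))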



Paperid:421
Authors:Yuhao Liu, Zhanghan Ke, Ke Xu, Fang Liu, Zhenwei Wang, Rynson W.H. Lau
City University of Hong Kong, City University of Hong Kong, City University of Hong Kong, City University of Hong Kong, City University of Hong Kong, City University of Hong Kong
Abstract:
Removing shadows requires an understanding of both lighting conditions and object textures in a scene. Existing methods typically learn pixel-level color mappings between shadow and non-shadow images, in which the joint modeling of lighting and object textures is implicit and inadequate. We observe that in a shadow region, the degradation degree of object textures depends on the local illumination, while simply enhancing the local illumination cannot fully recover the attenuated textures. Based on this observation, we propose to condition the restoration of attenuated textures on the corrected local lighting in the shadow region. Specifically, we first design a shadow-aware decomposition network to explicitly estimate the illumination and reflectance layers of shadow regions. We then propose a novel bilateral correction network to recast the lighting of shadow regions in the illumination layer via a novel local lighting correction module, and to restore the textures conditioned on the corrected illumination layer via a novel illumination-guided texture restoration module. We further annotate pixel-wise shadow masks for the public SRD dataset, which originally contains only image pairs. Experiments on three benchmarks show that our method outperforms existing state-of-the-art shadow removal methods. Project page: yuhaoliu7456.github.io/RRL-Net.



Paperid:422
Authors:Yutong Liu, Haijiang Zhu, Mengting Liu, Huaiyuan Yu, Zihan Chen, Jie Gao
Beijing University of Chemical Technology, Beijing University of Chemical Technology, Beijing University of Chemical Technology, Beijing University of Chemical Technology, Beijing University of Chemical Technology, Beijing University of Chemical Technology
Abstract:
Medical image segmentation methods based on deep learning networks are mainly divided into CNNs and Transformers. However, CNNs struggle to capture long-distance dependencies, while Transformers suffer from high computational complexity and poor local feature learning. To efficiently extract and fuse local features and long-range dependencies, this paper proposes Rolling-Unet, a CNN model combined with MLPs. Specifically, we propose the core R-MLP module, which is responsible for learning long-distance dependencies in a single direction over the whole image. By controlling and combining R-MLP modules in different directions, OR-MLP and DOR-MLP modules are formed to capture long-distance dependencies in multiple directions. Further, the Lo2 block is proposed to encode both local context information and long-distance dependencies without excessive computational burden. The Lo2 block has the same parameter size and computational complexity as a 3×3 convolution. The experimental results on four public datasets show that Rolling-Unet achieves superior performance compared to the state-of-the-art methods.
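As an illustration of mixing features along a single spatial direction with an MLP (a generic token-mixing sketch, not the paper's R-MLP definition; the axis choice and hidden size are assumptions), the module below mixes along the width axis and can be combined with a counterpart for the height axis to cover multiple directions.

import torch
import torch.nn as nn

class DirectionalMLP(nn.Module):
    # Mix features along one spatial axis (here, width) with a shared token-mixing MLP.
    def __init__(self, width, hidden=None):
        super().__init__()
        hidden = hidden or width
        self.mix = nn.Sequential(nn.Linear(width, hidden), nn.GELU(),
                                 nn.Linear(hidden, width))

    def forward(self, x):            # x: [B, C, H, W]
        return x + self.mix(x)       # the Linear layers act on the last (width) dimension

y = DirectionalMLP(width=64)(torch.rand(2, 32, 48, 64))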



Paperid:423
Authors:Yuxuan Liu, Haizhou Ai, Junliang Xing, Xuri Li, Xiaoyi Wang, Pin Tao
Key Laboratory of Pervasive Computing, Ministry of Education Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China, Key Laboratory of Pervasive Computing, Ministry of Education Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China, Key Laboratory of Pervasive Computing, Ministry of Education Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China, Beijing University of Technology, Beijing 100124, China, Independent Researcher, Haidian District, Beijing, China, Key Laboratory of Pervasive Computing, Ministry of Education Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Abstract:
Multiple views play a vital role in 3D pose estimation tasks. Ideally, multi-view 3D pose estimation tasks should directly utilize naturally collected videos for pose estimation. However, due to the constraints of video synchronization, existing methods often use expensive hardware devices to synchronize the initiation of cameras, which restricts most 3D pose collection scenarios to indoor settings. Some recent works learn deep neural networks to align desynchronized datasets derived from synchronized cameras and can only produce frame-level accuracy. For fractional frame video synchronization, this work proposes an Inter-Frame and Intra-Frame Desynchronized Dataset (IFID), which labels fractional time intervals between two video clips. IFID is the first dataset that annotates inter-frame and intra-frame intervals, with a total of 382,500 video clips annotated, making it the largest dataset to date. We also develop a novel model based on the Transformer architecture, named InSynFormer, for synchronizing inter-frame and intra-frame. Extensive experimental evaluations demonstrate its promising performance. The dataset and source code of the model are available at https://github.com/yuxuan-cser/InSynFormer.



Paperid:424
Authors:Yuzhi Liu, Huisi Wu, Jing Qin
Shenzhen University, Shenzhen University, The Hong Kong Polytechnic University
Abstract:
Recent advancements in deep learning have greatly improved the efficiency of auxiliary medical diagnostics. However, concerns over patient privacy and data annotation costs restrict the viability of centralized training models. In response, federated semi-supervised learning has garnered substantial attention from medical institutions. However, it faces challenges arising from knowledge discrepancies among local clients and class imbalance in non-independent and identically distributed data. Existing methods like class balance adaptation for addressing class imbalance often overlook low-confidence yet valuable rare samples in unlabeled data and may compromise client privacy. To address these issues, we propose a novel framework with class awareness balance and dual teacher distillation called FedCD. FedCD introduces a global-local framework to balance and purify global and local knowledge. Additionally, we introduce a novel class awareness balance module to effectively explore potential rare classes and encourage balanced learning in unlabeled clients. Importantly, our approach prioritizes privacy protection by only exchanging network parameters during communication. Experimental results on two medical datasets under various settings demonstrate the effectiveness of FedCD. The code is available at https://github.com/YunzZ-Liu/FedCD.



Paperid:425
Authors:Zhaochen Liu, Zhixuan Li, Tingting Jiang
National Engineering Research Center of Visual Technology, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University AI Innovation Center, School of Computer Science, Peking University, School of Computer Science and Engineering, Nanyang Technological University, National Engineering Research Center of Visual Technology, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University National Biomedical Imaging Center, Peking University
Abstract:
Perceiving the complete shape of occluded objects is essential for human and machine intelligence. While the amodal segmentation task is to predict the complete mask of partially occluded objects, it is time-consuming and labor-intensive to annotate the pixel-level ground truth amodal masks. Box-level supervised amodal segmentation addresses this challenge by relying solely on ground truth bounding boxes and instance classes as supervision, thereby alleviating the need for exhaustive pixel-level annotations. Nevertheless, current box-level methodologies encounter limitations in generating low-resolution masks and imprecise boundaries, failing to meet the demands of practical real-world applications. We present a novel solution to tackle this problem by introducing a directed expansion approach from visible masks to corresponding amodal masks. Our approach involves a hybrid end-to-end network based on the overlapping region - the area where different instances intersect. Diverse segmentation strategies are applied for overlapping regions and non-overlapping regions according to distinct characteristics. To guide the expansion of visible masks, we introduce an elaborately-designed connectivity loss for overlapping regions, which leverages correlations with visible masks and facilitates accurate amodal segmentation. Experiments are conducted on several challenging datasets and the results show that our proposed method can outperform existing state-of-the-art methods with large margins.



Paperid:426
Authors:Zhihang Liu, Jun Li, Hongtao Xie, Pandeng Li, Jiannan Ge, Sun-Ao Liu, Guoqing Jin
University of Science and Technology of China, People's Daily Online, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, People's Daily Online
Abstract:
Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, these existing strategies are often sub-optimal since they ignore the modality imbalance problem, i.e., the semantic richness inherent in videos far exceeds that of a given limited-length sentence. Therefore, in pursuit of better alignment, a natural idea is enhancing the video modality to filter out query-irrelevant semantics, and enhancing the text modality to capture more segment-relevant knowledge. In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment through enhancing features at two levels. First, we enhance the video modality at the frame-word level through word reconstruction. This strategy emphasizes the portions associated with query words in frame-level features while suppressing irrelevant parts. Therefore, the enhanced video contains less redundant semantics and is more balanced with the textual modality. Second, we enhance the textual modality at the segment-sentence level by learning complementary knowledge from context sentences and ground-truth segments. With the knowledge added to the query, the textual modality thus maintains more meaningful semantics and is more balanced with the video modality. By implementing two levels of MESM, the semantic information from both modalities is more balanced to align, thereby bridging the modality gap. Experiments on three widely used benchmarks, including the out-of-distribution settings, show that the proposed framework achieves new state-of-the-art performance with notable generalization ability (e.g., 4.42% and 7.69% average gains of R1@0.7 on Charades-STA and Charades-CG). The code will be available at https://github.com/lntzm/MESM.



Paperid:427
Authors:Zhiyue Liu, Jinyuan Liu, Fanrong Ma
Guangxi University, Guangxi University, Guangxi University
Abstract:
Although image captioning models have made significant advancements in recent years, the majority of them heavily depend on high-quality datasets containing paired images and texts, which are costly to acquire. Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings. However, not only does a modality gap exist between CLIP text and image features, but a discrepancy also arises between training and inference due to the unavailability of real-world images, which hinders the cross-modal alignment in text-only captioning. This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs. A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space. Furthermore, textual information is gathered to represent image features, resulting in the image features with various semantics and the bridged modality gap. To unify training and inference, synthetic image features would serve as the training prefix for the language decoder, while real images are used for inference. Additionally, salient objects in images are detected as assistance to enhance the learning of modality alignment. Experimental results demonstrate that our method obtains state-of-the-art performance on benchmark datasets.



Paperid:428
Authors:Wei Lou, Guanbin Li, Xiang Wan, Haofeng Li
Shenzhen Research Institute of Big Data, Shenzhen, China The Chinese University of Hong Kong, Shenzhen, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China GuangDong Province Key Laboratory of Information Security Technology, Shenzhen Research Institute of Big Data, Shenzhen, China, Shenzhen Research Institute of Big Data, Shenzhen, China
Abstract:
Nuclei classification is a critical step in computer-aided diagnosis with histopathology images. In the past, various methods have employed graph neural networks (GNN) to analyze cell graphs that model inter-cell relationships by considering nuclei as vertices. However, they are limited by the GNN mechanism that only passes messages among local nodes via fixed edges. To address the issue, we develop a cell graph transformer (CGT) that treats nodes and edges as input tokens to enable learnable adjacency and information exchange among all nodes. Nevertheless, training the transformer with a cell graph presents another challenge. Poorly initialized features can lead to noisy self-attention scores and inferior convergence, particularly when processing the cell graphs with numerous connections. Thus, we further propose a novel topology-aware pretraining method that leverages a graph convolutional network (GCN) to learn a feature extractor. The pre-trained features may suppress unreasonable correlations and hence ease the fine-tuning of CGT. Experimental results suggest that the proposed cell graph transformer with topology-aware pretraining significantly improves the nuclei classification results, and achieves state-of-the-art performance. Code and models are available at https://github.com/lhaof/CGT



Paperid:429
Authors:Changsheng Lu, Piotr Koniusz
The Australian National University, The Australian National University Data61/CSIRO
Abstract:
Recently, prompt-based models have become popular across various language and vision tasks. Following that trend, we perform few-shot keypoint detection (FSKD) by detecting any keypoints in a query image, given the prompts formed by support images and keypoints. FSKD can be applied to detecting keypoints and poses of diverse animal species. In order to maintain flexibility in detecting a varying number of keypoints, existing FSKD approaches modulate the query feature map per support keypoint, then detect the corresponding keypoint from each modulated feature via a detection head. Such a separation of modulation-detection makes the model heavy and slow when the number of keypoints increases. To overcome this issue, we design a novel light-weight detector which combines modulation and detection into one step, with the goal of reducing the computational cost without a drop in performance. Moreover, to bridge the large domain shift of keypoints between seen and unseen species, we further improve our model with mean feature based contrastive learning to align keypoint distributions, resulting in better keypoint representations for FSKD. Compared to the state of the art, our light-weight detector reduces the number of parameters by 50%, training/test time by 50%, and achieves a 5.62% accuracy gain on 1-shot novel keypoint detection on the Animal pose dataset. Our model is also robust to the number of keypoints and saves memory when evaluating a large number of keypoints (e.g., 1000) per episode.



Paperid:430
Authors:Hui Lu, Albert Ali Salah, Ronald Poppe
Utrecht University, Utrecht University, Utrecht University
Abstract:
A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input. To address this challenge, we propose TCNet, a hybrid network that effectively models spatio-temporal information from Trajectories and Correlated regions. TCNet's trajectory module transforms frames into aligned trajectories composed of continuous visual tokens. This facilitates extracting region trajectory patterns. In addition, for a query token, self-attention is learned along the trajectory. As such, our network can also focus on fine-grained spatio-temporal patterns, such as finger movement, of a region in motion. TCNet's correlation module utilizes a novel dynamic attention mechanism that filters out irrelevant frame regions. Additionally, it assigns dynamic key-value tokens from correlated regions to each query. Both innovations significantly reduce the computation cost and memory. We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL, and CSL-Daily. Our results demonstrate that TCNet consistently achieves state-of-the-art performance. For example, we improve over the previous state-of-the-art by 1.5% and 1.0% word error rate on PHOENIX14 and PHOENIX14-T, respectively. Code is available at https://github.com/hotfinda/TCNet



Paperid:431
Authors:Yanzuo Lu, Meng Shen, Andy J Ma, Xiaohua Xie, Jian-Huang Lai
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Guangdong Province Key Laboratory of Information Security Technology, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Guangdong Province Key Laboratory of Information Security Technology, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Guangdong Province Key Laboratory of Information Security Technology, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China Pazhou Lab (HuangPu), Guangzhou, China
Abstract:
Universal domain adaptation (UniDA) is a practical but challenging problem, in which information about the relation between the source and the target domains is not given for knowledge transfer. Existing UniDA methods may suffer from the problems of overlooking intra-domain variations in the target domain and difficulty in separating similar known and unknown classes. To address these issues, we propose a novel Mutual Learning Network (MLNet) with neighborhood invariance for UniDA. In our method, confidence-guided invariant feature learning with self-adaptive neighbor selection is designed to reduce the intra-domain variations for more generalizable feature representation. By using the cross-domain mixup scheme for better unknown-class identification, the proposed method compensates for the misidentified known-class errors by mutual learning between the closed-set and open-set classifiers. Extensive experiments on three publicly available benchmarks demonstrate that our method achieves the best results compared to the state of the art in most cases and significantly outperforms the baseline across all four settings in UniDA. Code is available at https://github.com/YanzuoLu/MLNet.



Paperid:432
Authors:Yifan Lu, Ziqi Zhang, Chunfeng Yuan, Peng Li, Yan Wang, Bing Li, Weiming Hu
State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Alibaba Group Zhejiang Linkheer Science and Technology Co., Ltd., Alibaba Group Zhejiang Linkheer Science and Technology Co., Ltd., State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA School of Artificial Intelligence, University of Chinese Academy of Sciences School of Information Science and Technology, ShanghaiTech University
Abstract:
Diverse video captioning aims to generate a set of sentences to describe the given video in various aspects. Mainstream methods are trained with independent pairs of a video and a caption from its ground-truth set without exploiting the intra-set relationship, resulting in low diversity of generated captions. Different from them, we formulate diverse captioning into a semantic-concept-guided set prediction (SCG-SP) problem by fitting the predicted caption set to the ground-truth set, where the set-level relationship is fully captured. Specifically, our set prediction consists of two synergistic tasks, i.e., caption generation and an auxiliary task of concept combination prediction providing extra semantic supervision. Each caption in the set is attached to a concept combination indicating the primary semantic content of the caption and facilitating element alignment in set prediction. Furthermore, we apply a diversity regularization term on concepts to encourage the model to generate semantically diverse captions with various concept combinations. These two tasks share multiple semantics-specific encodings as input, which are obtained by iterative interaction between visual features and conceptual queries. The correspondence between the generated captions and specific concept combinations further guarantees the interpretability of our model. Extensive experiments on benchmark datasets show that the proposed SCG-SP achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics.



Paperid:433
Authors:Yiheng Lu, Ziyu Guan, Yaming Yang, Wei Zhao, Maoguo Gong, Cai Xu
Xidian University, Xidian University, Xidian University, Xidian University, Xidian University, Xidian University
Abstract:
Structured pruning techniques have achieved great compression performance on convolutional neural networks for image classification tasks. However, the majority of existing methods are sensitive to the model parameters, and their pruning results may be unsatisfactory when the original model is trained poorly. That is, they need the original model to be fully trained, to obtain useful weight information. This is time-consuming, and makes the effectiveness of the pruning results dependent on the degree of model optimization. To address the above issue, we propose a novel metric named Average Filter Information Entropy (AFIE). It decomposes the weight matrix of each layer into a low-rank space, and quantifies the filter importance based on the distribution of the normalized eigenvalues. Intuitively, the eigenvalues capture the covariance among filters, and therefore could be a good guide for pruning. Since the distribution of eigenvalues is robust to the updating of parameters, AFIE can yield a stable evaluation for the importance of each filter no matter whether the original model is trained fully. We implement our AFIE-based pruning method for three popular CNN models of AlexNet, VGG-16, and ResNet-50, and test them on three widely-used image datasets MNIST, CIFAR-10, and ImageNet, respectively. The experimental results are encouraging. We surprisingly observe that for our methods, even when the original model is trained with only one epoch, the AFIE score of each filter remains identical to the result when the model is fully trained. This indicates the effectiveness of the proposed pruning method.
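The description above suggests AFIE can be computed from the eigenvalue spectrum of each layer's flattened weight matrix. A minimal sketch of one way to compute such an entropy-style layer score is given below; it is an interpretation of the abstract rather than the authors' code, and it does not specify how the paper assigns the score to individual filters.

import torch

def layer_eigenvalue_entropy(weight, eps=1e-12):
    # weight: (out_channels, in_channels, k, k) convolution kernel.
    # Eigenvalues of W W^T equal the squared singular values of the flattened W.
    w = weight.flatten(start_dim=1)                # (out_channels, in_channels*k*k)
    eigvals = torch.linalg.svdvals(w) ** 2
    p = eigvals / eigvals.sum().clamp(min=eps)     # normalize to a distribution
    entropy = -(p * torch.log(p + eps)).sum()
    # Scale by log(n) so layers with different filter counts are comparable
    # (assumes the layer has more than one filter).
    return entropy / torch.log(torch.tensor(float(p.numel())))

A nearly flat spectrum (high entropy) indicates the filters contribute more evenly, whereas a peaked spectrum indicates redundancy that pruning can exploit.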



Paperid:434
Authors:Zhan Lu, Qian Zheng, Boxin Shi, Xudong Jiang
School of Electrical and Electronic Engineering, Nanyang Technological University, College of Computer Science and Technology, Zhejiang University The State Key Lab of Brain-Machine Intelligence, Zhejiang University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, School of Electrical and Electronic Engineering, Nanyang Technological University
Abstract:
Panoramic imaging research on geometry recovery and High Dynamic Range (HDR) reconstruction has become a trend with the development of Extended Reality (XR). Neural Radiance Fields (NeRF) provide a promising scene representation for both tasks without requiring extensive prior data. However, in the case of inputting sparse Low Dynamic Range (LDR) panoramic images, NeRF often degrades with under-constrained geometry and is unable to reconstruct HDR radiance from LDR inputs. We observe that the radiance from each pixel in panoramic images can be modeled as both a signal to convey scene lighting information and a light source to illuminate other pixels. Hence, we propose the irradiance fields from sparse LDR panoramic images, which increase the observation counts for faithful geometry recovery and leverage the irradiance-radiance attenuation for HDR reconstruction. Extensive experiments demonstrate that the irradiance fields outperform state-of-the-art methods on both geometry recovery and HDR reconstruction and validate their effectiveness. Furthermore, we show a promising byproduct of spatially-varying lighting estimation. The code is available at https://github.com/Lu-Zhan/Pano-NeRF.



Paperid:435
Authors:Ziyang Lu, Yunqiang Pei, Guoqing Wang, Peiwei Li, Yang Yang, Yinjie Lei, Heng Tao Shen
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Science and Technology of China, University of Electronic Science and Technology of China, Sichuan University, University of Electronic Science and Technology of China
Abstract:
Aiming to link natural language descriptions to specific regions in a 3D scene represented as 3D point clouds, 3D visual grounding is a fundamental task for human-robot interaction. Recognition errors can significantly impact the overall accuracy and thus degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from low recognition accuracy in cases of multiple adjacent objects with similar appearance. To address this issue, this work intuitively introduces human-robot interaction as a cue to facilitate the development of 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. Then a new dataset called ScanERU is constructed to evaluate the effectiveness of this idea. Different from existing datasets, our ScanERU dataset is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to enlighten the research of ERU. Experimental results demonstrate the superiority of the proposed method, especially in the recognition of multiple identical objects. Our codes and dataset are available in the ScanERU repository.



Paperid:436
Authors:Dai Luanyuan, Xiaoyu Du, Hanwang Zhang, Jinhui Tang
Nanjing University of Science and Technology, China, Nanjing University of Science and Technology, China, Nanyang Technological University, Singapore, Nanjing University of Science and Technology, China
Abstract:
Learning correspondences aims to find correct correspondences (inliers) from the initial correspondence set with an uneven correspondence distribution and a low inlier rate, which can be regarded as graph data. Recent advances usually use graph neural networks (GNNs) to build a single type of graph or simply stack local graphs into the global one to complete the task. However, they ignore the complementary relationship between different types of graphs, which can effectively capture potential relationships among sparse correspondences. To address this problem, we propose MGNet to effectively combine multiple complementary graphs. To obtain information integrating implicit and explicit local graphs, we construct local graphs from implicit and explicit aspects and combine them effectively, which is used to build a global graph. Moreover, we propose Graph Soft Degree Attention (GSDA) to make full use of all sparse correspondence information at once in the global graph, which can capture and amplify discriminative features. Extensive experiments demonstrate that MGNet outperforms state-of-the-art methods in different visual tasks. The code is provided at https://github.com/DAILUANYUAN/MGNet-2024AAAI.



Paperid:437
Authors:Ao Luo, Linxin Song, Keisuke Nonaka, Kyohei Unno, Heming Sun, Masayuki Goto, Jiro Katto
KDDI Research, Inc. Waseda University, Waseda University, KDDI Research, Inc., KDDI Research, Inc., Yokohama National University, Waseda University, Waseda University
Abstract:
In recent years, the task of learned point cloud compression has gained prominence. An important type of point cloud, the LiDAR point cloud, is generated by spinning LiDAR on vehicles. This process results in numerous circular shapes and azimuthal angle invariance features within the point clouds. However, these two features have been largely overlooked by previous methodologies. In this paper, we introduce a model-agnostic method called Spherical-Coordinate-based learned Point cloud compression (SCP), designed to fully leverage the features of circular shapes and azimuthal angle invariance. Additionally, we propose a multi-level Octree for SCP to mitigate the reconstruction error for distant areas within the Spherical-coordinate-based Octree. SCP exhibits excellent universality, making it applicable to various learned point cloud compression techniques. Experimental results demonstrate that SCP surpasses previous state-of-the-art methods by up to 29.14% in point-to-point PSNR BD-Rate.
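SCP is built on a spherical representation of LiDAR points. The sketch below shows only the standard Cartesian-to-spherical conversion such a pipeline would start from, not the learned compression model or the multi-level Octree.

import numpy as np

def cartesian_to_spherical(points):
    # points: (N, 3) array of x, y, z coordinates from a spinning LiDAR.
    # Returns an (N, 3) array of (radius, azimuth, elevation); circular scan lines
    # become roughly constant-elevation rows in this coordinate system.
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)  # angle around the vertical axis
    elevation = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    return np.stack([r, azimuth, elevation], axis=1)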



Paperid:438
Authors:Chunjie Luo, Fei Luo, Yusen Wang, Enxu Zhao, Chunxia Xiao
School of Computer Science, Wuhan University, Wuhan, China, School of Computer Science, Wuhan University, Wuhan, China, School of Computer Science, Wuhan University, Wuhan, China, School of Computer Science, Wuhan University, Wuhan, China, School of Computer Science, Wuhan University, Wuhan, China
Abstract:
Reconstructing a dynamic human with loose clothing is an important but difficult task. To address this challenge, we propose a method named DLCA-Recon to create human avatars from monocular videos. The distance from loose clothing to the underlying body rapidly changes in every frame when the human freely moves and acts. Previous methods lack effective geometric initialization and constraints for guiding the optimization of deformation to explain this dramatic change, resulting in discontinuous and incomplete reconstructed surfaces. To model the deformation more accurately, we propose to initialize an estimated 3D clothed human in the canonical space, as it is easier for deformation fields to learn from the clothed human than from SMPL. With both representations of explicit mesh and implicit SDF, we utilize the physical connection information between consecutive frames and propose a dynamic deformation field (DDF) to optimize deformation fields. DDF accounts for contributive forces on loose clothing to enhance the interpretability of deformations and effectively capture the free movement of loose clothing. Moreover, we propagate SMPL skinning weights to each individual and refine pose and skinning weights during the optimization to improve skinning transformation. Based on more reasonable initialization and DDF, we can simulate real-world physics more accurately. Extensive experiments on public and our own datasets validate that our method can produce superior results for humans with loose clothing compared to the SOTA methods.



Paperid:439
Authors:Fulin Luo, Xi Chen, Xiuwen Gong, Weiwen Wu, Tan Guo
College of Computer Science, Chongqing University, College of Computer Science, Chongqing University, Faculty of Engineering, The University of Sydney, Department of Biomedical Engineering, Sun-Yat-sen University, School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications
Abstract:
The coded aperture snapshot spectral imaging (CASSI) system is an effective means of hyperspectral snapshot compressive imaging. The core issue of CASSI is to solve the inverse problem for the reconstruction of the hyperspectral image (HSI). In recent years, Transformer-based methods achieve promising performance in HSI reconstruction. However, capturing both long-range dependencies and local information while ensuring reasonable computational costs remains a challenging problem. In this paper, we propose a Transformer-based HSI reconstruction method called dual-window multiscale Transformer (DWMT), which is a coarse-to-fine process, reconstructing the global properties of HSI with the long-range dependencies. In our method, we propose a novel U-Net architecture using a dual-branch encoder to refine pixel information and full-scale skip connections to fuse different features, enhancing the extraction of fine-grained features. Meanwhile, we design a novel self-attention mechanism called dual-window multiscale multi-head self-attention (DWM-MSA), which utilizes two different-sized windows to compute self-attention, capturing long-range dependencies in a local region at different scales to improve the reconstruction performance. We also propose a novel position embedding method for Transformer, named con-abs position embedding (CAPE), which effectively enhances positional information of the HSIs. Extensive experiments on both the simulated and the real data are conducted to demonstrate the superior performance, stability, and generalization ability of our DWMT. Code of this project is at https://github.com/chenx2000/DWMT.



Paperid:440
Authors:Naisong Luo, Rui Sun, Yuwen Pan, Tianzhu Zhang, Feng Wu
Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract:
Automatic mitochondrial segmentation enjoys great popularity with the development of deep learning. However, the coarse predictions produced by previous methods, whether based on 3D CNNs or vision transformers, suggest that their reliance on regular 3D grids leads to a possibly suboptimal feature arrangement. To mitigate this limitation, we attempt to interpret the 3D EM image stacks as a set of interrelated 3D fragments for a better solution. However, it is non-trivial to model the 3D fragments without introducing excessive computational overhead. In this paper, we design a coherent fragment vision transformer (FragViT) combined with affinity learning to manipulate features on 3D fragments yet explore mutual relationships to model fragment-wise context, enjoying a locality prior without sacrificing global reception. The proposed FragViT includes a fragment encoder and a hierarchical fragment aggregation module. The fragment encoder is equipped with affinity heads to transform the tokens into fragments with homogeneous semantics, and multi-layer self-attention is used to explicitly learn inter-fragment relations with long-range dependencies. The hierarchical fragment aggregation module is responsible for hierarchically aggregating fragment-wise predictions back to the final voxel-wise prediction in a progressive manner. Extensive experimental results on the challenging MitoEM, Lucchi, and AC3/AC4 benchmarks demonstrate the effectiveness of the proposed method.



Paperid:441
Authors:Run Luo, Zikai Song, Lintao Ma, Jinlin Wei, Wei Yang, Min Yang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Huazhong University of Science and Technology, Huazhong University of Science and Technology, University of California Santa Barbara, Huazhong University of Science and Technology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Abstract:
Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite the success of these approaches, they also suffer from common problems, such as harmful global or local inconsistency, poor trade-off between robustness and model complexity, and lack of flexibility in different scenes within the same video. In this paper, we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This novel progressive denoising diffusion strategy substantially augments the tracker's effectiveness, enabling it to discriminate between various objects. During the training stage, paired object boxes diffuse from paired ground-truth boxes to a random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. In inference, the model refines a set of paired randomly generated boxes to the detection and tracking results in a flexible one-step or multi-step denoising diffusion process. Extensive experiments on three widely used MOT benchmarks, including MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves competitive performance compared to the current state-of-the-art methods. Code is available at https://github.com/RainBowLuoCS/DiffusionTrack.



Paperid:442
Authors:Shenghong Luo, Xuhang Chen, Weiwen Chen, Zinuo Li, Shuqiang Wang, Chi-Man Pun
University of Macau, University of Macau Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Huizhou University, University of Macau Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, University of Macau Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, University of Macau
Abstract:
Vignetting commonly occurs as a degradation in images resulting from factors such as lens design, improper lens hood usage, and limitations in camera sensors. This degradation affects image details and color accuracy, and presents challenges in computational photography. Existing vignetting removal algorithms predominantly rely on ideal physics assumptions and handcrafted parameters, resulting in the ineffective removal of irregular vignetting and suboptimal results. Moreover, the substantial lack of real-world vignetting datasets hinders the objective and comprehensive evaluation of vignetting removal. To address these challenges, we present VigSet, a pioneering dataset for vignetting removal. VigSet includes 983 pairs of both vignetting and vignetting-free high-resolution (over 4k) real-world images under various conditions. In addition, we introduce DeVigNet, a novel frequency-aware Transformer architecture designed for vignetting removal. Through the Laplacian Pyramid decomposition, we propose the Dual Aggregated Fusion Transformer to handle global features and remove vignetting in the low-frequency domain. Additionally, we propose the Adaptive Channel Expansion Module to enhance details in the high-frequency domain. The experiments demonstrate that the proposed model outperforms existing state-of-the-art methods. The code, models, and dataset are available at https://github.com/CXH-Research/DeVigNet.
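DeVigNet handles vignetting per frequency band after a Laplacian pyramid decomposition. The sketch below shows a generic Laplacian pyramid in PyTorch for intuition; it is the standard decomposition, not the paper's exact implementation or its Transformer modules.

import torch.nn.functional as F

def laplacian_pyramid(img, levels=3):
    # img: (B, C, H, W). Returns [detail_1, ..., detail_L, low_frequency_residual].
    # Global effects such as vignetting dominate the low-frequency residual,
    # while fine details live in the high-frequency bands.
    pyramid, current = [], img
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear", align_corners=False)
        pyramid.append(current - up)   # high-frequency detail at this scale
        current = down
    pyramid.append(current)            # low-frequency residual
    return pyramid

def reconstruct(pyramid):
    # Invert the decomposition by upsampling and adding the details back.
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = F.interpolate(current, size=detail.shape[-2:], mode="bilinear", align_corners=False)
        current = current + detail
    return current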



Paperid:443
Authors:Xiaotong Luo, Zekun Ai, Qiuyuan Liang, Ding Liu, Yuan Xie, Yanyun Qu, Yun Fu
Xiamen University, Xiamen University, Xiamen University, Bytedance, East China Normal University, Xiamen University, Northeastern University
Abstract:
Efficient transformer-based models have made remarkable progress in image super-resolution (SR). Most of these works mainly design elaborate structures to accelerate the inference of the transformer, where all feature tokens are propagated equally. However, they ignore the underlying characteristic of image content, i.e., various image regions have distinct restoration difficulties, especially for large images (2K-8K), failing to achieve adaptive inference. In this work, we propose an adaptive token sparsification transformer (AdaFormer) to speed up the model inference for image SR. Specifically, a texture-relevant sparse attention block with parallel global and local branches is introduced, aiming to integrate informative tokens from the global view instead of only in fixed local windows. Then, an early-exit strategy is designed to progressively halt tokens according to the token importance. To estimate the plausibility of each token, we adopt a lightweight confidence estimator, which is constrained by an uncertainty-guided loss to obtain a binary halting mask over the tokens. Experiments on large images have illustrated that our method reduces latency by nearly 90% compared to SwinIR on Test8K, while maintaining comparable performance.
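The early-exit strategy above relies on a lightweight confidence estimator that produces a halting mask over tokens. A rough sketch of confidence-based token halting at one stage is shown below; the estimator architecture and the threshold are illustrative assumptions rather than AdaFormer's exact design.

import torch
import torch.nn as nn

class TokenHalting(nn.Module):
    # Sketch: estimate per-token confidence and keep only tokens judged hard,
    # so easy (e.g. smooth-region) tokens skip the remaining transformer blocks.
    def __init__(self, dim, threshold=0.5):
        super().__init__()
        self.confidence = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.threshold = threshold

    def forward(self, tokens):  # tokens: (B, N, dim)
        conf = self.confidence(tokens).squeeze(-1)   # (B, N), "easy to restore" score
        keep = conf < self.threshold                 # hard tokens continue onward
        return keep, conf

Halted tokens can be copied to the output unchanged while the kept tokens continue through the remaining blocks; during training the confidence could be supervised by an uncertainty-guided loss as the abstract describes.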



Paperid:444
Authors:Xiaotong Luo, Yuan Xie, Yanyun Qu, Yun Fu
Xiamen University, East China Normal University, Xiamen University, Northeastern University
Abstract:
It is well-known that image quality assessment usually meets the problem of the perception-distortion (p-d) tradeoff. Existing deep image super-resolution (SR) methods focus either on high fidelity with pixel-level objectives or on high perception with generative models. The emergence of diffusion models paves a fresh way for image restoration, which has the potential to offer a brand-new solution to the p-d trade-off. We experimentally observed that perceptual quality and distortion change in opposite directions as the number of sampling steps increases. In light of this property, we propose an adaptive skip diffusion model (SkipDiff), which aims to achieve high-fidelity perceptual image SR with fewer sampling steps. Specifically, it decouples the sampling procedure into coarse skip approximation and fine skip refinement stages. A coarse-grained skip diffusion is first performed as a high-fidelity prior to obtain a latent approximation of the full diffusion. Then, a fine-grained skip diffusion follows to further refine the latent sample for promoting perception, where the fine time steps are adaptively learned by deep reinforcement learning. Meanwhile, this approach also enables faster sampling of the diffusion model by skipping the intermediate denoising process to shorten the effective computation steps. Extensive experimental results show that our SkipDiff achieves superior perceptual quality with plausible reconstruction accuracy and a faster sampling speed.



Paperid:445
Authors:Zhipeng Luo, Gongjie Zhang, Changqing Zhou, Zhonghua Wu, Qingyi Tao, Lewei Lu, Shijian Lu
S-Lab, Nanyang Technological University Black Sesame Technologies, S-Lab, Nanyang Technological University, SenseTime Research, SenseTime Research, SenseTime Research, SenseTime Research, S-Lab, Nanyang Technological University
Abstract:
The task of 3D single object tracking (SOT) with LiDAR point clouds is crucial for various applications, such as autonomous driving and robotics. However, existing approaches have primarily relied on appearance matching or motion modeling within only two successive frames, thereby overlooking the long-range continuous motion property of objects in 3D space. To address this issue, this paper presents a novel approach that views each tracklet as a continuous stream: at each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank, enabling efficient exploitation of sequential information. To achieve effective cross-frame message passing, a hybrid attention mechanism is designed to account for both long-range relation modeling and local geometric feature extraction. Furthermore, to enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed, which uses ground truth tracklets to augment training sequences and promote discrimination against false positives in a contrastive manner. Extensive experiments demonstrate that the proposed method outperforms the state-of-the-art method by significant margins on multiple benchmarks.



Paperid:446
Authors:Changsheng Lv, Mengshi Qi, Xia Li, Zhengyuan Yang, Huadong Ma
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia Beijing University of Posts and Telecommunications, Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, University of Rochester, Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia Beijing University of Posts and Telecommunications
Abstract:
In this paper, we propose a novel model called SGFormer, Semantic Graph TransFormer for point cloud-based 3D scene graph generation. The task aims to parse a point cloud-based scene into a semantic structural graph, with the core challenge of modeling the complex global structure. Existing methods based on graph convolutional networks (GCNs) suffer from the over-smoothing dilemma and can only propagate information from limited neighboring nodes. In contrast, SGFormer uses Transformer layers as the base building block to allow global information passing, with two types of newly-designed layers tailored for the 3D scene graph generation task. Specifically, we introduce the graph embedding layer to best utilize the global information in graph edges while maintaining comparable computation costs. Furthermore, we propose the semantic injection layer to leverage linguistic knowledge from a large-scale language model (i.e., ChatGPT) to enhance objects' visual features. We benchmark our SGFormer on the established 3DSSG dataset and achieve a 40.94% absolute improvement in relationship prediction's R@50 and an 88.36% boost on the subset with complex scenes over the state-of-the-art. Our analyses further show SGFormer's superiority in the long-tail and zero-shot scenarios. Our source code is available at https://github.com/Andy20178/SGFormer.



Paperid:447
Authors:Cheng Lyu, Jiake Xie, Bo Xu, Cheng Lu, Han Huang, Xin Huang, Ming Wu, Chuang Zhang, Yong Tang
Beijing University of Posts and Telecommunications, PicUP.Ai, Xpeng, Xpeng, AI^2 Robotics, Towson University, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, PicUP.Ai
Abstract:
The performance of trimap-free image matting methods is limited when trying to decouple the deterministic and undetermined regions, especially in scenes where foregrounds are semantically ambiguous, chromaless, or of high transmittance. In this paper, we propose a novel framework named Privileged Prior Information Distillation for Image Matting (PPID-IM) that can effectively transfer privileged prior environment-aware information to improve the performance of trimap-free students in solving hard foregrounds. The prior information of the trimap regulates only the teacher model during the training stage, while not being fed into the student network during actual inference. To achieve effective privileged cross-modality (i.e. trimap and RGB) information distillation, we introduce a Cross-Level Semantic Distillation (CLSD) module that reinforces the students with more knowledgeable semantic representations and environment-aware information. We also propose an Attention-Guided Local Distillation module that efficiently transfers privileged local attributes from the trimap-based teacher to trimap-free students for the guidance of local-region optimization. Extensive experiments demonstrate the effectiveness and superiority of our PPID on image matting. The code will be released soon.



Paperid:448
Authors:Boyuan Ma, Xiang Yin, Jing Tan, Yongfeng Chen, Haiyou Huang, Hao Wang, Weihua Xue, Xiaojuan Ban
University of Science and Technology Beijing, University of Science and Technology Beijing, University of Science and Technology Beijing, University of Science and Technology Beijing, University of Science and Technology Beijing, University of Science and Technology Beijing, Liaoning Technical University, University of Science and Technology Beijing
Abstract:
Federated learning collaboratively trains machine learning models among different clients while preserving data privacy and has become the mainstream approach for breaking data silos. However, the non-independent and identically distributed (Non-IID) characteristic of image domains across different clients reduces the benefits of federated learning and has become a bottleneck restricting the accuracy and generalization of federated models. In this work, we propose a novel federated image segmentation method based on style transfer, FedST, which uses a denoising diffusion probabilistic model to achieve feature disentanglement and image synthesis of cross-domain image data between multiple clients. Thus it can share style features among clients while protecting the structure features of image data, which effectively alleviates the influence of the Non-IID phenomenon. Experiments show that our method achieves superior segmentation performance compared to state-of-the-art methods on four different Non-IID datasets in both objective and subjective assessments. The code is available at https://github.com/YoferChen/FedST.



Paperid:449
Authors:Chen Ma, Ningfei Wang, Qi Alfred Chen, Chao Shen
Xi'an Jiaotong University, University of California, Irvine, University of California, Irvine, Xi'an Jiaotong University
Abstract:
In Autonomous Driving (AD), real-time perception is a critical component responsible for detecting surrounding objects to ensure safe driving. While researchers have extensively explored the integrity of AD perception due to its safety and security implications, the aspect of availability (real-time performance) or latency has received limited attention. Existing works on latency-based attacks have focused mainly on object detection, i.e., a component of camera-based AD perception, overlooking the entire camera-based AD perception pipeline, which hinders them from achieving effective system-level effects, such as vehicle crashes. In this paper, we propose SlowTrack, a novel framework for generating adversarial attacks to increase the execution time of camera-based AD perception. We propose a novel two-stage attack strategy along with three new loss function designs. Our evaluation is conducted on four popular camera-based AD perception pipelines, and the results demonstrate that SlowTrack significantly outperforms existing latency-based attacks while maintaining comparable imperceptibility levels. Furthermore, we perform the evaluation on Baidu Apollo, an industry-grade full-stack AD system, and LGSVL, a production-grade AD simulator, with two scenarios to compare the system-level effects of SlowTrack and existing attacks. Our evaluation results show that the system-level effects can be significantly improved, i.e., the vehicle crash rate of SlowTrack is around 95% on average while existing works only reach around 30%.



Paperid:450
Authors:Chenxi Ma
Fudan University
Abstract:
The generative adversarial network (GAN) has become a popular tool in perception-oriented single image super-resolution (SISR) for its excellent capability to hallucinate details. However, the performance of most GAN-based SISR methods is impeded by the limited discriminative ability of their discriminators. Specifically, these discriminators only focus on the global image reconstruction quality and ignore the more fine-grained reconstruction quality for constraining the generator, as they predict the overall realness of an image instead of the pixel-level realness. Here, we first introduce uncertainty into the GAN and propose an Uncertainty-aware GAN (UGAN) to regularize SISR solutions, where the challenging pixels with large reconstruction uncertainty and importance (e.g., texture and edge) are prioritized for optimization. The uncertainty-aware adversarial training strategy enables the discriminator to capture the pixel-level SR uncertainty, which constrains the generator to focus on image areas with high reconstruction difficulty; meanwhile, it improves the interpretability of the SR. To balance the weights of multiple training losses, we introduce an uncertainty-aware loss weighting strategy to adaptively learn the optimal loss weights. Extensive experiments demonstrate the effectiveness of our approach in extracting the SR uncertainty and the superiority of UGAN over state-of-the-art methods in terms of reconstruction accuracy and perceptual quality.
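The abstract describes prioritizing pixels with high reconstruction uncertainty (e.g. textures and edges) during optimization. A minimal illustrative loss in that spirit is sketched below; the weighting scheme is an assumption, not the paper's exact uncertainty-aware formulation.

import torch

def uncertainty_prioritized_l1(sr, hr, uncertainty):
    # sr, hr:      super-resolved and ground-truth images, (B, C, H, W).
    # uncertainty: per-pixel uncertainty map in [0, 1], (B, 1, H, W),
    #              e.g. derived from the discriminator's pixel-level realness.
    l1 = (sr - hr).abs().mean(dim=1, keepdim=True)   # per-pixel reconstruction error
    weight = 1.0 + uncertainty.detach()              # emphasize uncertain pixels
    return (weight * l1).mean()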



Paperid:451
Authors:Fan Ma, Xiaojie Jin, Heng Wang, Jingjia Huang, Linchao Zhu, Yi Yang
Zhejiang University, Bytedance Inc., Bytedance Inc., Bytedance Inc., Zhejiang University, Zhejiang University
Abstract:
Video-language pre-training models have recently achieved remarkable results on various multi-modal downstream tasks. However, most of these models rely on contrastive learning or masked modeling to align global features across modalities, neglecting the local associations between video frames and text tokens. This limits the model's ability to perform fine-grained matching and generalization, especially for tasks that select segments in long videos based on query texts. To address this issue, we propose a novel stitching and matching pretext task for video-language pre-training that encourages fine-grained interactions between modalities. Our task involves stitching video frames or sentences into longer sequences and predicting the positions of cross-modal queries in the stitched sequences. The individual frame and sentence representations are thus aligned via the stitching and matching strategy, encouraging fine-grained interactions between videos and texts. We conduct extensive experiments on various benchmarks covering text-to-video retrieval, video question answering, video captioning, and moment retrieval. Our results demonstrate that the proposed method significantly improves the generalization capacity of video-text pre-training models.



Paperid:452
Authors:Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
University of Science and Technology of China, WeChat, Tencent Inc., WeChat, Tencent Inc., University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract:
Image captioning requires numerous annotated image-text pairs, resulting in substantial annotation costs. Recently, large models (e.g. diffusion models and large language models) have excelled in producing high-quality images and text. This potential can be harnessed to create synthetic image-text pairs for training captioning models. Synthetic data can improve cost and time efficiency in data collection, allow for customization to specific domains, bootstrap generalization capability for zero-shot performance, and circumvent privacy concerns associated with real-world data. However, existing methods struggle to attain satisfactory performance solely through synthetic data. We identify the issue: images generated from simple descriptions mostly capture a solitary perspective with limited context, failing to align with the intricate scenes prevalent in real-world imagery. To tackle this, we present an innovative pipeline that introduces multi-context data generation. Beginning with an initial text corpus, our approach employs a large language model to extract multiple sentences portraying the same scene from diverse viewpoints. These sentences are then condensed into a single sentence with multiple contexts. Subsequently, we generate intricate images using the condensed captions through diffusion models. Our model is exclusively trained on synthetic image-text pairs crafted through this process. The effectiveness of our pipeline is validated through experimental results in both the in-domain and cross-domain settings, where it achieves state-of-the-art performance on well-known datasets such as MSCOCO, Flickr30k, and NoCaps.



Paperid:453
Authors:Wan-Duo Kurt Ma, Avisek Lahiri, J. P. Lewis, Thomas Leung, W. Bastiaan Kleijn
Victoria University of Wellington, Google Research, NVIDIA Research, Google Research, Victoria University of Wellington Google Research
Abstract:
Text-guided diffusion models such as DALL-E 2, Imagen, and Stable Diffusion are able to generate an effectively endless variety of images given only a short text prompt describing the desired image content. In many cases the images are of very high quality. However, these models often struggle to compose scenes containing several key objects such as characters in specified positional relationships. The missing capability to "direct" the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory. In this work, we take a particularly straightforward approach to providing the needed direction. Drawing on the observation that the cross-attention maps for prompt words reflect the spatial layout of objects denoted by those words, we introduce an optimization objective that produces "activation" at desired positions in these cross-attention maps. The resulting approach is a step toward generalizing the applicability of text-guided diffusion models beyond single images to collections of related images, as in storybooks. Directed Diffusion provides easy high-level positional control over multiple objects, while making use of an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. Moreover, it requires only a few lines to implement.
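The optimization objective above pushes a prompt word's cross-attention activation toward a desired region. A minimal sketch of one such objective is shown below; it is an interpretation of the described idea, not the official Directed Diffusion code.

import torch

def directed_attention_loss(attn_map, region_mask, eps=1e-8):
    # attn_map:    (H, W) non-negative cross-attention map for one prompt word.
    # region_mask: (H, W) binary mask marking where that word's object should appear.
    # Minimizing the loss maximizes the fraction of attention mass inside the region.
    total = attn_map.sum().clamp(min=eps)
    inside = (attn_map * region_mask).sum()
    return 1.0 - inside / total

In practice such an objective could be evaluated on the denoiser's cross-attention maps during the early sampling steps, when the scene layout is being established.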



Paperid:454
Authors:Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, Mengxue Kang
Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Intelligent Science Technology Academy of CASIC, Intelligent Science Technology Academy of CASIC
Abstract:
Single object tracking aims to locate the target object in a video sequence according to the state specified by different modal references, including the initial bounding box (BBOX), natural language (NL), or both (NL+BBOX). Due to the gap between different modalities, most existing trackers are designed for one or a subset of these reference settings and over-specialize on the corresponding modality. In contrast, we present a unified tracker called UVLTrack, which can simultaneously handle all three reference settings (BBOX, NL, NL+BBOX) with the same parameters. The proposed UVLTrack enjoys several merits. First, we design a modality-unified feature extractor for joint visual and language feature learning and propose a multi-modal contrastive loss to align the visual and language features into a unified semantic space. Second, a modality-adaptive box head is proposed, which makes full use of the target reference to mine ever-changing scenario features dynamically from video contexts and distinguish the target in a contrastive way, enabling robust performance in different reference settings. Extensive experimental results demonstrate that UVLTrack achieves promising performance on seven visual tracking datasets, three vision-language tracking datasets, and three visual grounding datasets. Codes and models will be open-sourced at https://github.com/OpenSpaceAI/UVLTrack.
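The multi-modal contrastive loss mentioned above aligns visual and language features in a shared semantic space. A generic InfoNCE-style formulation, commonly used for such alignment, is sketched below as an illustration; it is not necessarily UVLTrack's exact loss.

import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(visual_emb, text_emb, temperature=0.07):
    # visual_emb, text_emb: (B, D) embeddings of target regions and their descriptions.
    # Matching pairs (same index) are pulled together; mismatched pairs are pushed apart.
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature               # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))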



Paperid:455
Authors:Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, Qifeng Chen
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, The Hong Kong University of Science and Technology, Hong Kong, Tencent AI Lab, Shenzhen, China, Tencent AI Lab, Shenzhen, China, Shenzhen Institute of Advanced Technology, Chinese Academy of Science, Shenzhen, China, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, The Hong Kong University of Science and Technology, Hong Kong
Abstract:
Generating text-editable and pose-controllable character videos is in pressing demand for creating various digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and of generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image-pose pairs and pose-free videos) and the pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only the keypoint-image pairs are used for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept-composition ability of the pre-trained T2I model. The code and models are available at https://follow-your-pose.github.io/.
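The first stage relies on a zero-initialized convolutional encoder for pose information. A hedged sketch of that idea is shown below; layer sizes and the way the output is consumed by the frozen T2I model are illustrative assumptions, not the released implementation.

```python
# Hedged sketch: a zero-initialized convolutional pose encoder in the spirit of the
# first-stage design described above; channel widths are illustrative.
import torch.nn as nn

class ZeroInitPoseEncoder(nn.Module):
    def __init__(self, in_ch: int = 3, out_ch: int = 320):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )
        # Zero-init the last conv so the pose branch contributes nothing at the start
        # of training and cannot disturb the pre-trained T2I prior.
        nn.init.zeros_(self.body[-1].weight)
        nn.init.zeros_(self.body[-1].bias)

    def forward(self, pose_image):
        return self.body(pose_image)  # assumed to be added to frozen T2I features
```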



Paperid:456
Authors:Zhe Ma, Jianfeng Dong, Shouling Ji, Zhenguang Liu, Xuhong Zhang, Zonghui Wang, Sifeng He, Feng Qian, Xiaobo Zhang, Lei Yang
Zhejiang University, Zhejiang Gongshang University Zhejiang Key Lab of E-Commerce, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Ant Group, Ant Group, Ant Group, Ant Group
Abstract:
Visual retrieval aims to search for the most relevant visual items, e.g., images and videos, from a candidate gallery with a given query item. Accuracy and efficiency are two competing objectives in retrieval tasks. Instead of crafting a new method pursuing further improvement on accuracy, in this paper we propose a multi-teacher distillation framework Whiten-MTD, which is able to transfer knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval. Furthermore, we discover that the similarities obtained by different retrieval models are diversified and incommensurable, which makes it challenging to jointly distill knowledge from multiple models. Therefore, we propose to whiten the output of teacher models before fusion, which enables effective multi-teacher distillation for retrieval models. Whiten-MTD is conceptually simple and practically effective. Extensive experiments on two landmark image retrieval datasets and one video retrieval dataset demonstrate the effectiveness of our proposed method, and its good balance of retrieval performance and efficiency. Our source code is released at https://github.com/Maryeon/whiten_mtd.
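A minimal sketch of the whiten-then-fuse idea described above follows; it assumes the teachers' query-gallery similarity scores are already computed, and the exact whitening statistics used in the paper may differ.

```python
# Hedged sketch (not the released Whiten-MTD code): standardize each teacher's
# similarity scores so that teachers with incommensurable score scales can be
# fused into a single distillation target.
import torch

def whiten(scores: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize a batch of similarity scores to zero mean and unit variance."""
    return (scores - scores.mean()) / (scores.std() + eps)

def fused_teacher_targets(teacher_scores: list) -> torch.Tensor:
    """Whiten each teacher's query-gallery similarities, then average them."""
    return torch.stack([whiten(s) for s in teacher_scores]).mean(dim=0)
```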



Paperid:457
Authors:Zhen-Xiang Ma, Zhen-Duo Chen, Li-Jun Zhao, Zi-Chao Zhang, Xin Luo, Xin-Shun Xu
Shandong University, Shandong University, Shandong University, Shandong University, Shandong University, Shandong University
Abstract:
Recently, a number of Few-Shot Fine-Grained Image Classification (FS-FGIC) methods have been proposed, but they primarily focus on better fine-grained feature extraction while overlooking two important issues. The first one is how to extract discriminative features for Fine-Grained Image Classification tasks while reducing trivial and non-generalizable sample-level noise introduced in this procedure, to overcome the over-fitting problem under the setting of Few-Shot Learning. The second one is how to achieve satisfactory feature matching between limited support and query samples with variable spatial positions and angles. To address these issues, we propose a novel Cross-layer and Cross-sample feature optimization Network for FS-FGIC, C2-Net for short. The proposed method consists of two main modules: Cross-Layer Feature Refinement (CLFR) module and Cross-Sample Feature Adjustment (CSFA) module. The CLFR module further refines the extracted features while integrating outputs from multiple layers to suppress sample-level feature noise interference. Additionally, the CSFA module addresses the feature mismatch between query and support samples through both channel activation and position matching operations. Extensive experiments have been conducted on five fine-grained benchmark datasets, and the results show that the C2-Net outperforms other state-of-the-art methods by a significant margin in most cases. Our code is available at: https://github.com/zenith0923/C2-Net.



Paperid:458
Authors:Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Bowen Zhou
Tsinghua University, Huazhong University of Science and Technology, School of Computer Science and Technology, Huazhong University of Science and Technology, Tsinghua University
Abstract:
As a class of fruitful approaches, diffusion probabilistic models (DPMs) have shown excellent advantages in high-resolution image reconstruction. On the other hand, masked autoencoders (MAEs), as popular self-supervised vision learners, have demonstrated simpler and more effective image reconstruction and transfer capabilities on downstream tasks. However, they all require extremely high training costs, either due to inherent high temporal-dependence (i.e., excessively long diffusion steps) or due to artificially low spatial-dependence (i.e., human-formulated high mask ratio, such as 0.75). To this end, this paper presents LMD, a faster image reconstruction framework with Latent Masking Diffusion. First, we propose to project and reconstruct images in latent space through a pre-trained variational autoencoder, which is theoretically more efficient than in the pixel-based space. Then, we combine the advantages of MAEs and DPMs to design a progressive masking diffusion model, which gradually increases the masking proportion by three different schedulers and reconstructs the latent features from simple to difficult, without sequentially performing denoising diffusion as in DPMs or using a fixed high masking ratio as in MAEs, so as to alleviate the high training time-consumption predicament. Our approach allows for learning high-capacity models, accelerates their training (by 3× or more), and barely reduces the original accuracy. Inference speed on downstream tasks also significantly outperforms previous approaches.



Paperid:459
Authors:Zhiyuan Ma, Guoli Jia, Bowen Zhou
Tsinghua University Shanghai Artificial Intelligence Laboratory, NanKai University, Tsinghua University
Abstract:
With the great success of text-conditioned diffusion models in creative text-to-image generation, various text-driven image editing approaches have attracted the attention of many researchers. However, previous works mainly focus on discreteness-sensitive instructions such as adding, removing or replacing specific objects, background elements or global styles (i.e., “hard editing”), while generally ignoring subject-binding but semantically fine-changing continuity-sensitive instructions such as actions, poses or adjectives, and so on (i.e., “soft editing”), which hampers generative AI from generating user-customized visual contents. To mitigate this predicament, we propose a spatio-temporal guided adaptive editing algorithm AdapEdit, which realizes adaptive image editing by introducing a soft-attention strategy to dynamically vary the guiding degree from the editing conditions to visual pixels from both temporal and spatial perspectives. Note that our approach has a significant advantage in preserving model priors and does not require model training, fine-tuning, extra data, or optimization. We present our results over a wide variety of raw images and editing instructions, demonstrating competitive performance and showing that it significantly outperforms previous approaches. Code is available: https://github.com/AnonymousPony/adap-edit.



Paperid:460
Authors:Huayu Mai, Rui Sun, Yuan Wang, Tianzhu Zhang, Feng Wu
Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract:
Video semantic segmentation has made conspicuous progress thanks to the development of deep learning, but suffers from labor-intensive gathering of annotated training data. To alleviate the data-hunger issue, domain adaptation approaches are developed in the hope of adapting the model trained on the labeled synthetic videos to the real videos in the absence of annotations. By analyzing consistency regularization, the dominant paradigm in the domain adaptation task, we find that the bottlenecks of previous methods lie in the pseudo-labels. To take full advantage of the information contained in the pseudo-labels and empower more effective supervision signals, we propose a coherent PAT network including a target domain focalizer and relation-aware temporal consistency. The proposed PAT network enjoys several merits. First, the target domain focalizer is responsible for paying attention to the target domain, and increasing the accessibility of pseudo-labels in consistency training. Second, the relation-aware temporal consistency aims at modeling the inter-class consistent relationship across frames to equip the model with effective supervision signals. Extensive experimental results on two challenging benchmarks demonstrate that our method performs favorably against state-of-the-art domain adaptive video semantic segmentation methods.



Paperid:461
Authors:Oscar Mañas, Benno Krojer, Aishwarya Agrawal
Mila - Quebec AI Institute Université de Montréal, Mila - Quebec AI Institute McGill University, Mila - Quebec AI Institute Université de Montréal
Abstract:
Eight years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accuracy has been effective so far in the IID evaluation setting. However, our community is undergoing a shift towards open-ended generative models and OOD evaluation. In this new paradigm, the existing VQA Accuracy metric is overly stringent and underestimates the performance of VQA systems. Thus, there is a need to develop more robust automatic VQA metrics that serve as a proxy for human judgment. In this work, we propose to leverage the in-context learning capabilities of instruction-tuned large language models (LLMs) to build a better VQA metric. We formulate VQA evaluation as an answer-rating task where the LLM is instructed to score the accuracy of a candidate answer given a set of reference answers. We demonstrate that the proposed metric correlates better with human judgment compared to existing metrics across several VQA models and benchmarks. We hope wide adoption of our metric will contribute to better estimating the research progress on the VQA task. We plan to release the evaluation code and collected human judgments.
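A hedged sketch of the answer-rating formulation follows. The prompt wording, rating scale, and the `query_llm` wrapper are illustrative assumptions, not the authors' released prompt or API.

```python
# Hedged sketch: scoring a candidate VQA answer against references with an
# instruction-tuned LLM. `query_llm` is a hypothetical function that sends a
# prompt to whatever LLM is available and returns its text reply.
def build_rating_prompt(question: str, references: list, candidate: str) -> str:
    refs = "; ".join(references)
    return (
        "You are rating answers to a visual question.\n"
        f"Question: {question}\n"
        f"Reference answers: {refs}\n"
        f"Candidate answer: {candidate}\n"
        "On a scale of 1 to 5, how accurate is the candidate answer? "
        "Reply with a single number."
    )

def llm_vqa_score(question, references, candidate, query_llm) -> float:
    reply = query_llm(build_rating_prompt(question, references, candidate))
    return float(reply.strip().split()[0]) / 5.0  # normalize the rating to [0, 1]
```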



Paperid:462
Authors:Ruiyu Mao, Ouyang Xu, Yunhui Guo
University of Texas at Dallas, University of Texas at Dallas, University of Texas at Dallas
Abstract:
Active learning, a method to reduce labeling effort for training deep neural networks, is often limited by the assumption that all unlabeled data belong to known classes. This closed-world assumption fails in practical scenarios with unknown classes in the data, leading to active open-set annotation challenges. Existing methods struggle with this uncertainty. We introduce NEAT, a novel, computationally efficient, data-centric active learning approach for open-set data. NEAT differentiates and labels known classes from a mix of known and unknown classes, using a clusterability criterion and a consistency measure that detects inconsistencies between model predictions and feature distribution. In contrast to recent learning-centric solutions, NEAT shows superior performance in active open-set annotation, as our experiments confirm. Additional details on the further evaluation metrics, implementation, and architecture of our method can be found in the public document at https://arxiv.org/pdf/2401.04923.pdf.



Paperid:463
Authors:Ge Meng, Jingjia Huang, Yingying Wang, Zhenqi Fu, Xinghao Ding, Yue Huang
Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University
Abstract:
Pan-sharpening aims to leverage the high-frequency signal of the panchromatic (PAN) image to enhance the resolution of its corresponding multi-spectral (MS) image. However, deep neural networks (DNNs) tend to prioritize learning the low-frequency components during the training process, which limits the restoration of high-frequency edge details in MS images. To overcome this limitation, we treat pan-sharpening as a coarse-to-fine high-frequency restoration problem and propose a novel method for achieving high-quality restoration of edge information in MS images. Specifically, to effectively obtain fine-grained multi-scale contextual features, we design a Band-limited Multi-scale High-frequency Generator (BMHG) that generates high-frequency signals from the PAN image within different bandwidths. During training, higher-frequency signals are progressively injected into the MS image, and corresponding residual blocks are introduced into the network simultaneously. This design enables gradients to flow from later to earlier blocks smoothly, encouraging intermediate blocks to concentrate on missing details. Furthermore, to address the issue of pixel position misalignment arising from multi-scale feature fusion, we propose a Spatial-spectral Implicit Image Function (SIIF) that employs implicit neural representation to effectively represent and fuse spatial and spectral features in the continuous domain. Extensive experiments on different datasets demonstrate that our method outperforms existing approaches in terms of quantitative and visual measurements for high-frequency detail recovery.



Paperid:464
Authors:Runqi Meng, Xiao Zhang, Shijie Huang, Yuning Gu, Guiqin Liu, Guangyu Wu, Nizhuan Wang, Kaicong Sun, Dinggang Shen
ShanghaiTech University, Northwest University; ShanghaiTech University, ShanghaiTech University, ShanghaiTech University, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University School of Medicine, ShanghaiTech University, ShanghaiTech University, ShanghaiTech University; Shanghai United Imaging Intelligence Co., Ltd.; Shanghai Clinical Research and Trial Center
Abstract:
Accurate segmentation of prostate tumors from multi-modal magnetic resonance (MR) images is crucial for the diagnosis and treatment of prostate cancer. However, the robustness of existing segmentation methods is limited, mainly because these methods 1) fail to adaptively assess subject-specific information of each MR modality for accurate tumor delineation, and 2) lack effective utilization of inter-slice information across thick slices in MR images to segment the tumor as a whole 3D volume. In this work, we propose a two-stage neighbor-aware multi-modal adaptive learning network (NaMa) for accurate prostate tumor segmentation from multi-modal anisotropic MR images. In particular, in the first stage, we apply subject-specific multi-modal fusion in each slice by developing a novel modality-informativeness adaptive learning (MIAL) module for selecting and adaptively fusing informative representation of each modality based on inter-modality correlations. In the second stage, we exploit inter-slice feature correlations to derive volumetric tumor segmentation. Specifically, we first use a Unet variant with sequence layers to coarsely capture slice relationships at a global scale, and further generate an activation map for each slice. Then, we introduce an activation mapping guidance (AMG) module to refine slice-wise representation (via information from adjacent slices) for consistent tumor segmentation across neighboring slices. Besides, during the network training, we further apply a random mask strategy to each MR modality to improve feature representation efficiency. Experiments on both in-house and public (PICAI) multi-modal prostate tumor datasets show that our proposed NaMa performs better than state-of-the-art methods.



Paperid:465
Authors:Li Mi, Syrielle Montariol, Javiera Castillo Navarro, Xianjie Dai, Antoine Bosselut, Devis Tuia
EPFL, EPFL, EPFL, EPFL, EPFL, EPFL
Abstract:
Asking questions about visual environments is a crucial way for intelligent agents to understand rich multifaceted scenes, raising the importance of Visual Question Generation (VQG) systems. Apart from being grounded in the image, existing VQG systems can use textual constraints, such as expected answers or knowledge triplets, to generate focused questions. These constraints allow VQG systems to specify the question content or leverage external commonsense knowledge that cannot be obtained from the image content only. However, generating focused questions using textual constraints while enforcing a high relevance to the image content remains a challenge, as VQG systems often ignore one or both forms of grounding. In this work, we propose Contrastive Visual Question Generation (ConVQG), a method using a dual contrastive objective to discriminate questions generated using both modalities from those based on a single one. Experiments on both knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms the state-of-the-art methods and generates image-grounded, text-guided, and knowledge-rich questions. Our human evaluation results also show a preference for ConVQG questions compared to non-contrastive baselines.



Paperid:466
Authors:Wenjun Miao, Guansong Pang, Xiao Bai, Tianqi Li, Jin Zheng
School of Computer Science and Engineering, Beihang University, School of Computing and Information Systems, Singapore Management University, School of Computer Science and Engineering, Beihang University State Key Laboratory of Software Development Environment, Jiangxi Research Institute, Beihang University, School of Computer Science and Engineering, Beihang University, School of Computer Science and Engineering, Beihang University State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Abstract:
Existing out-of-distribution (OOD) methods have shown great success on balanced datasets but become ineffective in long-tailed recognition (LTR) scenarios where 1) OOD samples are often wrongly classified into head classes and/or 2) tail-class samples are treated as OOD samples. To address these issues, current studies fit a prior distribution of auxiliary/pseudo OOD data to the long-tailed in-distribution (ID) data. However, it is difficult to obtain such an accurate prior distribution given the unknowingness of real OOD samples and heavy class imbalance in LTR. A straightforward solution to avoid the requirement of this prior is to learn an outlier class to encapsulate the OOD samples. The main challenge is then to tackle the aforementioned confusion between OOD samples and head/tail-class samples when learning the outlier class. To this end, we introduce a novel calibrated outlier class learning (COCL) approach, in which 1) a debiased large margin learning method is introduced in the outlier class learning to distinguish OOD samples from both head and tail classes in the representation space and 2) an outlier-class-aware logit calibration method is defined to enhance the long-tailed classification confidence. Extensive empirical results on three popular benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate that COCL substantially outperforms existing state-of-the-art OOD detection methods in LTR while being able to improve the classification accuracy on ID data. Code is available at https://github.com/mala-lab/COCL.
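For context, the sketch below shows a generic prior-based logit adjustment of the kind commonly used in long-tailed recognition; COCL's outlier-class-aware calibration is more elaborate, so this is only an illustration of the underlying idea, with all names assumed.

```python
# Hedged sketch: generic class-prior logit calibration for long-tailed recognition;
# not COCL's exact formulation, which additionally accounts for the outlier class.
import torch

def calibrate_logits(logits: torch.Tensor, class_counts: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Subtract tau * log(prior) so that tail classes are not systematically under-confident.
    `logits` has shape (batch, num_id_classes); an outlier-class logit, if present,
    would be handled separately and is left uncalibrated here."""
    prior = class_counts.float() / class_counts.sum()
    return logits - tau * torch.log(prior + 1e-12)
```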



Paperid:467
Authors:Xiangyang Miao, Guobao Xiao, Shiping Wang, Jun Yu
Tongji University Fuzhou University, Tongji University, Fuzhou University, Hangzhou Dianzi University
Abstract:
Correspondence pruning aims to establish reliable correspondences between two related images and recover relative camera motion. Existing approaches often employ a progressive strategy to handle the local and global contexts, with a prominent emphasis on transitioning from local to global, resulting in the neglect of interactions between different contexts. To tackle this issue, we propose a parallel context learning strategy that involves acquiring bilateral consensus for the two-view correspondence pruning task. In our approach, we design a distinctive self-attention block to capture global context and process it in parallel with the established local context learning module, which enables us to simultaneously capture both local and global consensuses. By combining these local and global consensuses, we derive the required bilateral consensus. We also design a recalibration block, reducing the influence of erroneous consensus information and enhancing the robustness of the model. The culmination of our efforts is the Bilateral Consensus Learning Network (BCLNet), which efficiently estimates camera pose and identifies inliers (true correspondences). Extensive experimental results demonstrate that our network not only surpasses state-of-the-art methods on benchmark datasets but also showcases robust generalization abilities across various feature extraction techniques. Notably, BCLNet obtains significant gains over the second-best method on the unknown outdoor dataset and markedly accelerates model training.



Paperid:468
Authors:Roy Miles, Krystian Mikolajczyk
Imperial College London, Imperial College London
Abstract:
In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. In doing so we verify three important design decisions, namely the normalisation, soft maximum function, and projection layers as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet. Code and models are publicly available.
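A hedged sketch of how the three highlighted ingredients (projector, normalisation, soft maximum) could be wired into one feature-distillation loss is shown below; it is illustrative, with all class and parameter names assumed, and is not the authors' released code.

```python
# Hedged sketch: student projector + feature normalisation + softmax-based matching.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedDistillLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int, temperature: float = 1.0):
        super().__init__()
        # Projector on the student side; the paper argues it implicitly encodes
        # information about past examples.
        self.projector = nn.Linear(student_dim, teacher_dim)
        self.temperature = temperature

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        s = F.normalize(self.projector(student_feat), dim=-1)  # normalisation
        t = F.normalize(teacher_feat, dim=-1)
        # Soft maximum over the feature dimension helps bridge a capacity gap.
        return F.kl_div(
            F.log_softmax(s / self.temperature, dim=-1),
            F.softmax(t / self.temperature, dim=-1),
            reduction="batchmean",
        )
```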



Paperid:469
Authors:Zijian Min, Gundu Mohamed Hassan, Geun-Sik Jo
Inha University, Inha University, Inha University
Abstract:
The blind text image deblurring problem presents a formidable challenge, requiring the recovery of a clean and sharp text image from a blurry version with an unknown blur kernel. Sparsity-based strategies have demonstrated their efficacy by emphasizing the sparse priors of the latent image and kernel. However, these existing strategies have largely neglected the influence of additional noise, imposing limitations on their performance. To overcome this limitation, we propose a novel framework designed to effectively mitigate the impact of extensive noise prevalent in blurred images. Our approach centers around a robust Maximum Consensus Framework, wherein we optimize the quantity of interest from the noisy blurry image based on the maximum consensus criterion. Furthermore, we propose the integration of the Alternating Direction Method of Multipliers (ADMM) and the Half-Quadratic Splitting (HQS) method to address the computationally intractable L0 norm problem. This innovative strategy enables improvements in the deblurring performance of blurry text images with additional synthetic noise. Experimental evaluations conducted on various noisy blurry text images demonstrate the superiority of the proposed approach over existing methods.



Paperid:470
Authors:Shankhanil Mitra, Rajiv Soundararajan
Indian Institute of Science, Indian Institute of Science
Abstract:
Perceptual quality assessment of user-generated content (UGC) videos is challenging due to the requirement of large-scale human-annotated videos for training. In this work, we address this challenge by first designing a self-supervised Spatio-Temporal Visual Quality Representation Learning (ST-VQRL) framework to generate robust quality-aware features for videos. Then, we propose a dual-model-based Semi-Supervised Learning (SSL) method specifically designed for the Video Quality Assessment (SSL-VQA) task, through a novel knowledge transfer of quality predictions between the two models. Our SSL-VQA method uses the ST-VQRL backbone to produce robust performances across various VQA datasets including cross-database settings, despite being learned with limited human-annotated videos. Our model improves the state-of-the-art performance when trained only with limited data by around 10%, and by around 15% when unlabelled data is also used in SSL. Source codes and checkpoints are available at https://github.com/Shankhanil006/SSL-VQA.



Paperid:471
Authors:Wentao Mo, Yang Liu
Wangxuan Institute of Computer Technology, Peking University, Wangxuan Institute of Computer Technology, Peking University National Key Laboratory of General Artificial Intelligence, Peking University
Abstract:
In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper the generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are utilized in the ScanQA and SQA datasets). Current approaches resort to supplementing 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing the fine-grained vision-language correlations. To overcome these limitations, our approach utilizes a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines 2D and 3D modalities and captures fine-grained correlations between modalities, allowing them to mutually augment each other. Integrating the mechanisms proposed above, we present BridgeQA, which offers a fresh perspective on multi-modal transformer-based architectures for 3D-VQA. Experiments validate that BridgeQA achieves state-of-the-art performance on 3D-VQA datasets and significantly outperforms existing solutions. Code is available at https://github.com/matthewdm0816/BridgeQA.



Paperid:472
Authors:Bahram Mohammadi, Yicong Hong, Yuankai Qi, Qi Wu, Shirui Pan, Javen Qinfeng Shi
University of Adelaide, Australian National University, Macquarie University, University of Adelaide, Griffith University, University of Adelaide
Abstract:
The vision-and-language navigation (VLN) task requires an agent to perceive the surroundings, follow natural language instructions, and act in photo-realistic unseen environments. Most of the existing methods employ the entire image or object features to represent navigable viewpoints. However, these representations are insufficient for proper action prediction, especially for the REVERIE task, which uses concise high-level instructions, such as “Bring me the blue cushion in the master bedroom”. To enhance these representations, we propose an augmented commonsense knowledge model (ACK) to leverage commonsense information as a spatio-temporal knowledge graph for improving agent navigation. Specifically, the proposed approach involves constructing a knowledge base by retrieving commonsense information from ConceptNet, followed by a refinement module to remove noisy and irrelevant knowledge. ACK consists of knowledge graph-aware cross-modal and concept aggregation modules to enhance visual representation and visual-textual data alignment by integrating visible objects, commonsense knowledge, and concept history, which includes object and knowledge temporal information. Moreover, we add a new pipeline for the commonsense-based decision-making process which leads to more accurate local action prediction. Experimental results demonstrate that our proposed model noticeably outperforms the baseline and achieves state-of-the-art performance on the REVERIE benchmark. The source code is available at https://github.com/Bahram-Mohammadi/ACK.



Paperid:473
Authors:Henrique Morimitsu, Xiaobin Zhu, Xiangyang Ji, Xu-Cheng Yin
University of Science and Technology Beijing, University of Science and Technology Beijing, Tsinghua University, University of Science and Technology Beijing
Abstract:
Optical flow estimation is a challenging task consisting of predicting per-pixel motion vectors between images. Recent methods have employed larger and more complex models to improve the estimation accuracy. However, this impacts the widespread adoption of optical flow methods and makes it harder to train more general models since the optical flow data is hard to obtain. This paper proposes a small and efficient model for optical flow estimation. We design a new spatial recurrent encoder that extracts discriminative features at a significantly reduced size. Unlike standard recurrent units, we utilize Partial Kernel Convolution (PKConv) layers to produce variable multi-scale features with a single shared block. We also design efficient Separable Large Kernels (SLK) to capture large context information with low computational cost. Experiments on public benchmarks show that we achieve state-of-the-art generalization performance while requiring significantly fewer parameters and memory than competing methods. Our model ranks first in the Spring benchmark without finetuning, improving the results by over 10% while requiring an order of magnitude fewer FLOPs and over four times less memory than the following published method without finetuning. The code is available at github.com/hmorimitsu/ptlflow/tree/main/ptlflow/models/rpknet.



Paperid:474
Authors:Andrey Moskalenko, Vlad Shakhuro, Anna Vorontsova, Anton Konushin, Anton Antonov, Alexander Krapukhin, Denis Shepelev, Konstantin Soshin
Samsung Research, Samsung Research, Samsung Research, Samsung Research, Samsung Research, Samsung Research, Samsung Research, Samsung Research
Abstract:
Interactive segmentation methods rely on user inputs to iteratively update the selection mask. A click specifying the object of interest is arguably the most simple and intuitive interaction type, and thereby the most common choice for interactive segmentation. However, user clicking patterns in the interactive segmentation context remain unexplored. Accordingly, interactive segmentation evaluation strategies rely more on intuition and common sense rather than empirical studies (e.g., assuming that users tend to click in the center of the area with the largest error). In this work, we conduct a real-user study to investigate real user clicking patterns. This study reveals that the intuitive assumption made in the common evaluation strategy may not hold. As a result, interactive segmentation models may show high scores in the standard benchmarks, but it does not imply that they would perform well in a real-world scenario. To assess the applicability of interactive segmentation methods, we propose a novel evaluation strategy providing a more comprehensive analysis of a model's performance. To this end, we propose a methodology for finding extreme user inputs by a direct optimization in a white-box adversarial attack on the interactive segmentation model. Based on the performance with such adversarial user inputs, we assess the robustness of interactive segmentation models w.r.t. click positions. Besides, we introduce a novel benchmark for measuring the robustness of interactive segmentation, and report the results of an extensive evaluation of dozens of models.



Paperid:475
Authors:Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan
Peking University Shenzhen Graduate School ARC Lab, Tencent PCG, ARC Lab, Tencent PCG, ARC Lab, Tencent PCG University of Macau Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China, ARC Lab, Tencent PCG, Peking University Shenzhen Graduate School, ARC Lab, Tencent PCG, ARC Lab, Tencent PCG
Abstract:
The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated a strong ability to learn complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate control (e.g., over structure and color) is needed. In this paper, we aim to ``dig out'' the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn low-cost T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications. Our code is available at https://github.com/TencentARC/T2I-Adapter.
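A hedged sketch of the adapter idea is shown below: a small convolutional encoder turns a control signal into multi-scale features that are added to the frozen T2I model's features. Channel widths, the number of scales, and the way features are injected are illustrative assumptions, not the TencentARC implementation.

```python
# Hedged sketch (illustrative, not the released T2I-Adapter code).
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    def __init__(self, in_ch: int = 3, widths=(64, 128, 256)):
        super().__init__()
        blocks, prev = [], in_ch
        for w in widths:
            blocks.append(nn.Sequential(nn.Conv2d(prev, w, 3, stride=2, padding=1), nn.SiLU()))
            prev = w
        self.blocks = nn.ModuleList(blocks)

    def forward(self, control: torch.Tensor) -> list:
        feats, x = [], control
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)  # one feature map per (assumed) U-Net resolution
        return feats

# Assumed training setup: only the adapter's parameters are optimized, while the
# frozen T2I U-Net consumes `unet_feature + adapter_feature` at matching resolutions.
```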



Paperid:476
Authors:Sahal Shaji Mullappilly, Abhishek Singh Gehlot, Rao Muhammad Anwer, Fahad Shahbaz Khan, Hisham Cholakkal
Mohamed bin Zayed University of Artificial Intelligence, Mohamed bin Zayed University of Artificial Intelligence, Mohamed bin Zayed University of Artificial Intelligence, Mohamed bin Zayed University of Artificial Intelligence Linköping University, Mohamed bin Zayed University of Artificial Intelligence
Abstract:
The conventional open-world object detection (OWOD) problem setting first distinguishes known and unknown classes and then later incrementally learns the unknown objects when introduced with labels in the subsequent tasks. However, the current OWOD formulation heavily relies on an external human oracle for knowledge input during the incremental learning stages. Such run-time reliance makes this formulation less realistic for real-world deployment. To address this, we introduce a more realistic formulation, named semi-supervised open-world detection (SS-OWOD), that reduces the annotation cost by casting the incremental learning stages of OWOD in a semi-supervised manner. We demonstrate that the performance of the state-of-the-art OWOD detector dramatically deteriorates in the proposed SS-OWOD setting. Therefore, we introduce a novel SS-OWOD detector, named SS-OWFormer, that utilizes a feature-alignment scheme to better align the object query representations between the original and augmented images to leverage the large unlabeled and few labeled data. We further introduce a pseudo-labeling scheme for unknown detection that exploits the inherent capability of decoder object queries to capture object-specific information. On the COCO dataset, our SS-OWFormer using only 50% of the labeled data achieves detection performance that is on par with the state-of-the-art (SOTA) OWOD detector using 100% of the labeled data. Further, our SS-OWFormer achieves an absolute gain of 4.8% in unknown recall over the SOTA OWOD detector. Lastly, we demonstrate the effectiveness of our SS-OWOD problem setting and approach for remote sensing object detection, proposing carefully curated splits and baseline performance evaluations. Our experiments on four datasets, including MS COCO, PASCAL, Objects365, and DOTA, demonstrate the effectiveness of our approach. Our source code, models and splits are available at https://github.com/sahalshajim/SS-OWFormer.



Paperid:477
Authors:Geraldin Nanfack, Alexander Fulleringer, Jonathan Marty, Michael Eickenberg, Eugene Belilovsky
Concordia University Mila – Quebec AI Institute, Concordia University Mila – Quebec AI Institute, Princeton University, Flatiron Institute, Concordia University Mila – Quebec AI Institute
Abstract:
Feature visualization is one of the most popular techniques used to interpret the internal behavior of individual units of trained deep neural networks. Based on activation maximization, such techniques find synthetic or natural inputs that maximize neuron activations. This paper introduces an optimization framework that aims to deceive feature visualization through adversarial model manipulation. It consists of fine-tuning a pre-trained model with a specifically introduced loss that aims to maintain model performance, while also significantly changing feature visualization. We provide evidence of the success of this manipulation on several pre-trained models for the ImageNet classification task.



Paperid:478
Authors:Zhangkai Ni, Peiqi Yang, Wenhan Yang, Hanli Wang, Lin Ma, Sam Kwong
Tongji University, Tongji University, Peng Cheng Laboratory, Tongji University, Meituan, City Univeristy of Hong Kong
Abstract:
Neural Radiance Fields (NeRF) have demonstrated impressive potential in synthesizing novel views from dense input; however, their effectiveness is challenged when dealing with sparse input. Existing approaches that incorporate additional depth or semantic supervision can alleviate this issue to an extent. However, the process of supervision collection is not only costly but also potentially inaccurate. In our work, we introduce a novel model: the Collaborative Neural Radiance Fields (ColNeRF), designed to work with sparse input. The collaboration in ColNeRF includes the cooperation among sparse input source images and the cooperation among the outputs of the NeRF. Through this, we construct a novel collaborative module that aligns information from various views and meanwhile imposes self-supervised constraints to ensure multi-view consistency in both geometry and appearance. A Collaborative Cross-View Volume Integration module (CCVI) is proposed to capture complex occlusions and implicitly infer the spatial location of objects. Moreover, we introduce self-supervision of target rays projected in multiple directions to ensure geometric and color consistency in adjacent regions. Benefiting from the collaboration at the input and output ends, ColNeRF is capable of capturing richer and more generalized scene representations, thereby facilitating higher-quality novel view synthesis. Our extensive experimental results demonstrate that ColNeRF outperforms state-of-the-art sparse-input generalizable NeRF methods. Furthermore, our approach exhibits superiority in fine-tuning towards adapting to new scenes, achieving competitive performance compared to per-scene optimized NeRF-based methods while significantly reducing computational costs. Our code is available at: https://github.com/eezkni/ColNeRF.



Paperid:479
Authors:Xuesong Nie, Yunfeng Yan, Siyuan Li, Cheng Tan, Xi Chen, Haoyuan Jin, Zhihang Zhu, Stan Z. Li, Donglian Qi
Zhejiang University, Zhejiang University, Zhejiang University Westlake University, Zhejiang University Westlake University, The University of Hong Kong, Zhejiang University, Zhejiang University, Westlake University, Zhejiang University
Abstract:
Spatiotemporal predictive learning is a paradigm that empowers models to learn spatial and temporal patterns by predicting future frames from past frames in an unsupervised manner. This method typically uses recurrent units to capture long-term dependencies, but these units often come with high computational costs and limited performance in real-world scenes. This paper presents an innovative Wavelet-based SpatioTemporal (WaST) framework, which extracts and adaptively controls both low- and high-frequency components at image and feature levels via 3D discrete wavelet transform for faster processing while maintaining high-quality predictions. We propose a Time-Frequency Aware Translator uniquely crafted to efficiently learn short- and long-range spatiotemporal information by individually modeling spatial frequency and temporal variations. Meanwhile, we design a wavelet-domain High-Frequency Focal Loss that effectively supervises high-frequency variations. Extensive experiments across various real-world scenarios, such as driving scene prediction, traffic flow prediction, human motion capture, and weather forecasting, demonstrate that our proposed WaST achieves state-of-the-art performance over various spatiotemporal prediction methods.
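The low/high-frequency split via a 3D discrete wavelet transform can be illustrated with off-the-shelf tooling; the sketch below uses PyWavelets on a dummy clip and does not reproduce the paper's in-network transform, schedulers, or focal loss.

```python
# Hedged sketch: one-level 3D DWT separating a clip into an approximation band
# (low frequency) and seven detail sub-bands (high frequency).
import numpy as np
import pywt

clip = np.random.rand(8, 64, 64)  # (time, height, width), illustrative data

coeffs = pywt.dwtn(clip, wavelet="haar", axes=(0, 1, 2))
low_freq = coeffs["aaa"]                                      # coarse spatiotemporal content
high_freq = {k: v for k, v in coeffs.items() if k != "aaa"}   # 7 detail sub-bands

# A high-frequency focal loss could then weight reconstruction errors on the
# detail sub-bands more heavily than on the approximation band (assumed design).
```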



Paperid:480
Authors:Li Niu, Junyan Cao, Yan Hong, Liqing Zhang
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Given a composite image with a photographic object and a painterly background, painterly image harmonization aims to stylize the composite object to be compatible with the background. Despite the competitive performance of existing painterly harmonization works, they do not fully leverage the painterly objects in artistic paintings. In this work, we explore learning from painterly objects for painterly image harmonization. In particular, we learn a mapping from background style and object information to object style based on painterly objects in artistic paintings. With the learnt mapping, we can hallucinate the target style of the composite object, which is used to harmonize encoder feature maps to produce the harmonized image. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our proposed method.



Paperid:481
Authors:Li Niu, Yan Hong, Junyan Cao, Liqing Zhang
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Painterly image harmonization aims to harmonize a photographic foreground object on the painterly background. Different from previous autoencoder-based harmonization networks, we develop a progressive multi-stage harmonization network, which harmonizes the composite foreground from low-level styles (e.g., color, simple texture) to high-level styles (e.g., complex texture). Our network has better interpretability and harmonization performance. Moreover, we design an early-exit strategy to automatically decide the proper stage to exit, which can skip the unnecessary and even harmful late stages. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our progressive harmonization network.



Paperid:482
Authors:Minyoung Oh, Duhyun Kim, Jae-Young Sim
Ulsan National Institute of Science and Technology, Ulsan National Institute of Science and Technology, Ulsan National Institute of Science and Technology
Abstract:
Collecting and labeling real datasets to train person search networks not only requires a lot of time and effort, but also raises privacy issues. Weakly-supervised and unsupervised domain adaptation methods have been proposed to alleviate the labeling burden for target datasets; however, their generalization capability is limited. We introduce a novel person search method based on the domain generalization framework that uses an automatically labeled unreal dataset only for training but is applicable to arbitrary unseen real datasets. To alleviate the domain gaps when transferring the knowledge from the unreal source dataset to the real target datasets, we estimate the fidelity of person instances, which is then used to train the end-to-end network adaptively. Moreover, we devise a domain-invariant feature learning scheme to encourage the network to suppress the domain-related features. Experimental results demonstrate that the proposed method provides competitive performance compared to existing person search methods even though it is applicable to arbitrary unseen datasets without any prior knowledge or re-training burden.



Paperid:483
Authors:Wenzhe Ouyang, Xiaolin Song, Bailan Feng, Zenglin Xu
Harbin Institute of Technology, Shenzhen, Guandong, China, Huawei Noah's Ark Lab, Beijing, China, Huawei Noah's Ark Lab, Beijing, China, Harbin Institute of Technology, Shenzhen, Guandong, China Peng Cheng Lab, Shenzhen, Guandong, China
Abstract:
3D semantic occupancy has garnered considerable attention due to its abundant structural information encompassing the entire scene in autonomous driving. However, existing 3D occupancy prediction methods contend with the constraint of low-resolution 3D voxel features arising from the limitation of computational memory. To address this limitation and achieve a more fine-grained representation of 3D scenes, we propose OctOcc, a novel octree-based approach for 3D semantic occupancy prediction. OctOcc is conceptually rooted in the observation that the vast majority of 3D space is left unoccupied. Capitalizing on this insight, we endeavor to cultivate memory-efficient high-resolution 3D occupancy predictions by mitigating superfluous cross-attentions. Specifically, we devise a hierarchical octree structure that selectively generates finer-grained cross-attentions solely in potentially occupied regions. Extending our inquiry beyond 3D space, we identify analogous redundancies on the other side of the cross-attention: 2D images. Consequently, a 2D image feature filtering network is conceived to expunge extraneous regions. Experimental results demonstrate that the proposed OctOcc significantly outperforms existing methods on the nuScenes and SemanticKITTI datasets with limited memory consumption.



Paperid:484
Authors:Parth Padalkar, Huaduo Wang, Gopal Gupta
The University of Texas at Dallas, The University of Texas at Dallas, The University of Texas at Dallas
Abstract:
Deep learning models such as CNNs have surpassed human performance in computer vision tasks such as image classification. However, despite their sophistication, these models lack interpretability which can lead to biased outcomes reflecting existing prejudices in the data. We aim to make predictions made by a CNN interpretable. Hence, we present a novel framework called NeSyFOLD to create a neurosymbolic (NeSy) model for image classification tasks. The model is a CNN with all layers following the last convolutional layer replaced by a stratified answer set program (ASP) derived from the last layer kernels. The answer set program can be viewed as a rule-set, wherein the truth value of each predicate depends on the activation of the corresponding kernel in the CNN. The rule-set serves as a global explanation for the model and is interpretable. We also use our NeSyFOLD framework with a CNN that is trained using a sparse kernel learning technique called Elite BackProp (EBP). This leads to a significant reduction in rule-set size without compromising accuracy or fidelity thus improving scalability of the NeSy model and interpretability of its rule-set. Evaluation is done on datasets with varied complexity and sizes. We also propose a novel algorithm for labelling the predicates in the rule-set with meaningful semantic concept(s) learnt by the CNN. We evaluate the performance of our “semantic labelling algorithm” to quantify the efficacy of the semantic labelling for both the NeSy model and the NeSy-EBP model.



Paperid:485
Authors:Wensheng Pan, Timin Gao, Yan Zhang, Xiawu Zheng, Yunhang Shen, Ke Li, Runze Hu, Yutao Liu, Pingyang Dai
Xiamen University, Xiamen University, Xiamen University, Peng Cheng Laboratory, Tencent, Tencent, Beijing Institute of Technology, Ocean University of China, Xiamen University
Abstract:
Blind Image Quality Assessment (BIQA) aims to simulate human assessment of image quality. It requires a large amount of labeled data, which is often insufficient in practice. Some researchers employ unsupervised methods to address this issue, but such methods struggle to emulate the human subjective system. To this end, we introduce a unified framework that combines semi-supervised and incremental learning to address the mentioned issue. Specifically, when training data is limited, semi-supervised learning is necessary to exploit extensive unlabeled data. To facilitate semi-supervised learning, we use knowledge distillation to assign pseudo-labels to unlabeled data, preserving analytical capability. To gradually improve the quality of pseudo-labels, we introduce incremental learning. However, incremental learning can lead to catastrophic forgetting. We employ Experience Replay by selecting representative samples during multiple rounds of semi-supervised learning, to alleviate forgetting and ensure model stability. Experimental results show that the proposed approach achieves state-of-the-art performance across various benchmark datasets. After being trained on the LIVE dataset, our method can be directly transferred to the CSIQ dataset. It significantly outperforms unsupervised methods on the CSIQ dataset with only a marginal performance drop (-0.002) on the LIVE dataset. In conclusion, our proposed method demonstrates its potential to tackle the challenges in real-world production processes.



Paperid:486
Authors:Zhiyi Pan, Nan Zhang, Wei Gao, Shan Liu, Ge Li
SECE, Shenzhen Graduate School, Peking University Peng Cheng Laboratory, SECE, Shenzhen Graduate School, Peking University, SECE, Shenzhen Graduate School, Peking University, Media Laboratory, Tencent, SECE, Shenzhen Graduate School, Peking University
Abstract:
Weak supervision has proven to be an effective strategy for reducing the burden of annotating semantic segmentation tasks in 3D space. However, unconstrained or heuristic weakly supervised annotation forms may lead to suboptimal label efficiency. To address this issue, we propose a novel label recommendation framework for weakly supervised point cloud semantic segmentation. Distinct from pretraining and active learning, the label recommendation framework consists of three stages: inductive bias learning, recommendations for points to be labeled, and point cloud semantic segmentation learning. In practice, we first introduce the point cloud upsampling task to induce an inductive bias from structural information. During the recommendation stage, we present a cross-scene clustering strategy to generate cluster centers as recommended points. Then we introduce a recommended-point-position attention module, LabelAttention, to model long-range dependencies under sparse annotations. Additionally, we employ position encoding to enhance the spatial awareness of semantic features. Throughout the framework, the useful information obtained from inductive bias learning is propagated to subsequent semantic segmentation networks in the form of label positions. Experimental results demonstrate that our framework outperforms weakly supervised point cloud semantic segmentation methods and other label-efficient methods on S3DIS and ScanNetV2, even at an extremely low label rate.



Paperid:487
Authors:Zirui Pan, Mengbai Xiao, Xu Han, Dongxiao Yu, Guanghui Zhang, Yao Liu
Shandong University, Shandong University, Shandong University, Shandong University, Shandong University, Rutgers University
Abstract:
When compressing point clouds, point-based deep learning models operate on points in a continuous space, which can minimize the geometric fidelity loss introduced by voxelization in preprocessing. However, these methods can hardly scale to inputs with arbitrary numbers of points. Furthermore, the point cloud frames are individually compressed, failing to follow the conventional wisdom of leveraging inter-frame similarity. In this work, we propose a patch-wise compression framework called patchDPCC, which consists of a patch group generation module and a point-based compression model. Algorithms are developed to generate patches from different frames representing the same object, and more importantly, these patches are regulated to have the same number of points. We also incorporate a feature transfer module in the compression model, which refines the feature quality by exploiting the inter-frame similarity. Our model generates point-wise features for entropy coding, which guarantees the reconstruction speed. The evaluation on the MPEG 8i dataset shows that our method improves the compression ratio by 47.01% and 85.22% when compared to PCGCv2 and V-PCC with the same reconstruction quality, which is 9% and 16% better than D-DPCC. Our method also achieves the fastest decoding speed among the learning-based compression models.



Paperid:488
Authors:Atharva Pandey, Vishal Yadav, Rajendra Nagar, Santanu Chaudhury
Indian Institute of Technology, Jodhpur, Indian Institute of Technology, Jodhpur, Indian Institute of Technology, Jodhpur, Indian Institute of Technology, Jodhpur
Abstract:
Implicit 3D surface reconstruction of an object from its partial and noisy 3D point cloud scan is a classical geometry processing and 3D computer vision problem. In the literature, various 3D shape representations have been developed, differing in memory efficiency and shape retrieval effectiveness, such as volumetric, parametric, and implicit surfaces. Radial basis functions provide a memory-efficient parameterization of the implicit surface. However, we show that training a neural network using the mean squared error between the ground-truth implicit surface and the linear basis-based implicit surfaces does not converge to the global solution. In this work, we propose locally supported compact radial basis functions for a linear representation of the implicit surface. This representation enables us to generate 3D shapes with arbitrary topologies at any resolution due to their continuous nature. We then propose a neural network architecture for learning the linear implicit shape representation of the 3D surface of an object. We learn linear implicit shapes within a supervised learning framework using ground truth Signed-Distance Field (SDF) data for guidance. The classical strategies face difficulties in finding linear implicit shapes from a given 3D point cloud due to numerical issues (requiring the inverse of a large matrix) in basis and query point selection. The proposed approach achieves better Chamfer distance and comparable F-score compared to the state-of-the-art approach on the benchmark dataset. We also show the effectiveness of the proposed approach by using it for the 3D shape completion task.
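Evaluating a linear implicit surface built from compactly supported radial basis functions can be sketched as below; the Wendland C2 kernel, centers, weights, and support radius are illustrative placeholders, not the paper's learned quantities or exact basis.

```python
# Hedged sketch: implicit value of a compactly supported RBF surface at query points;
# the surface is the zero level set of this function.
import numpy as np

def wendland_c2(r: np.ndarray) -> np.ndarray:
    """Compactly supported kernel: identically zero for r >= 1, smooth inside."""
    r = np.clip(r, 0.0, 1.0)
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)

def implicit_value(query: np.ndarray, centers: np.ndarray, weights: np.ndarray, radius: float) -> np.ndarray:
    """query: (N, 3) points; centers: (M, 3); weights: (M,). Returns (N,) implicit values."""
    d = np.linalg.norm(query[:, None, :] - centers[None, :, :], axis=-1)  # (N, M) distances
    return wendland_c2(d / radius) @ weights
```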



Paperid:489
Authors:Changsong Pang, Xieyuanli Chen, Yimin Liu, Huimin Lu, Yuwei Cheng
Northwestern Polytechnical University ORCA-UBOAT, College of Intelligence Science and Technology, National University of Defense Technology, Tsinghua University, College of Intelligence Science and Technology, National University of Defense Technology, Tsinghua University ORCA-UBOAT
Abstract:
Moving object segmentation (MOS) and Ego velocity estimation (EVE) are vital capabilities for mobile systems to achieve full autonomy. Several approaches have attempted to achieve MOSEVE using a LiDAR sensor. However, LiDAR sensors are typically expensive and susceptible to adverse weather conditions. Instead, millimeter-wave radar (MWR) has gained popularity in robotics and autonomous driving for real applications due to its cost-effectiveness and resilience to bad weather. Nonetheless, publicly available MOSEVE datasets and approaches using radar data are limited. Some existing methods adopt point convolutional networks from LiDAR-based approaches, ignoring the specific artifacts and the valuable radial velocity information of radar measurements, leading to suboptimal performance. In this paper, we propose a novel transformer network that effectively addresses the sparsity and noise issues and leverages the radial velocity measurements of radar points using our devised radar self- and cross-attention mechanisms. Based on that, our method achieves accurate EVE of the robot and performs MOS using only radar data simultaneously. To thoroughly evaluate the MOSEVE performance of our method, we annotated the radar points in the public View-of-Delft (VoD) dataset and additionally constructed a new radar dataset in various environments. The experimental results demonstrate the superiority of our approach over existing state-of-the-art methods. The code is available at https://github.com/ORCAUboat/RadarMOSEVE.



Paperid:490
Authors:Sihwa Park, Seongjun Kim, Doeyoung Kwon, Yohan Jang, In-Seok Song, Seung Jun Baek
Korea University, Korea University, Korea University, Korea University, Korea University Anam Hospital, Korea University
Abstract:
Panoramic radiography (Panoramic X-ray, PX) is a widely used imaging modality for dental examination. However, PX only provides a flattened 2D image, lacking a 3D view of the oral structure. In this paper, we propose NeBLa (Neural Beer-Lambert) to estimate 3D oral structures from real-world PX. NeBLa tackles full 3D reconstruction for varying subjects (patients) where each reconstruction is based only on a single panoramic image. We create an intermediate representation called simulated PX (SimPX) from 3D Cone-beam computed tomography (CBCT) data based on the Beer-Lambert law of X-ray rendering and rotational principles of PX imaging. SimPX aims not only to simulate PX truthfully, but also to facilitate the reverting process back to 3D data. We propose a novel neural model based on ray tracing which exploits both global and local input features to convert SimPX to 3D output. At inference, a real PX image is translated to a SimPX-style image with semantic regularization, and the translated image is processed by the generation module to produce high-quality outputs. Experiments show that NeBLa outperforms the prior state-of-the-art in reconstruction tasks both quantitatively and qualitatively. Unlike prior methods, NeBLa does not require any prior information such as the shape of dental arches, nor a matched PX-CBCT dataset for training, which is difficult to obtain in clinical practice. Our code is available at https://github.com/sihwa-park/nebla.



Paperid:491
Authors:Suho Park, SuBeen Lee, Sangeek Hyun, Hyun Seok Seong, Jae-Pil Heo
Sungkyunkwan University, Sungkyunkwan University, Sungkyunkwan University, Sungkyunkwan University, Sungkyunkwan University
Abstract:
Few-shot segmentation aims to accurately segment novel target objects within query images using only a limited number of annotated support images. The recent works exploit support background as well as its foreground to precisely compute the dense correlations between query and support. However, they overlook the characteristics of the background that generally contains various types of objects. In this paper, we highlight this characteristic of background which can bring problematic cases as follows: (1) when the query and support backgrounds are dissimilar and (2) when objects in the support background are similar to the target object in the query. Without any consideration of the above cases, adopting the entire support background leads to a misprediction of the query foreground as background. To address this issue, we propose Task-disruptive Background Suppression (TBS), a module to suppress those disruptive support background features based on two spatial-wise scores: query-relevant and target-relevant scores. The former aims to mitigate the impact of unshared features solely existing in the support background, while the latter aims to reduce the influence of target-similar support background features. Based on these two scores, we define a query background relevant score that captures the similarity between the backgrounds of the query and the support, and utilize it to scale support background features to adaptively restrict the impact of disruptive support backgrounds. Our proposed method achieves state-of-the-art performance on standard few-shot segmentation benchmarks. Our official code is available at github.com/SuhoPark0706/TBSNet.
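A minimal, hypothetical sketch of the two spatial-wise scores described above (query-relevant and target-relevant) and their use for scaling support background features; this is not the authors' implementation, and the exact way TBS combines the scores may differ.

```python
import torch
import torch.nn.functional as F

def suppress_support_background(sup_bg_feats, query_feats, target_proto):
    """
    sup_bg_feats: (N, C) support background features at N spatial locations
    query_feats:  (M, C) query features at M spatial locations
    target_proto: (C,)   prototype of the target object
    returns scaled support background features (N, C)
    """
    sup = F.normalize(sup_bg_feats, dim=-1)
    qry = F.normalize(query_feats, dim=-1)
    tgt = F.normalize(target_proto, dim=-1)

    # Query-relevant: keep background features that also appear in the query image.
    query_rel = (sup @ qry.t()).max(dim=-1).values            # (N,)

    # Target-relevant: down-weight background features that resemble the target object.
    target_rel = 1.0 - (sup @ tgt).clamp(min=0.0)             # (N,) lower if target-like

    score = (query_rel.clamp(min=0.0) * target_rel).unsqueeze(-1)
    return sup_bg_feats * score
```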



Paperid:492
Authors:Wenjie Pei, Tongqi Xia, Fanglin Chen, Jinsong Li, Jiandong Tian, Guangming Lu
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Shenzhen Jiang & Associates Creative Design Co., Ltd, CAS, Harbin Institute of Technology, Shenzhen
Abstract:
As a prominent parameter-efficient fine-tuning technique in NLP, prompt tuning is now being explored for its potential in computer vision. Typical methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP, which represents an input image as a flattened sequence of token embeddings and then learns a set of unordered parameterized tokens prefixed to the sequence representation as the visual prompts for task adaptation of large vision models. While such a sequential modeling paradigm of visual prompts has shown great promise, there are two potential limitations. First, the learned visual prompts cannot model the underlying spatial relations in the input image, which is crucial for image encoding. Second, since all prompt tokens play the same prompting role for all image tokens without distinction, they lack fine-grained prompting capability, i.e., individual prompting for different image tokens. In this work, we propose the Spatially Aligned-and-Adapted Visual Prompt model (SA^2VP), which learns a two-dimensional prompt token map with equal (or scaled) size to the image token map, thereby being able to spatially align with the image map. Each prompt token is designated to prompt knowledge only for the spatially corresponding image tokens. As a result, our model can conduct individual prompting for different image tokens in a fine-grained manner. Moreover, benefiting from the capability of preserving the spatial structure by the learned prompt token map, our SA^2VP is able to model the spatial relations in the input image, leading to more effective prompting. Extensive experiments on three challenging benchmarks for image classification demonstrate the superiority of our model over other state-of-the-art methods for visual prompt tuning. Code is available at https://github.com/tommy-xq/SA2VP.



Paperid:493
Authors:Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao
Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory
Abstract:
Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results but at a high computational cost and requiring a large amount of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, by leveraging the power of off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D ControlNet model, aiming to strengthen conditional generation accuracy by additionally leveraging the bi-directional frames in the temporal domain. Our method exhibits superior performance in terms of frame consistency, clip score, and conditional accuracy, outperforming other compared methods.



Paperid:494
Authors:Dezhi Peng, Chongyu Liu, Yuliang Liu, Lianwen Jin
South China University of Technology INTSIG-SCUT Joint Lab of Document Image Analysis and Recognition, South China University of Technology, Huazhong University of Science and Technology, South China University of Technology SCUT-Zhuhai Institute of Modern Industrial Innovation INTSIG-SCUT Joint Lab of Document Image Analysis and Recognition
Abstract:
Scene text removal (STR) aims at replacing text strokes in natural scenes with visually coherent backgrounds. Recent STR approaches rely on iterative refinements or explicit text masks, resulting in high complexity and sensitivity to the accuracy of text localization. Moreover, most existing STR methods adopt convolutional architectures while the potential of vision Transformers (ViTs) remains largely unexplored. In this paper, we propose a simple-yet-effective ViT-based text eraser, dubbed ViTEraser. Following a concise encoder-decoder framework, ViTEraser can easily incorporate various ViTs to enhance long-range modeling. Specifically, the encoder hierarchically maps the input image into the hidden space through ViT blocks and patch embedding layers, while the decoder gradually upsamples the hidden features to the text-erased image with ViT blocks and patch splitting layers. As ViTEraser implicitly integrates text localization and inpainting, we propose a novel end-to-end pretraining method, termed SegMIM, which focuses the encoder and decoder on the text box segmentation and masked image modeling tasks, respectively. Experimental results demonstrate that ViTEraser with SegMIM achieves state-of-the-art performance on STR by a substantial margin and exhibits strong generalization ability when extended to other tasks, e.g., tampered scene text detection. Furthermore, we comprehensively explore the architecture, pretraining, and scalability of the ViT-based encoder-decoder for STR, which provides deep insights into the application of ViT to the STR field. Code is available at https://github.com/shannanyinxiang/ViTEraser.



Paperid:495
Authors:Jinlong Peng, Zekun Luo, Liang Liu, Boshen Zhang
Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab
Abstract:
Image harmonization aims to generate a more realistic appearance of foreground and background for a composite image. All the existing methods perform the same harmonization process for the whole foreground. However, the implanted foreground always contains different appearance patterns. Existing solutions ignore the difference of each color block and lose some specific details. Therefore, we propose a novel global-local two-stage framework for Fine-grained Region-aware Image Harmonization (FRIH). In the first stage, the whole input foreground mask is used to make a global coarse-grained harmonization. In the second stage, we adaptively cluster the input foreground mask into several submasks. Each submask and the coarsely adjusted image are concatenated respectively and fed into a lightweight cascaded module, refining the global harmonization result. Moreover, we further design a fusion prediction module to generate the final result, comprehensively utilizing the harmonization results at different degrees of refinement. Without bells and whistles, our FRIH achieves competitive performance on the iHarmony4 dataset with a lightweight model.
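An assumed sketch (not the paper's code) of the second-stage idea of adaptively splitting the foreground into color-based submasks; the clustering variable count and the use of plain k-means on RGB values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def foreground_submasks(image, fg_mask, n_clusters=4):
    """image: (H, W, 3) float array of the composite; fg_mask: (H, W) boolean foreground mask."""
    fg_pixels = image[fg_mask]                                    # (N, 3) foreground colors
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(fg_pixels)
    ys, xs = np.nonzero(fg_mask)
    submasks = np.zeros((n_clusters,) + fg_mask.shape, dtype=bool)
    submasks[labels, ys, xs] = True                               # one submask per color cluster
    return submasks

# Toy usage: a random composite with a square foreground region.
rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
print(foreground_submasks(img, mask).sum(axis=(1, 2)))           # pixel count per submask
```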



Paperid:496
Authors:Kunyu Peng, Cheng Yin, Junwei Zheng, Ruiping Liu, David Schneider, Jiaming Zhang, Kailun Yang, M. Saquib Sarfraz, Rainer Stiefelhagen, Alina Roitberg
Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, School of Robotics, Hunan University, Mercedes-Benz Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Institute for Artificial Intelligence, University of Stuttgart
Abstract:
In real-world scenarios, human actions often fall outside the distribution of training data, making it crucial for models to recognize known actions and reject unknown ones. However, using pure skeleton data in such open-set conditions poses challenges due to the lack of visual background cues and the distinct sparse structure of body pose sequences. In this paper, we tackle the unexplored Open-Set Skeleton-based Action Recognition (OS-SAR) task and formalize the benchmark on three skeleton-based datasets. We assess the performance of seven established open-set approaches on our task and identify their limits and critical generalization issues when dealing with skeleton information. To address these challenges, we propose a distance-based cross-modality ensemble method that leverages the cross-modal alignment of skeleton joints, bones, and velocities to achieve superior open-set recognition performance. We refer to the key idea as CrossMax - an approach that utilizes a novel cross-modality mean max discrepancy suppression mechanism to align latent spaces during training and a cross-modality distance-based logits refinement method during testing. CrossMax outperforms existing approaches and consistently yields state-of-the-art results across all datasets and backbones. We will release the benchmark, code, and models to the community.



Paperid:497
Authors:Renyuan Peng, Xinyue Cai, Hang Xu, Jiachen Lu, Feng Wen, Wei Zhang, Li Zhang
Fudan University, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Fudan University, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Fudan University
Abstract:
Understanding road structures is crucial for autonomous driving. Intricate road structures are often depicted using lane graphs, which include centerline curves and connections forming a Directed Acyclic Graph (DAG). Accurate extraction of lane graphs relies on precisely estimating vertex and edge information within the DAG. Recent research highlights Transformer-based language models' impressive sequence prediction abilities, making them effective for learning graph representations when graph data are encoded as sequences. However, existing studies focus mainly on modeling vertices explicitly, leaving edge information simply embedded in the network. Consequently, these approaches fall short in the task of lane graph extraction. To address this, we introduce LaneGraph2Seq, a novel approach for lane graph extraction. It leverages a language model with vertex-edge encoding and connectivity enhancement. Our serialization strategy includes a vertex-centric depth-first traversal and a concise edge-based partition sequence. Additionally, we use classifier-free guidance combined with nucleus sampling to improve lane connectivity. We validate our method on prominent datasets, nuScenes and Argoverse 2, showcasing consistent and compelling results. Our LaneGraph2Seq approach demonstrates superior performance compared to state-of-the-art techniques in lane graph extraction.
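A small illustrative sketch of serializing a lane DAG with a vertex-centric depth-first traversal, as mentioned above; the token names and output format are assumptions, not the paper's exact vocabulary.

```python
def serialize_lane_graph(vertices, edges):
    """
    vertices: dict vertex_id -> (x, y)
    edges:    dict vertex_id -> list of successor vertex_ids (a DAG)
    returns a flat token sequence.
    """
    roots = [v for v in vertices if all(v not in succ for succ in edges.values())]
    tokens, visited = [], set()

    def dfs(v):
        if v in visited:
            tokens.append(("REF", v))      # revisit: emit a reference instead of coordinates
            return
        visited.add(v)
        x, y = vertices[v]
        tokens.append(("VERTEX", round(x, 1), round(y, 1)))
        for nxt in edges.get(v, []):
            dfs(nxt)
        tokens.append(("UP",))             # backtrack marker closing this branch

    for r in roots:
        dfs(r)
    return tokens

# Toy lane graph: 0 -> 1 -> 2 and 1 -> 3
print(serialize_lane_graph({0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (2, 1)},
                           {0: [1], 1: [2, 3]}))
```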



Paperid:498
Authors:Wenshuo Peng, Kaipeng Zhang, Yue Yang, Hao Zhang, Yu Qiao
Shanghai AI Laboratory, Shanghai AI Laboratory, Shanghai AI Laboratory Shanghai Jiao Tong University, Shanghai AI Laboratory Xi'an Jiaotong University, Shanghai AI Laboratory
Abstract:
Vision-language foundation models have been incredibly successful in a wide range of downstream computer vision tasks using adaptation methods. However, due to the high cost of obtaining pre-training datasets, pairs with weak image-text correlation exist in the data in large numbers. We call them weak-paired samples. Due to the limitations of these weak-paired samples, the pre-trained model is unable to mine all the knowledge from the pre-training data. The existing adaptation methods do not consider the missing knowledge, which may lead to crucial task-related knowledge for the downstream tasks being ignored. To address this issue, we propose a new adaptation framework called Data Adaptive Traceback (DAT). Specifically, we utilize a zero-shot-based method to extract the most downstream task-related subset of the pre-training data to benefit the downstream tasks. Furthermore, we adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning. We conduct extensive experiments showing that our proposed DAT approach meaningfully improves performance on various benchmark datasets over traditional adaptation methods.



Paperid:499
Authors:Zelin Peng, Zhengqin Xu, Zhilin Zeng, Xiaokang Yang, Wei Shen
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Segment Anything Model (SAM) has received remarkable attention as it offers a powerful and versatile solution for object segmentation in images. However, fine-tuning SAM for downstream segmentation tasks under different scenarios remains a challenge, as the varied characteristics of different scenarios naturally require diverse model parameter spaces. Most existing fine-tuning methods attempt to bridge the gaps among different scenarios by introducing a set of new parameters to modify SAM's original parameter space. Unlike these works, in this paper, we propose fine-tuning SAM efficiently by parameter space reconstruction (SAM-PARSER), which introduces nearly zero trainable parameters during fine-tuning. In SAM-PARSER, we assume that SAM's original parameter space is relatively complete, so that its bases are able to reconstruct the parameter space of a new scenario. We obtain the bases by matrix decomposition, and fine-tune the coefficients to reconstruct the parameter space tailored to the new scenario through an optimal linear combination of the bases. Experimental results show that SAM-PARSER exhibits superior segmentation performance across various scenarios, while reducing the number of trainable parameters by approximately 290 times compared with current parameter-efficient fine-tuning methods.
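A minimal sketch of the "reconstruct a parameter space from fixed bases" idea: decompose a frozen weight matrix with SVD, keep the singular vectors as bases, and fine-tune only the coefficients (singular values). The class name and the exact decomposition used in SAM-PARSER are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReparamLinear(nn.Module):
    def __init__(self, pretrained_linear: nn.Linear):
        super().__init__()
        W = pretrained_linear.weight.data                  # (out, in) frozen pretrained weight
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("U", U)                        # frozen bases
        self.register_buffer("Vh", Vh)
        self.register_buffer("bias", pretrained_linear.bias.data.clone())
        self.coeff = nn.Parameter(S.clone())                # the only trainable parameters

    def forward(self, x):
        W = self.U @ torch.diag(self.coeff) @ self.Vh       # reconstructed weight
        return F.linear(x, W, self.bias)

layer = ReparamLinear(nn.Linear(256, 128))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 128 trainable coefficients
```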



Paperid:500
Authors:Yayun Qi, Wentian Zhao, Xinxiao Wu
Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology Shenzhen MSU-BIT University
Abstract:
Unsupervised image captioning aims to generate descriptions of images without relying on any image-sentence pairs for training. Most existing works use detected visual objects or concepts as a bridge to connect images and texts. Considering that the relationship between objects carries more information, we use the object relationship as a more accurate connection between images and texts. In this paper, we adapt the idea of distant supervision, which extracts knowledge about object relationships from an external corpus and imparts it to images to facilitate inferring visual object relationships, without introducing any extra pre-trained relationship detectors. Based on these learned informative relationships, we construct pseudo image-sentence pairs for captioning model training. Specifically, our method consists of three modules: (1) a relationship learning module that learns to infer relationships from images under the distant supervision; (2) a relationship-to-sentence module that transforms the inferred relationships into sentences to generate pseudo image-sentence pairs; (3) an image captioning module that is trained by using the generated image-sentence pairs. Promising results on three datasets show that our method outperforms the state-of-the-art methods of unsupervised image captioning.



Paperid:501
Authors:Zhaobo Qi, Yibo Yuan, Xiaowen Ruan, Shuhui Wang, Weigang Zhang, Qingming Huang
Harbin Institute of Technology, Weihai, China, Harbin Institute of Technology, Weihai, China, Harbin Institute of Technology, Weihai, China, Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS, Beijing, China, Harbin Institute of Technology, Weihai, China, University of Chinese Academy of Sciences, Beijing, China
Abstract:
Temporal Sentence Grounding in Video (TSGV) is troubled by the dataset bias issue, which is caused by the uneven temporal distribution of the target moments for samples with similar semantic components in input videos or query texts. Existing methods resort to utilizing prior knowledge about bias to artificially break this uneven distribution, which only removes a limited amount of significant language biases. In this work, we propose the bias-conflict sample synthesis and adversarial removal debias strategy (BSSARD), which dynamically generates bias-conflict samples by explicitly leveraging potentially spurious correlations between single-modality features and the temporal position of the target moments. Through adversarial training, its bias generators continuously introduce biases and generate bias-conflict samples to deceive its grounding model. Meanwhile, the grounding model continuously eliminates the introduced biases, which requires it to model multi-modality alignment information. BSSARD will cover most kinds of coupling relationships and disrupt language and visual biases simultaneously. Extensive experiments on Charades-CD and ActivityNet-CD demonstrate the promising debiasing capability of BSSARD. Source codes are available at https://github.com/qzhb/BSSARD.



Paperid:502
Authors:Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang
Academy for Engineering and Technology, Fudan University, Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University
Abstract:
We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to the continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics prove that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Codes and dataset are available at https://github.com/qiantianwen/NuScenes-QA.
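An illustrative sketch (not the released toolkit) of generating question-answer pairs programmatically from detection-style annotations using manually designed templates; the template wording and annotation schema are assumptions.

```python
from collections import Counter

TEMPLATES = [
    ("How many {category}s are there?", "count"),
    ("Are there any moving {category}s?", "exist_moving"),
]

def generate_qa(objects):
    """objects: list of dicts like {"category": "car", "moving": True} from scene annotations."""
    qa_pairs = []
    counts = Counter(o["category"] for o in objects)
    for category in counts:
        for template, qtype in TEMPLATES:
            question = template.format(category=category)
            if qtype == "count":
                answer = str(counts[category])
            else:
                answer = "yes" if any(o["category"] == category and o["moving"]
                                      for o in objects) else "no"
            qa_pairs.append((question, answer))
    return qa_pairs

print(generate_qa([{"category": "car", "moving": True},
                   {"category": "car", "moving": False},
                   {"category": "pedestrian", "moving": True}]))
```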



Paperid:503
Authors:Zhipeng Qian, Yiwei Ma, Jiayi Ji, Xiaoshuai Sun
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Abstract:
Referring 3D instance segmentation is a challenging task aimed at accurately segmenting a target instance within a 3D scene based on a given referring expression. However, previous methods have overlooked the distinct roles played by different words in referring expressions. Additionally, they have failed to incorporate the positional relationship within referring expressions with the spatial correlations in 3D scenes. To alleviate these issues, we present a novel model called XRefSeg3D, which constructs a cross-modal graph for the input 3D scene and unites textual and spatial relationships for reasoning via graph neural networks. Our approach begins by capturing object-specific text features, which are then fused with the instance features to construct a comprehensive cross-modal scene graph. Subsequently, we integrate the obtained cross-modal features into graph neural networks, leveraging the K-nearest algorithm to derive explicit instructions from expressions and factual relationships in scenes. This enables the effective capture of higher-order relationships among instances, thereby enhancing feature fusion and facilitating reasoning. Finally, the refined feature undergoes a matching module to compute the ultimate matching score. Experimental results on ScanRefer demonstrate the effectiveness of our method, surpassing previous approaches by a substantial margin of +3.67% in terms of mIOU.



Paperid:504
Authors:Yuming Qiao, Fanyi Wang, Jingwen Su, Yanhao Zhang, Yunjie Yu, Siyu Wu, Guo-Jun Qi
Tsinghua University OPPO Research Institute, OPPO Research Institute, OPPO Research Institute, OPPO Research Institute, OPPO Research Institute, Zhejiang University, OPPO Research Institute Westlake University
Abstract:
Image editing approaches with diffusion models have been rapidly developed, yet their applicability is subject to requirements such as specific editing types (e.g., foreground or background object editing, style transfer), multiple conditions (e.g., mask, sketch, caption), and time-consuming fine-tuning of diffusion models. To alleviate these limitations and realize efficient real image editing, we propose a novel editing technique that only requires an input image and target text for various editing types, including non-rigid edits, without fine-tuning the diffusion model. Our method contains three novelties: (I) Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target text embedding to achieve fast image reconstruction without an image caption and to accelerate convergence. (II) Progressive Transition Scheme applies progressive linear interpolation between the target text embedding and its fine-tuned version to generate a transition embedding for maintaining non-rigid editing capability. (III) Balanced Attention Module (BAM) balances the trade-off between textual description and image semantics. By combining the self-attention map from the reconstruction process and the cross-attention map from the transition process, the guidance of target text embeddings in the diffusion process is optimized. To demonstrate the editing capability, effectiveness, and efficiency of the proposed BARET, we have conducted extensive qualitative and quantitative experiments. Moreover, results derived from a user study and an ablation study further prove its superiority over other methods.



Paperid:505
Authors:Minghan Qin, Yifan Liu, Yuelang Xu, Xiaochen Zhao, Yebin Liu, Haoqian Wang
Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua Shenzhen International Graduate School, Tsinghua University
Abstract:
One crucial aspect of 3D head avatar reconstruction lies in the details of facial expressions. Although recent NeRF-based photo-realistic 3D head avatar methods achieve high-quality avatar rendering, they still encounter challenges in retaining intricate facial expression details because they overlook the potential of specific expression variations at different spatial positions when conditioning the radiance field. Motivated by this observation, we introduce a novel Spatially-Varying Expression (SVE) conditioning. The SVE can be obtained by a simple MLP-based generation network, encompassing both spatial positional features and global expression information. Benefiting from rich and diverse information of the SVE at different positions, the proposed SVE-conditioned NeRF can deal with intricate facial expressions and achieve realistic rendering and geometry details of high-fidelity 3D head avatars. Additionally, to further elevate the geometric and rendering quality, we introduce a new coarse-to-fine training strategy, including a geometry initialization strategy at the coarse stage and an adaptive importance sampling strategy at the fine stage. Extensive experiments indicate that our method outperforms other state-of-the-art (SOTA) methods in rendering and geometry quality on mobile phone-collected and public datasets. Code and data can be found at https://github.com/minghanqin/AvatarSVE.



Paperid:506
Authors:Yiming Qin, Nanxuan Zhao, Bin Sheng, Rynson W.H. Lau
Shanghai Jiao Tong University City University of Hong Kong, Adobe Research, Shanghai Jiao Tong University, City University of Hong Kong
Abstract:
Regenerating urban layout is an essential process for urban regeneration. In this paper, we propose a new task called text-driven urban layout regeneration, which provides an intuitive input modality - text - for users to specify the regeneration, instead of designing complex rules. Given the target region to be regenerated, we propose a one-stage text-driven urban layout regeneration model, Text2City, to jointly and progressively regenerate the urban layout (i.e., road and building layouts) based on textual layout descriptions and surrounding context (i.e., urban layouts and functions of the surrounding regions). Text2City first extracts road and building attributes from the textual layout description to guide the regeneration. It includes a novel one-stage joint regenerator network based on the conditioned denoising diffusion probabilistic models (DDPMs) and prior knowledge exchange. To harmonize the regenerated layouts through joint optimization, we propose the interactive & enhanced guidance module for self-enhancement and prior knowledge exchange between road and building layouts during the regeneration. We also design a series of constraints at the attribute, geometry, and pixel levels to ensure rational urban layout generation. To train our model, we build a large-scale dataset containing urban layouts and layout descriptions, covering 147K regions. Qualitative and quantitative evaluations show that our proposed method outperforms the baseline methods in regenerating desirable urban layouts that meet the textual descriptions.



Paperid:507
Authors:Changqing Qiu, Fusheng Jin, Yining Zhang
Beijing Institute of Technology, Beijing Institute of Technology, Peking University
Abstract:
Recently, the explanation of neural network models has garnered considerable research attention. In computer vision, CAM (Class Activation Map)-based methods and the LRP (Layer-wise Relevance Propagation) method are two common explanation methods. However, since most CAM-based methods can only generate global weights, they can only generate coarse-grained explanations at a deep layer. LRP and its variants, on the other hand, can generate fine-grained explanations. But the faithfulness of the explanations is too low. To address these challenges, in this paper, we propose FG-CAM (Fine-Grained CAM), which extends CAM-based methods to enable generating fine-grained and high-faithfulness explanations. FG-CAM uses the relationship between two adjacent layers of feature maps with resolution differences to gradually increase the explanation resolution, while finding the contributing pixels and filtering out the pixels that do not contribute. Our method not only solves the shortcoming of CAM-based methods without changing their characteristics, but also generates fine-grained explanations that have higher faithfulness than LRP and its variants. We also present FG-CAM with denoising, which is a variant of FG-CAM and is able to generate less noisy explanations with almost no change in explanation faithfulness. Experimental results show that the performance of FG-CAM is almost unaffected by the explanation resolution. FG-CAM outperforms existing CAM-based methods significantly in both shallow and intermediate layers, and outperforms LRP and its variants significantly in the input layer. Our code is available at https://github.com/dongmo-qcq/FG-CAM.



Paperid:508
Authors:Liuxiang Qiu, Si Chen, Yan Yan, Jing-Hao Xue, Da-Han Wang, Shunzhi Zhu
Xiamen University, China Xiamen University of Technology, China, Xiamen University of Technology, China, Xiamen University, China, University College London, UK, Xiamen University of Technology, China, Xiamen University of Technology, China
Abstract:
Visible-infrared person re-identification (VI-ReID) aims to retrieve images of the same persons captured by visible (VIS) and infrared (IR) cameras. Existing VI-ReID methods ignore high-order structure information of features and find it relatively difficult to learn a reasonable common feature space due to the large modality discrepancy between VIS and IR images. To address the above problems, we propose a novel high-order structure based middle-feature learning network (HOS-Net) for effective VI-ReID. Specifically, we first leverage a short- and long-range feature extraction (SLE) module to effectively exploit both short-range and long-range features. Then, we propose a high-order structure learning (HSL) module to successfully model the high-order relationship across different local features of each person image based on a whitened hypergraph network. This greatly alleviates model collapse and enhances feature representations. Finally, we develop a common feature space learning (CFL) module to learn a discriminative and reasonable common feature space based on middle features generated by aligning features from different modalities and ranges. In particular, a modality-range identity-center contrastive (MRIC) loss is proposed to reduce the distances between the VIS, IR, and middle features, smoothing the training process. Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets show that our HOS-Net achieves superior state-of-the-art performance. Our code is available at https://github.com/Jaulaucoeng/HOS-Net.



Paperid:509
Authors:Longtian Qiu, Shan Ning, Xuming He
ShanghaiTech University, ShanghaiTech University, ShanghaiTech University Shanghai Engineering Research Center of Intelligent Vision and Imaging
Abstract:
Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis on the CLIP latent space which leads to two findings. Firstly, we observe that CLIP's visual features of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. In addition, we show that the modality gap between a paired image-text can be empirically modeled as a zero-mean Gaussian distribution. Motivated by the findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation to leverage local region information, which produces a compact visual representation for matching text representation. Moreover, we incorporate a noise injection and CLIP reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap.
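A minimal sketch of the noise-injection idea implied by the Gaussian modality-gap finding: during text-only training, zero-mean Gaussian noise is added to the CLIP text embedding standing in for the image feature, so that the visual embedding at inference falls inside the training distribution. The standard-deviation value and normalization details are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def inject_modality_gap_noise(text_embed, std=0.016):
    """text_embed: (B, D) CLIP text features used in place of image features during training."""
    noisy = text_embed + std * torch.randn_like(text_embed)
    return F.normalize(noisy, dim=-1)          # keep features on the unit sphere

# During text-only training the caption decoder would be conditioned on
# inject_modality_gap_noise(text_feat); at inference it is conditioned on the
# normalized image feature instead.
```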



Paperid:510
Authors:Zexuan Qiu, Jiahong Liu, Yankai Chen, Irwin King
The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong
Abstract:
Existing unsupervised deep product quantization methods primarily aim for the increased similarity between different views of the identical image, whereas the delicate multi-level semantic similarities preserved between images are overlooked. Moreover, these methods predominantly focus on the Euclidean space for computational convenience, compromising their ability to map the multi-level semantic relationships between images effectively. To mitigate these shortcomings, we propose a novel unsupervised product quantization method dubbed Hierarchical Hyperbolic Product Quantization (HiHPQ), which learns quantized representations by incorporating hierarchical semantic similarity within hyperbolic geometry. Specifically, we propose a hyperbolic product quantizer, where the hyperbolic codebook attention mechanism and the quantized contrastive learning on the hyperbolic product manifold are introduced to expedite quantization. Furthermore, we propose a hierarchical semantics learning module, designed to enhance the distinction between similar and non-matching images for a query by utilizing the extracted hierarchical semantics as an additional training supervision. Experiments on benchmark image datasets show that our proposed method outperforms state-of-the-art baselines.



Paperid:511
Authors:Jiahui Qu, Jie He, Wenqian Dong, Jingyu Zhao
State Key Laboratory of Integrated Service Network, Xidian University, Xi’an 710071, China, State Key Laboratory of Integrated Service Network, Xidian University, Xi’an 710071, China, State Key Laboratory of Integrated Service Network, Xidian University, Xi’an 710071, China, State Key Laboratory of Integrated Service Network, Xidian University, Xi’an 710071, China
Abstract:
Hyperspectral image super-resolution (HISR) is a technique that can break through the limitation of the imaging mechanism to obtain a hyperspectral image (HSI) with high spatial resolution. Although some progress has been achieved by existing methods, most of them directly learn the spatial-spectral joint mapping between the observed images and the target high-resolution HSI (HrHSI), failing to fully reserve the spectral distribution of low-resolution HSI (LrHSI) and the spatial distribution of high-resolution multispectral imagery (HrMSI). To this end, we propose a spatial-spectral-bilateral cycle-diffusion framework (S2CycleDiff) for HISR, which can step-wise generate the HrHSI with high spatial-spectral fidelity by learning the conditional distribution of spatial and spectral super-resolution processes bilaterally. Specifically, a customized conditional cycle-diffusion framework is designed as the backbone to achieve the spatial-spectral-bilateral super-resolution by repeated refinement, wherein the spatial/spectral guided pyramid denoising (SGPD) module separately takes HrMSI and LrHSI as the guiding factors to achieve the spatial details injection and spectral correction. The outputs of the conditional cycle-diffusion framework are fed into a complementary fusion block to integrate the spatial and spectral details to generate the desired HrHSI. Experiments have been conducted on three widely used datasets to demonstrate the superiority of the proposed method over state-of-the-art HISR methods. The code is available at https://github.com/Jiahuiqu/S2CycleDiff.



Paperid:512
Authors:Qiang Qu, Yiran Shen, Xiaoming Chen, Yuk Ying Chung, Tongliang Liu
The University of Sydney, Shandong University, Beijing Technology and Business University, The University of Sydney, The University of Sydney
Abstract:
Bio-inspired event cameras, or dynamic vision sensors, are capable of asynchronously capturing per-pixel brightness changes (called event-streams) with high temporal resolution and high dynamic range. However, the non-structural spatial-temporal event-streams make it challenging to provide intuitive visualization with rich semantic information for human vision. This calls for events-to-video (E2V) solutions, which take event-streams as input and generate high-quality video frames for intuitive visualization. However, current solutions are predominantly data-driven without considering the prior knowledge of the underlying statistics relating event-streams and video frames. They rely heavily on the non-linearity and generalization capability of deep neural networks and thus struggle to reconstruct detailed textures when the scenes are complex. In this work, we propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events. This approach leverages a model-aided deep learning framework, underpinned by a theory-inspired E2V model, which is meticulously derived from the fundamental imaging principles of event cameras. To deal with the issue of state-reset in the recurrent components of E2HQV, we also design a temporal shift embedding module to further improve the quality of the video frames. Comprehensive evaluations on the real world event camera datasets validate our approach, with E2HQV notably outperforming state-of-the-art approaches, e.g., surpassing the second best by over 40% for some evaluation metrics.



Paperid:513
Authors:Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, Anton van den Hengel
Amazon, Amazon, Amazon, Amazon
Abstract:
Inferring the 3D structure of a non-rigid dynamic scene from a single moving camera is an under-constrained problem. Inspired by the remarkable progress of neural radiance fields (NeRFs) in photo-realistic novel view synthesis of static scenes, it has also been extended to dynamic settings. Such methods heavily rely on implicit neural priors to regularize the problem. In this work, we take a step back and investigate how current implementations may entail deleterious effects including limited expressiveness, entanglement of light and density fields, and sub-optimal motion localization. Further, we devise a factorisation-based framework that represents the scene as a composition of bandlimited, high-dimensional signals. We demonstrate compelling results across complex dynamic scenes that involve changes in lighting, texture and long-range dynamics.



Paperid:514
Authors:Qi Rao, Ke Sun, Xiaohan Wang, Qi Wang, Bang Zhang
University of Technology Sydney, Alibaba, Stanford University, Alibaba, Alibaba
Abstract:
Continuous sign language recognition (CSLR) aims to recognize gloss sequences from continuous sign videos. Recent works enhance the gloss representation consistency by mining correlations between visual and contextual modules within individual sentences. However, much richer correlations remain among glosses across different sentences. In this paper, we present a simple yet effective Cross-Sentence Gloss Consistency (CSGC), which enforces glosses belonging to the same category to be more consistent in representation than those belonging to different categories, across all training sentences. Specifically, in CSGC, a prototype is maintained for each gloss category and benefits the gloss discrimination in a contrastive way. Thanks to the well-distinguished gloss prototype, an auxiliary similarity classifier is devised to enhance the recognition clues, thus yielding more accurate results. Extensive experiments conducted on three CSLR datasets show that our proposed CSGC significantly boosts the performance of CSLR, surpassing existing state-of-the-art works by large margins (i.e., 1.6% on PHOENIX14, 2.4% on PHOENIX14-T, and 5.7% on CSL-Daily).
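A generic sketch of prototype-based cross-sentence consistency: each gloss category keeps an exponential-moving-average prototype, and gloss features are pulled toward their own prototype and pushed away from the others. The temperature, momentum, and exact loss form in CSGC are assumptions.

```python
import torch
import torch.nn.functional as F

class GlossPrototypes(torch.nn.Module):
    def __init__(self, num_glosses, dim, momentum=0.9, temperature=0.1):
        super().__init__()
        self.register_buffer("protos", F.normalize(torch.randn(num_glosses, dim), dim=-1))
        self.momentum, self.temperature = momentum, temperature

    def forward(self, feats, gloss_ids):
        """feats: (N, D) gloss-level features; gloss_ids: (N,) category indices."""
        feats = F.normalize(feats, dim=-1)
        logits = feats @ self.protos.t() / self.temperature   # (N, num_glosses)
        loss = F.cross_entropy(logits, gloss_ids)              # contrastive pull/push via prototypes

        with torch.no_grad():                                   # EMA prototype update
            for f, g in zip(feats, gloss_ids):
                p = self.momentum * self.protos[g] + (1 - self.momentum) * f
                self.protos[g] = F.normalize(p, dim=0)
        return loss

loss = GlossPrototypes(num_glosses=10, dim=32)(torch.randn(8, 32), torch.randint(0, 10, (8,)))
print(loss.item())
```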



Paperid:515
Authors:Haziq Razali, Yiannis Demiris
Imperial College London, Imperial College London
Abstract:
Learning to forecast bimanual object manipulation sequences from unimanual observations has broad applications in assistive robots and augmented reality. This challenging task requires us to first infer motion from the missing arm and the object it would have been manipulating were the person bimanual, then forecast the human and object motion while maintaining hand-object contact during manipulation. Previous attempts model the hand-object interactions only implicitly, and thus tend to produce unrealistic motion where the objects float in air. We address this with a novel neural network that (i) identifies and forecasts the pose for only the objects undergoing motion through an object motion module and (ii) refines human pose predictions by encouraging hand-object contact during manipulation through an ensemble of human pose predictors. The components are also designed to be generic enough for use in both unimanual and bimanual contexts. Our approach outperforms the state-of-the-art pose forecasting methods on bimanual manipulation datasets.



Paperid:516
Authors:Zhiyao Ren, Yibing Zhan, Liang Ding, Gaoang Wang, Chaoyue Wang, Zhongyi Fan, Dacheng Tao
The University of Sydney, JD Explore Academy, JD Explore Academy, Zhejiang University, The University of Sydney, JD Explore Academy, The University of Sydney
Abstract:
Denoising Diffusion Probabilistic Models (DDPMs) have achieved significant success in generation tasks. Nevertheless, the exposure bias issue, i.e., the natural discrepancy between the training (the output of each step is calculated individually by a given input) and inference (the output of each step is calculated based on the input iteratively obtained based on the model), harms the performance of DDPMs. To our knowledge, few works have tried to tackle this issue by modifying the training process for DDPMs, but they still perform unsatisfactorily due to 1) partially modeling the discrepancy and 2) ignoring the prediction error accumulation. To address the above issues, in this paper, we propose a multi-step denoising scheduled sampling (MDSS) strategy to alleviate the exposure bias for DDPMs. Analyzing the formulations of the training and inference of DDPMs, MDSS 1) comprehensively considers the discrepancy influence of prediction errors on the output of the model (the Gaussian noise) and the output of the step (the calculated input signal of the next step), and 2) efficiently models the prediction error accumulation by using multiple iterations of a mathematical formulation initialized from one-step prediction error obtained from the model. The experimental results, compared with previous works, demonstrate that our approach is more effective in mitigating exposure bias in DDPM, DDIM, and DPM-solver. In particular, MDSS achieves an FID score of 3.86 in 100 sample steps of DDIM on the CIFAR-10 dataset, whereas the second best obtains 4.78. The code will be available on GitHub.



Paperid:517
Authors:Yi Rong, Haoran Zhou, Lixin Yuan, Cheng Mei, Jiahao Wang, Tong Lu
Nanjing University, Nanjing University, Nanjing University, Nanjing University, Nanjing University, Nanjing University
Abstract:
Point cloud completion is an indispensable task for recovering complete point clouds due to incompleteness caused by occlusion, limited sensor resolution, etc. The family of coarse-to-fine generation architectures has recently exhibited great success in point cloud completion and gradually became mainstream. In this work, we unveil one of the key ingredients behind these methods: meticulously devised feature extraction operations with explicit cross-resolution aggregation. We present Cross-Resolution Transformer that efficiently performs cross-resolution aggregation with local attention mechanisms. With the help of our recursive designs, the proposed operation can capture more scales of features than common aggregation operations, which is beneficial for capturing fine geometric characteristics. While prior methodologies have ventured into various manifestations of inter-level cross-resolution aggregation, the effectiveness of intra-level one and their combination has not been analyzed. With unified designs, Cross-Resolution Transformer can perform intra- or inter-level cross-resolution aggregation by switching inputs. We integrate two forms of Cross-Resolution Transformers into one up-sampling block for point generation, and following the coarse-to-fine manner, we construct CRA-PCN to incrementally predict complete shapes with stacked up-sampling blocks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin on several widely used benchmarks. Codes are available at https://github.com/EasyRy/CRA-PCN.



Paperid:518
Authors:Bardia Safaei, Vibashan VS, Celso M. de Melo, Vishal M. Patel
Johns Hopkins University, Johns Hopkins University, DEVCOM Army Research Laboratory, Johns Hopkins University
Abstract:
Active Learning (AL) aims to enhance the performance of deep models by selecting the most informative samples for annotation from a pool of unlabeled data. Despite impressive performance in closed-set settings, most AL methods fail in real-world scenarios where the unlabeled data contains unknown categories. Recently, a few studies have attempted to tackle the AL problem for the open-set setting. However, these methods focus more on selecting known samples and do not efficiently utilize unknown samples obtained during AL rounds. In this work, we propose an Entropic Open-set AL (EOAL) framework which leverages both known and unknown distributions effectively to select informative samples during AL rounds. Specifically, our approach employs two different entropy scores. One measures the uncertainty of a sample with respect to the known-class distributions. The other measures the uncertainty of the sample with respect to the unknown-class distributions. By utilizing these two entropy scores we effectively separate the known and unknown samples from the unlabeled data resulting in better sampling. Through extensive experiments, we show that the proposed method outperforms existing state-of-the-art methods on CIFAR-10, CIFAR-100, and TinyImageNet datasets. Code is available at https://github.com/bardisafa/EOAL.
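A simplified sketch of the two uncertainty terms mentioned above: a closed-set entropy over known classes and an entropy over soft assignments to unknown-sample clusters. How EOAL estimates the unknown distributions and combines the scores for the final selection is more involved, so treat this only as an illustration.

```python
import torch
import torch.nn.functional as F

def entropy(p, eps=1e-8):
    return -(p * (p + eps).log()).sum(dim=-1)

def open_set_entropy_scores(known_logits, feats, unknown_centers):
    """
    known_logits:    (N, K) closed-set classifier logits over K known classes
    feats:           (N, D) sample features
    unknown_centers: (C, D) cluster centers estimated from samples flagged as unknown
    """
    closed_entropy = entropy(F.softmax(known_logits, dim=-1))   # uncertainty w.r.t. known classes

    # Soft assignment of each sample to the unknown clusters via negative distances.
    dist = torch.cdist(feats, unknown_centers)                   # (N, C)
    unknown_entropy = entropy(F.softmax(-dist, dim=-1))          # uncertainty w.r.t. unknown distributions

    return closed_entropy, unknown_entropy

ce, ue = open_set_entropy_scores(torch.randn(6, 10), torch.randn(6, 32), torch.randn(4, 32))
print(ce.shape, ue.shape)  # torch.Size([6]) torch.Size([6])
```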



Paperid:519
Authors:Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, Gal Chechik
Bar-Ilan University, Ramat-Gan, Israel OriginAI, Tel-Aviv, Israel, OriginAI, Tel-Aviv, Israel, Bar-Ilan University, Ramat-Gan, Israel, OriginAI, Tel-Aviv, Israel, Bar-Ilan University, Ramat-Gan, Israel NVIDIA Research, Tel-Aviv, Israel
Abstract:
Text-to-image diffusion models can synthesize high-quality images, but they have various limitations. Here we highlight a common failure mode of these models, namely, generating uncommon concepts and structured concepts like hand palms. We show that their limitation is partly due to the long-tail nature of their training data: web-crawled data sets are strongly unbalanced, causing models to under-represent concepts from the tail of the distribution. We characterize the effect of unbalanced training data on text-to-image models and offer a remedy. We show that rare concepts can be correctly generated by carefully selecting suitable generation seeds in the noise space, using a small reference set of images, a technique that we call SeedSelect. SeedSelect does not require retraining or fine-tuning the diffusion model. We assess the faithfulness, quality and diversity of SeedSelect in creating rare objects and generating complex formations like hand images, and find it consistently achieves superior performance. We further show the advantage of SeedSelect in semantic data augmentation. Generating semantically appropriate images can successfully improve performance in few-shot recognition benchmarks, for classes from the head and from the tail of the training data of diffusion models.
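A rough sketch of the seed-selection idea: score a pool of candidate noise seeds by how close the generated image is to a small reference set in a feature space, and keep the best seed. Here `generate` and `encode` are placeholders the caller supplies (e.g., a diffusion sampler and an image encoder); the actual SeedSelect optimizes in the noise space rather than brute-force searching.

```python
import torch
import torch.nn.functional as F

def select_seed(prompt, candidate_seeds, reference_feats, generate, encode):
    """
    reference_feats: (R, D) features of a few reference images of the rare concept
    generate(prompt, seed) -> image tensor;  encode(image) -> (D,) feature
    """
    reference_feats = F.normalize(reference_feats, dim=-1)
    best_seed, best_score = None, -float("inf")
    for seed in candidate_seeds:
        feat = F.normalize(encode(generate(prompt, seed)), dim=0)
        score = (reference_feats @ feat).mean().item()   # mean cosine similarity to the references
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed, best_score
```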



Paperid:520
Authors:Divya Saxena, Jiannong Cao, Jiahao Xu, Tarun Kulshrestha
The Hong Kong Polytechnic University, Hong Kong, The Hong Kong Polytechnic University, Hong Kong, University of Nevada, Reno, USA, The Hong Kong Polytechnic University, Hong Kong
Abstract:
Training Generative Adversarial Networks (GANs) to generate high-quality images typically requires large datasets. Network pruning during training has recently emerged as a significant advancement for data-efficient GAN. However, simple and straightforward pruning can lead to the risk of losing key information, resulting in suboptimal results due to GAN’s competitive dynamics between generator (G) and discriminator (D). Addressing this, we present RG-GAN, a novel approach that marks the first incorporation of dynamic weight regeneration and pruning in GAN training to improve the quality of the generated samples, even with limited data. Specifically, RG-GAN initiates layer-wise dynamic pruning by removing weights that are less important to the quality of the generated images. While pruning enhances efficiency, excessive sparsity within layers can pose a risk of model collapse. To mitigate this issue, RG-GAN applies a dynamic regeneration method to reintroduce specific weights when they become important, ensuring a balance between sparsity and image quality. Though effective, the sparse network achieved through this process might eliminate some weights important to the combined G and D performance, a crucial aspect for achieving stable and effective GAN training. RG-GAN addresses this loss of weights by integrating learned sparse network weights back into the dense network at the previous stage during a follow-up regeneration step. Our results consistently demonstrate RG-GAN’s robust performance across a variety of scenarios, including different GAN architectures, datasets, and degrees of data scarcity, reinforcing its value as a generic training methodology. Results also show that data augmentation exhibits improved performance in conjunction with RG-GAN. Furthermore, RG-GAN can achieve fewer parameters without compromising, and even enhancing, the quality of the generated samples. Code can be found at this link: https://github.com/IntellicentAI-Lab/RG-GAN
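A generic prune-and-regrow sketch in the spirit described above: prune the smallest-magnitude weights layer by layer, then regenerate (re-enable) pruned positions whose gradients become large. The importance criteria and schedules actually used by RG-GAN are assumptions here.

```python
import torch

def prune_layer(weight, sparsity):
    """Return a 0/1 mask keeping the largest-magnitude weights."""
    k = int(weight.numel() * (1.0 - sparsity))                       # number of weights to keep
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

def regrow(mask, grad, n_regrow):
    """Re-enable the n_regrow pruned positions with the largest gradient magnitude."""
    pruned_grad = grad.abs() * (1.0 - mask)
    idx = pruned_grad.flatten().topk(n_regrow).indices
    mask = mask.clone().flatten()
    mask[idx] = 1.0
    return mask.view_as(grad)

w = torch.randn(4, 4, requires_grad=True)
mask = prune_layer(w.data, sparsity=0.5)
loss = (w ** 2).sum()                        # toy objective just to obtain gradients
loss.backward()
mask = regrow(mask, w.grad, n_regrow=2)
print(int(mask.sum()))                        # 8 kept + 2 regrown = 10 active weights
```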



Paperid:521
Authors:Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Michael Felsberg
East China Normal University, Shanghai Jiao Tong University, ETS Montreal, Linköping University
Abstract:
The dot-product self-attention (DPSA) is a fundamental component of transformers. However, scaling it to long sequences, like documents or high-resolution images, becomes prohibitively expensive due to the quadratic time and memory complexities arising from the softmax operation. Kernel methods are employed to simplify computations by approximating softmax but often lead to performance drops compared to softmax attention. We propose SeTformer, a novel transformer where DPSA is purely replaced by Self-optimal Transport (SeT) for achieving better performance and computational efficiency. SeT is based on two essential softmax properties: maintaining a non-negative attention matrix and using a nonlinear reweighting mechanism to emphasize important tokens in input sequences. By introducing a kernel cost function for optimal transport, SeTformer effectively satisfies these properties. In particular, with small and base-sized models, SeTformer achieves impressive top-1 accuracies of 84.7% and 86.2% on ImageNet-1K. In object detection, SeTformer-base outperforms the FocalNet counterpart by +2.2 mAP, using 38% fewer parameters and 29% fewer FLOPs. In semantic segmentation, our base-size model surpasses NAT by +3.5 mIoU with 33% fewer parameters. SeTformer also achieves state-of-the-art results in language modeling on the GLUE benchmark. These findings highlight SeTformer's applicability to vision and language tasks.
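A generic sketch of replacing softmax attention with a transport plan computed by Sinkhorn iterations; this illustrates the general optimal-transport-as-attention idea rather than SeTformer's exact kernel cost or normalization.

```python
import torch

def sinkhorn_attention(q, k, v, n_iters=5, eps=0.1):
    """q, k: (B, N, D); v: (B, N, Dv). Returns (B, N, Dv)."""
    logits = torch.einsum("bnd,bmd->bnm", q, k) / q.shape[-1] ** 0.5
    log_plan = logits / eps
    plan = (log_plan - log_plan.amax(dim=(-2, -1), keepdim=True)).exp()  # stable Gibbs kernel
    for _ in range(n_iters):                        # alternating row/column normalization
        plan = plan / plan.sum(dim=-1, keepdim=True)
        plan = plan / plan.sum(dim=-2, keepdim=True)
    plan = plan / plan.sum(dim=-1, keepdim=True)    # rows sum to one, like attention weights
    return torch.einsum("bnm,bmd->bnd", plan, v)

out = sinkhorn_attention(torch.randn(2, 8, 16), torch.randn(2, 8, 16), torch.randn(2, 8, 32))
print(out.shape)  # torch.Size([2, 8, 32])
```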



Paperid:522
Authors:Kai Shang, Mingwen Shao, Chao Wang, Yuanshuo Cheng, Shuigen Wang
School of Computer Science and Technology, China University of Petroleum (East China) Shandong Institute of Petroleum and Chemical Technology, School of Computer Science and Technology, China University of Petroleum (East China), ReLER, AAII, University of Technology Sydney, School of Computer Science and Technology, China University of Petroleum (East China), Yantai IRay Technologies Lt. Co.
Abstract:
Diffusion models have achieved remarkable progress in low-light image enhancement. However, there remain two practical limitations: (1) existing methods mainly focus on the spatial domain for the diffusion process, while neglecting the essential features in the frequency domain; (2) the conventional patch-based sampling strategy inevitably leads to severe checkerboard artifacts due to the uneven overlapping. To address these limitations in one go, we propose a Multi-Domain Multi-Scale (MDMS) diffusion model for low-light image enhancement. In particular, we introduce a spatial-frequency fusion module to seamlessly integrate spatial and frequency information. By leveraging the Multi-Domain Learning (MDL) paradigm, our proposed model is endowed with the capability to adaptively facilitate noise distribution learning, thereby enhancing the quality of the generated images. Meanwhile, we propose a Multi-Scale Sampling (MSS) strategy that follows a divide-ensemble manner by merging the restored patches under different resolutions. Such a multi-scale learning paradigm explicitly derives patch information from different granularities, thus leading to smoother boundaries. Furthermore, we empirically adopt the Bright Channel Prior (BCP), which indicates natural statistical regularity, as an additional restoration guidance. Experimental results on the LOL and LOLv2 datasets demonstrate that our method achieves state-of-the-art performance for the low-light image enhancement task. Codes are available at https://github.com/Oliiveralien/MDMS.



Paperid:523
Authors:Hao Shao, Yang Zhang, Qibin Hou
VCIP, School of Computer Science, Nankai University, Department of Genetics and Cell Biology, College of Life Sciences, Nankai University, VCIP, School of Computer Science, Nankai University
Abstract:
We present a new boundary-sensitive framework for polyp segmentation, termed Polyper. Our method is motivated by a clinical approach in which seasoned medical practitioners often leverage the inherent features of interior polyp regions to tackle blurred boundaries. Inspired by this, we propose to explicitly leverage boundary regions to bolster the model's boundary discrimination capability while minimizing computational resource wastage. Our approach first extracts low-confidence boundary regions and high-confidence prediction regions from an initial segmentation map through differentiable morphological operators. Then, we design the boundary-sensitive attention that concentrates on augmenting the features near the boundary regions using the high-confidence prediction region's characteristics to generate good segmentation results. Our proposed method can be seamlessly integrated with classical encoder networks, like ResNet-50, MiT-B1, and Swin Transformer. To evaluate the effectiveness of Polyper, we conduct experiments on five publicly available challenging datasets, and achieve state-of-the-art performance on all of them. Code is available at https://github.com/haoshao-nku/medical_seg.git.



Paperid:524
Authors:Shuai Shao, Yu Bai, Yan Wang, Baodi Liu, Bin Liu
Zhejiang Lab, Zhejiang Lab China University of Petroleum (East China), Beihang University, China University of Petroleum (East China), Zhejiang Lab
Abstract:
Open-World Few-Shot Learning (OFSL) is a crucial research field dedicated to accurately identifying target samples in scenarios where data is limited and labels are unreliable. This research holds significant practical implications and is highly relevant to real-world applications. Recently, the advancements in foundation models like CLIP and DINO have showcased their robust representation capabilities even in resource-constrained settings with scarce data. This realization has brought about a transformative shift in focus, moving away from “building models from scratch” towards “effectively harnessing the potential of foundation models to extract pertinent prior knowledge suitable for OFSL and utilizing it sensibly”. Motivated by this perspective, we introduce the Collaborative Consortium of Foundation Models (CO3), which leverages CLIP, DINO, GPT-3, and DALL-E to collectively address the OFSL problem. CO3 comprises four key blocks: (1) the Label Correction Block (LC-Block) corrects unreliable labels, (2) the Data Augmentation Block (DA-Block) enhances available data, (3) the Feature Extraction Block (FE-Block) extracts multi-modal features, and (4) the Text-guided Fusion Adapter (TeFu-Adapter) integrates multiple features while mitigating the impact of noisy labels through semantic constraints. Only the adapter's parameters are adjustable, while the others remain frozen. Through collaboration among these foundation models, CO3 effectively unlocks their potential and unifies their capabilities to achieve state-of-the-art performance on multiple benchmark datasets. https://github.com/The-Shuai/CO3.



Paperid:525
Authors:Gil Shapira, Yosi Keller
Samsung Bar-Ilan University, Bar Ilan University
Abstract:
In set-based face recognition, we aim to compute the most discriminative descriptor from an unbounded set of images and videos showing a single person. A discriminative descriptor balances two policies when aggregating information from a given set. The first is a quality-based policy: emphasizing high-quality and down-weighting low-quality images. The second is a diversity-based policy: emphasizing unique images in the set and down-weighting multiple occurrences of similar images as found in video clips, which can overwhelm the set representation. This work frames face-set representation as a differentiable coreset selection problem. Our model learns how to select a small coreset of the input set that balances quality and diversity policies using a learned metric parameterized by the face quality, optimized end-to-end. The selection process is a differentiable farthest-point sampling (FPS) realized by approximating the non-differentiable Argmax operation with differentiable sampling from the Gumbel-Softmax distribution of distances. The small coreset is later used as queries in a self- and cross-attention architecture to enrich the descriptor with information from the whole set. Our model is order-invariant and linear in the input set size. We set a new SOTA for set-based face verification on the IJB-B and IJB-C datasets. Our code is publicly available at https://github.com/ligaripash/FaceCoresetNet.
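To make the selection step concrete, here is a minimal sketch of Gumbel-Softmax-relaxed farthest-point sampling. It assumes plain Euclidean distances between descriptors and keeps the picks soft; the paper's learned, quality-parameterized metric and exact coreset construction are not reproduced here.

```python
import torch
import torch.nn.functional as F

def soft_farthest_point_sampling(feats, k, tau=0.5):
    """Differentiable FPS sketch: at each step, the point farthest from the
    current coreset is picked via Gumbel-Softmax over distances instead of argmax.
    feats: (N, D) set of face descriptors; returns (k, D) soft coreset."""
    n, _ = feats.shape
    # start from a random seed point (hard one-hot for the first pick)
    first = F.one_hot(torch.randint(n, (1,)), n).float()              # (1, N)
    selected = [first @ feats]                                         # list of (1, D)
    min_dist = torch.full((n,), float("inf"))
    for _ in range(k - 1):
        # distance from every point to the most recently selected (soft) point
        dist = torch.cdist(feats, selected[-1]).squeeze(-1)            # (N,)
        min_dist = torch.minimum(min_dist, dist)
        # Gumbel-Softmax over distances: a differentiable stand-in for argmax
        weights = F.gumbel_softmax(min_dist.log(), tau=tau, hard=False)
        selected.append(weights.unsqueeze(0) @ feats)                  # soft pick, (1, D)
    return torch.cat(selected, dim=0)                                  # (k, D)

coreset = soft_farthest_point_sampling(torch.randn(128, 64), k=8)
print(coreset.shape)  # torch.Size([8, 64])
```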



Paperid:526
Authors:Cuifeng Shen, Yulu Gan, Chen Chen, Xiongwei Zhu, Lele Cheng, Tingting Gao, Jinzhi Wang
Peking University, Peking University, The Chinese academy of science, Kuaishou Technology, Kuaishou Technology, Kuaishou Technology, Peking University
Abstract:
The goal of conditional image-to-video (cI2V) generation is to create a believable new video by beginning with the condition, i.e., one image and text. Previous cI2V generation methods conventionally operate in RGB pixel space, with limitations in modeling motion consistency and visual continuity. Additionally, the efficiency of generating videos in pixel space is quite low. In this paper, we propose a novel approach to address these challenges by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions. Specifically, we predict temporal motions, which include motion vector and residual, based on a 3D-UNet diffusion model. By explicitly modeling temporal motions and warping them to the starting image, we improve the temporal consistency of generated videos. This results in a reduction of spatial redundancy, emphasizing temporal details. Our proposed method achieves performance improvements by disentangling content and motion, all without introducing new structural complexities to the model. Extensive experiments on various datasets confirm our approach's superior performance over the majority of state-of-the-art methods in both effectiveness and efficiency.



Paperid:527
Authors:Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin
Zhejiang University, Binjiang Institute of Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
Visual grounding, a crucial vision-language task involving the understanding of the visual context based on the query expression, requires the model to capture the interactions between objects, as well as various spatial and attribute information. However, the annotation data of the visual grounding task is limited due to its time-consuming and labor-intensive annotation process, resulting in the trained models being constrained from generalizing their capability to a broader domain. To address this challenge, we propose GroundVLP, a simple yet effective zero-shot method that harnesses visual grounding ability from existing models trained on image-text pairs and pure object detection data, both of which are more conveniently obtainable and offer a broader domain compared to visual grounding annotation data. GroundVLP proposes a fusion mechanism that combines the heatmap from GradCAM and the object proposals of open-vocabulary detectors. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets, surpassing the prior zero-shot state-of-the-art by approximately 28% on the test split of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs comparably to or even better than some non-VLP-based supervised models on the Flickr30k entities dataset. Our code is available at https://github.com/om-ai-lab/GroundVLP.



Paperid:528
Authors:Hongyu Shen, Mingtao Pei, Juncai Liu, Zhaoxing Tian
Beijing Institute of Technology, Beijing Institute of Technology, Shandong University of Science and Technology, Beijing Jishuitan Hospital
Abstract:
The automatic generation of radiology reports is of great significance: it can reduce the workload of doctors, improve the accuracy and reliability of medical diagnosis and treatment, and has attracted wide attention in recent years. Cross-modal mapping between images and text, a key component of generating high-quality reports, is challenging due to the lack of corresponding annotations. Despite its importance, previous studies have often overlooked it or lacked adequate designs for this crucial component. In this paper, we propose a method with memory alignment embedding to assist the model in aligning visual and textual features to generate a coherent and informative report. Specifically, we first obtain the memory alignment embedding by querying the memory matrix, where the query is derived from a combination of the visual features and their corresponding positional embeddings. Then the alignment between the visual and textual features can be guided by the memory alignment embedding during the generation process. The comparison experiments with other alignment methods show that the proposed alignment method is less costly and more effective. The proposed approach achieves better performance than state-of-the-art approaches on two public datasets, IU X-Ray and MIMIC-CXR, which further demonstrates the effectiveness of the proposed alignment method.
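The memory-querying step can be pictured as an attention-style lookup over a learnable memory matrix. The sketch below is an assumption about the form of that query; the feature dimension, grid size, and slot count are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

class MemoryAlignment(nn.Module):
    """Sketch of the memory-querying step (assumed attention-style lookup):
    visual features plus positional embeddings attend over a learnable
    memory matrix to produce the alignment embedding."""
    def __init__(self, dim=512, slots=256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(slots, dim) * 0.02)
        self.pos = nn.Parameter(torch.randn(49, dim) * 0.02)   # e.g. a 7x7 visual grid

    def forward(self, visual):                 # visual: (B, 49, dim)
        query = visual + self.pos              # combine features and positions
        attn = torch.softmax(query @ self.memory.t() / visual.size(-1) ** 0.5, dim=-1)
        return attn @ self.memory              # (B, 49, dim) alignment embedding

align = MemoryAlignment()(torch.randn(2, 49, 512))
print(align.shape)  # torch.Size([2, 49, 512])
```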



Paperid:529
Authors:Junao Shen, Kun Kuang, Jiaheng Wang, Xinyu Wang, Tian Feng, Wei Zhang
School of Software Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, School of Software Technology, Zhejiang University, School of Software Technology, Zhejiang University, School of Software Technology, Zhejiang University, School of Software Technology, Zhejiang University Innovation Center of Yangtze River Delta, Zhejiang University
Abstract:
Few-shot semantic segmentation (FSS) aims to segment unseen objects in a query image using a few pixel-wise annotated support images, thus expanding the capabilities of semantic segmentation. The main challenge lies in extracting sufficient information from the limited support images to guide the segmentation process. Conventional methods typically address this problem by generating single or multiple prototypes from the support images and calculating their cosine similarity to the query image. However, these methods often fail to capture meaningful information for modeling the de facto joint distribution of pixel and category. Consequently, they result in incomplete segmentation of foreground objects and mis-segmentation of the complex background. To overcome this issue, we propose the Cross Gaussian Mixture Generative Model (CGMGM), a novel Gaussian Mixture Models (GMMs)-based FSS method, which establishes the joint distribution of pixel and category in both the support and query images. Specifically, our method initially matches the feature representations of the query image with those of the support images to generate and refine an initial segmentation mask. It then employs GMMs to accurately model the joint distribution of foreground and background using the support masks and the initial segmentation mask. Subsequently, a parametric decoder applies Bayes' theorem to the joint distribution to obtain the posterior probability of pixels in the query image and generate the final segmentation mask. Experimental results on PASCAL-5i and COCO-20i datasets demonstrate our CGMGM's effectiveness and superior performance compared to the state-of-the-art methods.



Paperid:530
Authors:Lingdong Shen, Chunlei Huo, Nuo Xu, Chaowei Han, Zichen Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences MAIS, Institute of Automation, Chinese Academy of Sciences, School of Information Engineering, Capital Normal University School of Artificial Intelligence, University of Chinese Academy of Sciences NLPR, Institute of Automation, Chinese Academy of Sciences, Zhejiang Lab, School of Artificial Intelligence, University of Chinese Academy of Sciences MAIS, Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences MAIS, Institute of Automation, Chinese Academy of Sciences
Abstract:
Passive object detectors, trained on large-scale static datasets, often overlook the feedback from object detection to image acquisition. Embodied vision and active detection mitigate this issue by interacting with the environment. Nevertheless, realizing such activeness hinges on resource-intensive data collection and annotation. To tackle these challenges, we propose a collaborative student-teacher framework. Technically, a replay buffer is built based on the trajectory data to encapsulate the relationship of state, action, and reward. In addition, the student network diverges from reinforcement learning by redefining sequential decision pathways using a GPT structure enriched with causal self-attention. Moreover, the teacher network establishes a subtle state-reward mapping based on adjacent benefit differences, providing reliable rewards for the student to adaptively self-tune on the vast unlabeled replay buffer data. Additionally, an innovative yet straightforward benefit reference value is proposed within the teacher network, adding to its effectiveness and simplicity. Leveraging a flexible replay buffer and embodied collaboration between teacher and student, the framework learns to see before detection with shallower features and shorter inference steps. Experiments highlight significant advantages of our algorithm over state-of-the-art detectors. The code is released at https://github.com/lydonShen/STF.



Paperid:531
Authors:Xiaobo Shen, Peizhuo Song, Yun-Hao Yuan, Yuhui Zheng
Nanjing University of Science and Technology, Nanjing University of Science and Technology, Yangzhou University, Qinghai Normal University Nanjing University of Information Science and Technology
Abstract:
Conventional image set methods typically learn from image sets stored in one location. However, in real-world applications, image sets are often distributed or collected across different positions. Learning from such distributed image sets presents a challenge that has not been studied thus far. Moreover, efficiency is seldom addressed in large-scale image set applications. To fill these gaps, this paper proposes Distributed Manifold Hashing (DMH), which models distributed image sets as a connected graph. DMH employs Riemannian manifolds to effectively represent each image set and further suggests learning a hash code for each image set to achieve efficient computation and storage. DMH is formally formulated as a distributed learning problem with a local consistency constraint on global variables among neighbor nodes, and can be optimized in parallel. Extensive experiments on three benchmark datasets demonstrate that DMH achieves highly competitive accuracies in a distributed setting and provides faster classification and retrieval than state-of-the-art methods.



Paperid:532
Authors:Xiaolong Shen, Jianxin Ma, Chang Zhou, Zongxin Yang
Zhejiang University Alibaba Group, Alibaba Group, Alibaba Group, Zhejiang University
Abstract:
Generating photorealistic 3D faces from given conditions is a challenging task. Existing methods often rely on time-consuming one-by-one optimization approaches, which are not efficient for modeling the same distribution content, e.g., faces. Additionally, an ideal controllable 3D face generation model should consider both facial attributes and expressions. Thus we propose a novel approach called TEx-Face (TExt & Expression-to-Face) that addresses these challenges by dividing the task into three components, i.e., 3D GAN Inversion, Conditional Style Code Diffusion, and 3D Face Decoding. For 3D GAN inversion, we introduce two methods, which aim to enhance the representation of style codes and alleviate 3D inconsistencies. Furthermore, we design a style code denoiser to incorporate multiple conditions into the style code and propose a data augmentation strategy to address the issue of insufficient paired visual-language data. Extensive experiments conducted on FFHQ, CelebA-HQ, and CelebA-Dialog demonstrate the promising performance of our TEx-Face in achieving the efficient and controllable generation of photorealistic 3D faces. The code will be publicly available.



Paperid:533
Authors:Mengmeng Sheng, Zeren Sun, Zhenhuang Cai, Tao Chen, Yichao Zhou, Yazhou Yao
Nanjing University of Science and Technology, Nanjing University of Science and Technology, Nanjing University of Science and Technology, Nanjing University of Science and Technology, Nanjing University of Science and Technology, Nanjing University of Science and Technology
Abstract:
There has been significant attention devoted to the effectiveness of various domains, such as semi-supervised learning, contrastive learning, and meta-learning, in enhancing the performance of methods for noisy label learning (NLL) tasks. However, most existing methods still depend on prior assumptions regarding clean samples amidst different sources of noise (e.g., a pre-defined drop rate or a small subset of clean samples). In this paper, we propose a simple yet powerful idea called NPN, which revolutionizes Noisy label learning by integrating Partial label learning (PLL) and Negative learning (NL). Toward this goal, we initially decompose the given label space adaptively into the candidate and complementary labels, thereby establishing the conditions for PLL and NL. We propose two adaptive data-driven paradigms of label disambiguation for PLL: hard disambiguation and soft disambiguation. Furthermore, we generate reliable complementary labels using all non-candidate labels for NL to enhance model robustness through indirect supervision. To maintain label reliability during the later stage of model training, we introduce a consistency regularization term that encourages agreement between the outputs of multiple augmentations. Experiments conducted on both synthetically corrupted and real-world noisy datasets demonstrate the superiority of NPN compared to other state-of-the-art (SOTA) methods. The source code has been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/NPN.



Paperid:534
Authors:Jinsong Shi, Pan Gao, Jie Qin
Nanjing University of Aeronautics and Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing University of Aeronautics and Astronautics
Abstract:
Image Quality Assessment (IQA) has long been a research hotspot in the field of image processing, especially No-Reference Image Quality Assessment (NR-IQA). Thanks to their powerful feature extraction ability, existing Convolutional Neural Network (CNN)- and Transformer-based NR-IQA methods have achieved considerable progress. However, they still exhibit limited capability when facing unknown authentic distortion datasets. To further improve NR-IQA performance, in this paper, a novel supervised contrastive learning (SCL) and Transformer-based NR-IQA model, SaTQA, is proposed. We first train a model on a large-scale synthetic dataset by SCL (no subjective image scores are required) to extract degradation features of images with various distortion types and levels. To further extract distortion information from images, we propose a backbone network incorporating the Multi-Stream Block (MSB) by combining the CNN inductive bias and the Transformer's long-term dependence modeling capability. Finally, we propose the Patch Attention Block (PAB) to obtain the final distorted image quality score by fusing the degradation features learned from contrastive learning with the perceptual distortion information extracted by the backbone network. Experimental results on six standard IQA datasets show that SaTQA outperforms the state-of-the-art methods for both synthetic and authentic datasets. Code is available at https://github.com/I2-Multimedia-Lab/SaTQA.



Paperid:535
Authors:Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang, Xianxian Li
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Harbin Institute of Technology, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University
Abstract:
How to effectively exploit spatio-temporal information is crucial for capturing target appearance changes in visual tracking. However, most deep learning-based trackers mainly focus on designing a complicated appearance model or template updating strategy, while lacking the exploitation of context between consecutive frames and thus entailing the when-and-how-to-update dilemma. To address these issues, we propose a novel explicit visual prompts framework for visual tracking, dubbed EVPTrack. Specifically, we utilize spatio-temporal tokens to propagate information between consecutive frames without focusing on updating templates. As a result, we can not only alleviate the challenge of when-to-update, but also avoid the hyper-parameters associated with updating strategies. Then, we utilize the spatio-temporal tokens to generate explicit visual prompts that facilitate inference in the current frame. The prompts are fed into a transformer encoder together with the image tokens without additional processing. Consequently, the efficiency of our model is improved by avoiding how-to-update. In addition, we consider multi-scale information as explicit visual prompts, providing multi-scale template features to enhance EVPTrack's ability to handle target scale changes. Extensive experimental results on six benchmarks (i.e., LaSOT, LaSOText, GOT-10k, UAV123, TrackingNet, and TNL2K) validate that our EVPTrack can achieve competitive performance at a real-time speed by effectively exploiting both spatio-temporal and multi-scale information. Code and models are available at https://github.com/GXNU-ZhongLab/EVPTrack.



Paperid:536
Authors:Ruohua Shi, Lingyu Duan, Tiejun Huang, Tingting Jiang
National Engineering Research Center of Visual Technology, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, National Engineering Research Center of Visual Technology, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University Peng Cheng Laboratory, National Engineering Research Center of Visual Technology, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University Beijing Academy of Artificial Intelligence, National Engineering Research Center of Visual Technology, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University National Biomedical Imaging Center, Peking University
Abstract:
Recent advances in deep learning have greatly improved the segmentation of mitochondria from Electron Microscopy (EM) images. However, suffering from variations in mitochondrial morphology, imaging conditions, and image noise, existing methods still exhibit high uncertainty in their predictions. Moreover, our findings show that predictions with high levels of uncertainty are often accompanied by inaccuracies such as ambiguous boundaries and a number of false-positive segments. To deal with the above problems, we propose a novel approach for mitochondria segmentation in 3D EM images that leverages evidential uncertainty estimation, which for the first time integrates evidential uncertainty to enhance the performance of segmentation. To be more specific, our proposed method not only provides accurate segmentation results, but also estimates the associated uncertainty. Then, the estimated uncertainty is used to help improve the segmentation performance by an uncertainty rectification module, which leverages uncertainty maps and multi-scale information to refine the segmentation. Extensive experiments conducted on four challenging benchmarks demonstrate the superiority of our proposed method over existing approaches.
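As a rough illustration of evidential uncertainty for segmentation, the sketch below assumes the common subjective-logic formulation (non-negative evidence mapped to a Dirichlet distribution); the paper's exact evidence function and rectification module are not shown.

```python
import torch

def evidential_uncertainty(logits):
    """Sketch of per-pixel evidential uncertainty (assuming the standard
    subjective-logic formulation: non-negative evidence -> Dirichlet).
    logits: (B, K, H, W) raw network outputs for K classes."""
    evidence = torch.relu(logits)                      # non-negative evidence
    alpha = evidence + 1.0                             # Dirichlet parameters
    strength = alpha.sum(dim=1, keepdim=True)          # total evidence per pixel
    prob = alpha / strength                            # expected segmentation probs
    uncertainty = logits.size(1) / strength            # K / S, in (0, 1]
    return prob, uncertainty

probs, unc = evidential_uncertainty(torch.randn(1, 2, 64, 64))
print(probs.shape, unc.shape)  # (1, 2, 64, 64) (1, 1, 64, 64)
```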



Paperid:537
Authors:Sang-Heon Shim, Jiwoo Chung, Jae-Pil Heo
Sungkyunkwan University, Sungkyunkwan University, Sungkyunkwan University
Abstract:
In this paper, we first investigate a visual quality degradation problem observed in recent high-resolution virtual try-on approaches. We empirically find that the textures of clothes tend to be squeezed at the sleeve, as visualized in the upper row of Fig. 1(a). The issue mainly arises from a gradient conflict between two popular losses, the Total Variation (TV) and adversarial losses. Specifically, the TV loss aims to disconnect boundaries between the sleeve and torso in a warped clothing mask, whereas the adversarial loss aims to combine them. Such contrary objectives feed back misaligned gradients to a cascaded appearance flow estimation, resulting in undesirable squeezing artifacts. To reduce this, we propose a Sequential Deformation (SD-VITON) that disentangles the appearance flow prediction layers into TV objective-dominant (TVOB) layers and a task-coexistence (TACO) layer. Specifically, we coarsely fit the clothes onto a human body via the TVOB layers, and then keep on refining via the TACO layer. In addition, the bottom row of Fig. 1(a) shows a different type of squeezing artifacts around the waist. To address it, we further propose to first warp the clothes into a tucked-out shirt style and then partially erase the texture from the warped clothes without hurting the smoothness of the appearance flows. Experimental results show that our SD-VITON successfully resolves both types of artifacts and outperforms the baseline methods. Source code will be available at https://github.com/SHShim0513/SD-VITON.



Paperid:538
Authors:Zhongyi Shui, Sunyi Zheng, Chenglu Zhu, Shichuan Zhang, Xiaoxuan Yu, Honglin Li, Jingxiong Li, Pingyi Chen, Lin Yang
Zhejiang University Westlake University, Westlake University, Westlake University, Zhejiang University Westlake University, Zhejiang University Westlake University, Zhejiang University Westlake University, Zhejiang University Westlake University, Zhejiang University Westlake University, Westlake University
Abstract:
Point-based cell detection (PCD), which pursues high-performance cell sensing under low-cost data annotation, has garnered increased attention in the computational pathology community. Unlike mainstream PCD methods that rely on intermediate density map representations, the Point-to-Point network (P2PNet) has recently emerged as an end-to-end solution for PCD, demonstrating impressive cell detection accuracy and efficiency. Nevertheless, P2PNet is limited to decoding from a single-level feature map due to the scale-agnostic property of point proposals, which is insufficient to leverage multi-scale information. Moreover, the spatial distribution of pre-set point proposals is biased from that of cells, leading to inaccurate cell localization. To lift these limitations, we present DPA-P2PNet in this work. The proposed method directly extracts multi-scale features for decoding according to the coordinates of point proposals on hierarchical feature maps. On this basis, we further devise deformable point proposals to mitigate the positional bias between proposals and potential cells to promote cell localization. Inspired by practical pathological diagnosis that usually combines high-level tissue structure and low-level cell morphology for accurate cell classification, we propose a multi-field-of-view (mFoV) variant of DPA-P2PNet to accommodate additional large FoV images with tissue information as model input. Finally, we execute the first self-supervised pre-training on immunohistochemistry histopathology image data and evaluate the suitability of four representative self-supervised methods on the PCD task. Experimental results on three benchmarks and a large-scale and real-world interval dataset demonstrate the superiority of our proposed models over the state-of-the-art counterparts. Codes and pre-trained weights are available at https://github.com/windygoo/DPA-P2PNet.



Paperid:539
Authors:Nyle Siddiqui, Praveen Tirupattur, Mubarak Shah
Center for Research in Computer Vision, University of Central Florida, Center for Research in Computer Vision, University of Central Florida, Center for Research in Computer Vision, University of Central Florida
Abstract:
In this work, we present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video. Classifying action instances captured from multiple viewpoints is more difficult due to differences in background, occlusion, and visibility of the captured action across camera angles. To tackle the various problems introduced in multi-view action recognition, we propose a novel configuration of learnable transformer decoder queries, in conjunction with two supervised contrastive losses, to enforce the learning of action features that are robust to shifts in viewpoints. Our disentangled feature learning occurs in two stages: the transformer decoder uses separate queries to separately learn action and view information, which are then further disentangled using our two contrastive losses. We show that our model and method of training significantly outperform all other uni-modal models on four multi-view action recognition datasets: NTU RGB+D, NTU RGB+D 120, PKU-MMD, and N-UCLA. Compared to previous RGB works, we see maximal improvements of 1.5%, 4.8%, 2.2%, and 4.8% on each dataset, respectively. Our code can be found here: https://github.com/NyleSiddiqui/MultiView_Actions



Paperid:540
Authors:Jaeyoon Sim, Sooyeon Jeon, InJun Choi, Guorong Wu, Won Hwa Kim
Pohang University of Science and Technology, Pohang, South Korea, Pohang University of Science and Technology, Pohang, South Korea, Pohang University of Science and Technology, Pohang, South Korea, University of North Carolina at Chapel Hill, Chapel Hill, USA, Pohang University of Science and Technology, Pohang, South Korea
Abstract:
Various Graph Neural Networks (GNNs) have been successful in analyzing data in non-Euclidean spaces; however, they have limitations such as oversmoothing, i.e., information becomes excessively averaged as the number of hidden layers increases. The issue stems from the intrinsic formulation of conventional graph convolution, where the nodal features are aggregated from a direct neighborhood per layer across the entire nodes in the graph. As setting a different number of hidden layers per node is infeasible, recent works leverage a diffusion kernel to redefine the graph structure and incorporate information from farther nodes. Unfortunately, such approaches suffer from heavy diagonalization of a graph Laplacian or learning a large transform matrix. In this regard, we propose a diffusion learning framework where the range of feature aggregation is controlled by the scale of a diffusion kernel. For efficient computation, we derive closed-form derivatives of approximations of the graph convolution with respect to the scale, so that node-wise range can be adaptively learned. With a downstream classifier, the entire framework is made trainable in an end-to-end manner. Our model is tested on various standard datasets for node-wise classification, achieving state-of-the-art performance, and it is also validated on real-world brain network data for graph classification to demonstrate its practicality for Alzheimer's classification.
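A minimal way to picture scale-controlled aggregation is a truncated Taylor approximation of the heat kernel exp(-sL), with the scale s kept differentiable. This is an illustrative stand-in only, not the closed-form derivative machinery derived in the paper.

```python
import torch

def heat_diffusion(x, laplacian, scale, order=4):
    """Sketch of scale-controlled feature diffusion: exp(-s L) x approximated by a
    truncated Taylor series, so the scale s stays differentiable and learnable.
    x: (N, D) node features, laplacian: (N, N) graph Laplacian, scale: scalar tensor."""
    out, term = x, x
    for k in range(1, order + 1):
        term = (-scale / k) * (laplacian @ term)   # next Taylor term of exp(-sL) x
        out = out + term
    return out

n = 10
adj = (torch.rand(n, n) > 0.7).float()
adj = ((adj + adj.t()) > 0).float()                # symmetrize a random toy graph
lap = torch.diag(adj.sum(1)) - adj
s = torch.tensor(0.5, requires_grad=True)
y = heat_diffusion(torch.randn(n, 16), lap, s)
y.sum().backward()
print(s.grad is not None)  # True: the aggregation range is learnable
```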



Paperid:541
Authors:Ayush Singh, Aayush J Rana, Akash Kumar, Shruti Vyas, Yogesh Singh Rawat
IIT(ISM),Dhanbad, University of Central Florida, University of Central Florida, University of Central Florida, University of Central Florida
Abstract:
In this work, we focus on label efficient learning for video action detection. We develop a novel semi-supervised active learning approach which utilizes both labeled as well as unlabeled data along with informative sample selection for action detection. Video action detection requires spatio-temporal localization along with classification, which poses several challenges for both active learning (informative sample selection) as well as semi-supervised learning (pseudo label generation). First, we propose NoiseAug, a simple augmentation strategy which effectively selects informative samples for video action detection. Next, we propose fft-attention, a novel technique based on high-pass filtering which enables effective utilization of pseudo labels for SSL in video action detection by emphasizing the relevant activity region within a video. We evaluate the proposed approach on three different benchmark datasets, UCF-101-24, JHMDB-21, and YouTube-VOS. First, we demonstrate its effectiveness on video action detection where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning along with several baseline approaches in both UCF101-24 and JHMDB-21. Next, we also show its effectiveness on YouTube-VOS for video object segmentation, demonstrating its generalization capability for other dense prediction tasks in videos.



Paperid:542
Authors:Chen Song, Chandrajit Bajaj, Qixing Huang
The University of Texas at Austin, The University of Texas at Austin, The University of Texas at Austin
Abstract:
We present DeblurSR, a novel motion deblurring approach that converts a blurry image into a sharp video. DeblurSR utilizes event data to compensate for motion ambiguities and exploits the spiking representation to parameterize the sharp output video as a mapping from time to intensity. Our key contribution, the Spiking Representation (SR), is inspired by the neuromorphic principles determining how biological neurons communicate with each other in living organisms. We discuss why the spikes can represent sharp edges and how the spiking parameters are interpreted from the neuromorphic perspective. DeblurSR has higher output quality and requires fewer computing resources than state-of-the-art event-based motion deblurring methods. We additionally show that our approach easily extends to video super-resolution when combined with recent advances in implicit neural representation.



Paperid:543
Authors:Heping Song, Jingyao Gong, Hongying Meng, Yuping Lai
School of Computer Science and Communication Engineering, Jiangsu University, China, School of Computer Science and Communication Engineering, Jiangsu University, China, Electronic and Electrical Engineering Department, Brunel University London, United Kingdom, School of Cyberspace Security, Beijing University of Posts and Telecommunications, China
Abstract:
Deep Compressed Sensing (DCS) has attracted considerable interest due to its superior quality and speed compared to traditional CS algorithms. However, current approaches employ simplistic convolutional downsampling to acquire measurements, making it difficult to retain high-level features of the original signal for better image reconstruction. Furthermore, these approaches often overlook the presence of both high- and low-frequency information within the network, despite their critical role in achieving high-quality reconstruction. To address these challenges, we propose a novel Multi-Cross Sampling and Frequency Division Network (MCFD-Net) for image CS. The Dynamic Multi-Cross Sampling (DMCS) module, a sampling network of MCFD-Net, incorporates pyramid cross convolution and dual-branch sampling with multi-level pooling. Additionally, it introduces an attention mechanism between perception blocks to enhance adaptive learning effects. In the second deep reconstruction stage, we design a Frequency Division Reconstruction Module (FDRM). This module employs a discrete wavelet transform to extract high- and low-frequency information from images. It then applies multi-scale convolution and self-similarity attention compensation separately to both types of information before merging the output reconstruction results. The MCFD-Net integrates the DMCS and FDRM to construct an end-to-end learning network. Extensive CS experiments conducted on multiple benchmark datasets demonstrate that our MCFD-Net outperforms state-of-the-art approaches, while also exhibiting superior noise robustness.
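The frequency split feeding the FDRM can be pictured with a single-level 2D wavelet transform. The snippet below uses a Haar wavelet via PyWavelets purely as an assumption, since the wavelet choice is not stated here.

```python
import numpy as np
import pywt

# Sketch of the frequency split used before the two reconstruction branches
# (assuming a single-level 2D Haar DWT on a synthetic image).
image = np.random.rand(256, 256).astype(np.float32)
low, (horiz, vert, diag) = pywt.dwt2(image, "haar")

# 'low' would go to a low-frequency branch; the three detail bands carry the
# high-frequency information handled by the other branch.
print(low.shape, horiz.shape)        # (128, 128) (128, 128)
```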



Paperid:544
Authors:Huihui Song, Tiankang Su, Yuhui Zheng, Kaihua Zhang, Bo Liu, Dong Liu
B-DAT and CICAEET, Nanjing University of Information Science and Technology, Nanjing, China, B-DAT and CICAEET, Nanjing University of Information Science and Technology, Nanjing, China, B-DAT and CICAEET, Nanjing University of Information Science and Technology, Nanjing, China College of Computer, Qinghai Normal University, Xining 810016, China, B-DAT and CICAEET, Nanjing University of Information Science and Technology, Nanjing, China, Walmart Global Tech, Sunnyvale, CA, 94086, USA, Netflix Inc, Los Gatos, CA, 95032, USA
Abstract:
Existing unsupervised video object segmentation methods typically suffer from severe performance degradation on test videos in out-of-distribution scenarios. The primary reason is that test data in the real world may not follow the independent and identically distributed (i.i.d.) assumption, leading to domain shift. In this paper, we propose a generalizable Fourier augmentation method during training to improve the generalization ability of the model. To achieve this, we perform Fast Fourier Transform (FFT) over the intermediate spatial domain features in each layer to yield corresponding frequency representations, including amplitude components (encoding scene-aware styles such as texture, color, and contrast of the scene) and phase components (encoding rich semantics). We produce a variety of style features via Gaussian sampling to augment the training data, thereby improving the generalization capability of the model. To further improve the cross-domain generalization performance of the model, we design a phase feature update strategy via exponential moving average using phase features from past frames in an online update manner, which could help the model learn cross-domain-invariant features. Extensive experiments show that our proposed method achieves state-of-the-art performance on popular benchmarks.
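A compact sketch of the amplitude/phase split and Gaussian style sampling described above follows. The perturbation form (multiplicative noise on the amplitude) and its strength are assumptions for illustration, not the paper's exact augmentation.

```python
import torch

def fourier_style_augment(feat, sigma=0.1):
    """Sketch of amplitude-perturbing Fourier augmentation (assumed form):
    FFT the feature map, jitter the amplitude (style) with Gaussian noise while
    keeping the phase (semantics) intact, then invert the transform.
    feat: (B, C, H, W) intermediate feature map."""
    spec = torch.fft.fft2(feat, dim=(-2, -1))
    amp, phase = spec.abs(), spec.angle()
    # Gaussian sampling around the original amplitude produces new "styles"
    amp_aug = amp * (1.0 + sigma * torch.randn_like(amp))
    spec_aug = torch.polar(amp_aug, phase)           # rebuild the complex spectrum
    return torch.fft.ifft2(spec_aug, dim=(-2, -1)).real

x = torch.randn(2, 8, 32, 32)
print(fourier_style_augment(x).shape)  # torch.Size([2, 8, 32, 32])
```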



Paperid:545
Authors:Kaiyou Song, Shan Zhang, Tong Wang
MEGVII Technology, MEGVII Technology, MEGVII Technology
Abstract:
The development of autoregressive modeling (AM) in computer vision lags behind natural language processing (NLP) in self-supervised pre-training. This is mainly caused by the challenge that images are not sequential signals and lack a natural order when applying autoregressive modeling. In this study, inspired by human beings’ way of grasping an image, i.e., focusing on the main object first, we present a semantic-aware autoregressive image modeling (SemAIM) method to tackle this challenge. The key insight of SemAIM is to autoregressively model images from the semantic patches to the less semantic patches. To this end, we first calculate a semantic-aware permutation of patches according to their feature similarities and then perform the autoregression procedure based on the permutation. In addition, considering that the raw pixels of patches are low-level signals and are not ideal prediction targets for learning high-level semantic representation, we also explore utilizing the patch features as the prediction targets. Extensive experiments are conducted on a broad range of downstream tasks, including image classification, object detection, and instance/semantic segmentation, to evaluate the performance of SemAIM. The results demonstrate SemAIM achieves state-of-the-art performance compared with other self-supervised methods. Specifically, with ViT-B, SemAIM achieves 84.1% top-1 accuracy for fine-tuning on ImageNet, 51.3% AP and 45.4% AP for object detection and instance segmentation on COCO, which outperforms the vanilla MAE by 0.5%, 1.0%, and 0.5%, respectively. Code is available at https://github.com/skyoux/SemAIM.
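One way to picture a semantic-aware permutation is to rank patches by similarity to a global (mean) feature and autoregress in that order. The ranking criterion below is an assumption standing in for the paper's similarity computation.

```python
import torch

def semantic_order(patch_feats):
    """Sketch of a semantic-aware patch ordering (assumed criterion):
    rank patches by cosine similarity to the mean patch feature, so the
    autoregression proceeds from the most object-centric patches outward.
    patch_feats: (N, D) features of the N image patches."""
    centre = patch_feats.mean(dim=0, keepdim=True)                  # (1, D)
    sim = torch.cosine_similarity(patch_feats, centre, dim=-1)      # (N,)
    return torch.argsort(sim, descending=True)                      # permutation indices

feats = torch.randn(196, 768)          # e.g. 14x14 ViT patches
perm = semantic_order(feats)
ordered = feats[perm]                  # fed to the autoregressive decoder in this order
```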



Paperid:546
Authors:Mingchen Song, Huiqiang Wang, Guoqiang Zhong
Ocean University of China, Ocean University of China, Ocean University of China
Abstract:
Few-shot learning poses a formidable challenge as it necessitates effective recognition of novel classes based on a limited set of examples. Recent studies have sought to address the challenge of rare samples by tuning visual features through the utilization of external text prompts. However, the performance of these methods is constrained due to the inherent modality gap between the prompt text and image features. Instead of naively utilizing the external semantic information generated from text to guide the training of the image encoder, we propose a novel self-prompt mechanism (SPM) to adaptively adjust the neural network according to unseen data. Specifically, SPM involves a systematic selection of intrinsic semantic features generated by the image encoder across spatial and channel dimensions, thereby engendering self-prompt information. Subsequently, upon backpropagation of this self-prompt information to the deeper layers of the neural network, it effectively steers the network toward the learning and adaptation of new samples. Meanwhile, we propose a novel parameter-efficient tuning method that exclusively fine-tunes the parameters relevant to self-prompt (prompts are no more than 2% of the total parameters), and the incorporation of additional learnable parameters as self-prompt ensures the retention of prior knowledge through frozen encoder weights. Therefore, our method is highly suited for few-shot recognition tasks that require both information retention and adaptive adjustment of network parameters with limited labeling data constraints. Extensive experiments demonstrate the effectiveness of the proposed SPM in both 5-way 1-shot and 5-way 5-shot settings for standard single-domain and cross-domain few-shot recognition datasets, respectively. Our code is available at https://github.com/codeshop715/SPM.



Paperid:547
Authors:Zifan Song, Guosheng Hu, Cairong Zhao
Tongji University, Oosto, Tongji University
Abstract:
Text-based person search is a challenging task aimed at locating specific target pedestrians through text descriptions. Recent advancements have been made in this field, but there remains a deficiency in datasets tailored for text-based person search. The creation of new, real-world datasets is hindered by concerns such as the risk of pedestrian privacy leakage and the substantial costs of annotation. In this paper, we introduce a framework, named Diverse Person (DP), to achieve efficient and high-quality text-based person search data generation without involving privacy concerns. Specifically, we propose to leverage available images of clothing and accessories as reference attribute images to edit the original dataset images through diffusion models. Additionally, we employ a Large Language Model (LLM) to produce annotations that are both high in quality and stylistically consistent with those found in real-world datasets. Extensive experimental results demonstrate that the baseline models trained with our DP can achieve new state-of-the-art results on three public datasets, with performance improvements up to 4.82%, 2.15%, and 2.28% on CUHK-PEDES, ICFG-PEDES, and RSTPReid in terms of Rank-1 accuracy, respectively.



Paperid:548
Authors:Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, Timo Denk
University of Washington, Google Research, ByteDance, Google Research, Google Research Seoul National University, Google DeepMind Carnegie Mellon University, Google Research, Google Research, New York University, Google DeepMind, Google DeepMind
Abstract:
Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.



Paperid:549
Authors:Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Recently, Text-to-Video (T2V) synthesis has undergone a breakthrough by training transformers or diffusion models on large-scale datasets. Nevertheless, inferring such large models incurs huge costs. Previous inference acceleration works either require costly retraining or are model-specific. To address this issue, instead of retraining we explore the inference process of two mainstream T2V models using transformers and diffusion models. The exploration reveals the redundancy in temporal attention modules of both models, which are commonly utilized to establish temporal relations among frames. Consequently, we propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights. Specifically, when aggregate temporal attention values are ranked below a certain ratio, the corresponding weights will be pruned. Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning in inference acceleration, quality assurance, and broad applicability.
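The ratio-based pruning rule can be sketched as follows; the granularity (whole attention heads) and the aggregation (summing attention values) are assumptions made for illustration.

```python
import torch

def f3_prune(temporal_attn, keep_ratio=0.7):
    """Sketch of ratio-based temporal-attention pruning (assumed granularity: heads).
    temporal_attn: (heads, frames, frames) attention maps from one temporal block.
    Heads whose aggregate attention ranks in the bottom (1 - keep_ratio) are zeroed."""
    scores = temporal_attn.sum(dim=(-2, -1))                 # aggregate value per head
    k = max(1, int(keep_ratio * scores.numel()))
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep[torch.topk(scores, k).indices] = True
    return temporal_attn * keep.view(-1, 1, 1)

attn = torch.rand(8, 16, 16)           # 8 heads over 16 frames
pruned = f3_prune(attn, keep_ratio=0.5)
print((pruned.sum(dim=(-2, -1)) == 0).sum().item())  # number of pruned heads: 4
```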



Paperid:550
Authors:Yuchao Su, Yuanman Li, Wei Wang, Jiantao Zhou, Xia Li
Shenzhen University, Shenzhen University, Shenzhen MSU-BIT University, University of Macau, Shenzhen University
Abstract:
Accurately predicting pedestrian movements in complex environments is challenging due to social interactions, scene constraints, and pedestrians' multimodal behaviors. Sequential models like long short-term memory fail to effectively integrate scene features to make predicted trajectories comply with scene constraints due to the disparate feature modalities of scene and trajectory. Though existing convolutional neural network (CNN) models can extract scene features, they are ineffective in mapping these features into scene constraints for pedestrians and struggle to model pedestrian interactions due to the loss of target pedestrian information. To address these issues, we propose a unified environmental network based on CNNs for pedestrian trajectory prediction. We introduce a polar-based method to reflect the distance and direction relationship between any position in the environment and the target pedestrian. This enables us to simultaneously model scene constraints and pedestrian social interactions in the form of feature maps. Additionally, we capture essential local features in the feature map, characterizing potential multimodal movements of pedestrians at each time step to prevent redundant predicted trajectories. We verify the performance of our proposed model on four trajectory prediction datasets, encompassing both short-term and long-term predictions. The experimental results demonstrate the superiority of our approach over existing methods.
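A simple reading of the polar-based encoding is a two-channel map holding, for every pixel, its distance and direction relative to the target pedestrian; the channel layout and normalization below are assumptions for illustration.

```python
import torch

def polar_map(height, width, target_xy):
    """Sketch of a polar encoding of the scene relative to the target pedestrian
    (assumed 2-channel form: normalized distance and direction angle per pixel)."""
    ys, xs = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                            torch.arange(width, dtype=torch.float32),
                            indexing="ij")
    dx, dy = xs - target_xy[0], ys - target_xy[1]
    dist = torch.sqrt(dx ** 2 + dy ** 2) / max(height, width)   # normalized distance
    angle = torch.atan2(dy, dx)                                  # direction in radians
    return torch.stack([dist, angle], dim=0)                     # (2, H, W)

print(polar_map(64, 64, (32.0, 32.0)).shape)  # torch.Size([2, 64, 64])
```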



Paperid:551
Authors:Yuchen Su, Zhineng Chen, Zhiwen Shao, Yuning Du, Zhilong Ji, Jinfeng Bai, Yong Zhou, Yu-Gang Jiang
Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University Baidu Inc., Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University, China University of Mining and Technology, Baidu Inc., Tomorrow Advancing Life, Tomorrow Advancing Life, China University of Mining and Technology, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University
Abstract:
Recently, regression-based methods, which predict parameterized text shapes for text localization, have gained popularity in scene text detection. However, the existing parameterized text shape methods still have limitations in modeling arbitrary-shaped texts due to ignoring the utilization of text-specific shape information. Moreover, the time consumption of the entire pipeline has been largely overlooked, leading to a suboptimal overall inference speed. To address these issues, we first propose a novel parameterized text shape method based on low-rank approximation. Unlike other shape representation methods that employ data-irrelevant parameterization, our approach utilizes singular value decomposition and reconstructs the text shape using a few eigenvectors learned from labeled text contours. By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation. Next, we propose a dual assignment scheme for speed acceleration. It adopts a sparse assignment branch to accelerate the inference speed, and meanwhile, provides ample supervised signals for training through a dense assignment branch. Building upon these designs, we implement an accurate and efficient arbitrary-shaped text detector named LRANet. Extensive experiments are conducted on several challenging benchmarks, demonstrating the superior accuracy and efficiency of LRANet compared to state-of-the-art methods. Code is available at: https://github.com/ychensu/LRANet.git
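The low-rank shape representation can be illustrated with a plain SVD over flattened contours; the number of contour points, the rank, and the synthetic data below are placeholders, not LRANet's actual configuration.

```python
import numpy as np

# Sketch of the low-rank contour representation (assumed data layout):
# each labeled text contour is resampled to P points and flattened to 2P values.
rng = np.random.default_rng(0)
contours = rng.standard_normal((1000, 2 * 16))     # hypothetical training contours

# Learn an r-dimensional shape basis with SVD.
mean = contours.mean(axis=0)
_, _, vt = np.linalg.svd(contours - mean, full_matrices=False)
basis = vt[:8]                                      # top-8 eigenvectors (r = 8)

# A detector then only regresses the 8 coefficients; the contour is rebuilt as:
coeffs = (contours[0] - mean) @ basis.T             # coefficients of one contour
recon = mean + coeffs @ basis                       # low-rank reconstruction
print(np.abs(recon - contours[0]).mean())           # reconstruction error
```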



Paperid:552
Authors:Yukun Su, Yiwen Cao, Jingliang Deng, Fengyun Rao, Qingyao Wu
South China University of Technology Tencent, South China University of Technology, South China University of Technology, Tencent, South China University of Technology Key Laboratory of Big Data and Intelligent Robot, Ministry of Education
Abstract:
A large amount of User Generated Content (UGC) is uploaded to the Internet daily and displayed to people worldwide through the client side (mobile and PC). This requires cropping algorithms to produce an aesthetic thumbnail within a specific aspect ratio on different devices. However, existing image cropping works mainly focus on landmark or landscape images, and fail to model the relations among the multiple objects with the complex background in UGC. Besides, previous methods merely consider the aesthetics of the cropped images while ignoring content integrity, which is crucial for UGC cropping. In this paper, we propose a Spatial-Semantic Collaborative cropping network (S2CNet) for arbitrary user generated content, accompanied by a new cropping benchmark. Specifically, we first mine the visual genes of the potential objects. Then, the suggested adaptive attention graph recasts this task as a procedure of information association over visual nodes. The underlying spatial and semantic relations are ultimately centralized to the crop candidate through differentiable message passing, which helps our network efficiently preserve both the aesthetics and the content integrity. Extensive experiments on the proposed UGCrop5K and other public datasets demonstrate the superiority of our approach over state-of-the-art counterparts.



Paperid:553
Authors:Hao Sun, Mingyao Zhou, Wenjing Chen, Wei Xie
Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China School of Computer Science, Central China Normal University, Wuhan, China National Language Resources Monitoring and Research Center for Network Media, Central China Normal University, Wuhan, China, Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China School of Computer Science, Central China Normal University, Wuhan, China National Language Resources Monitoring and Research Center for Network Media, Central China Normal University, Wuhan, China, School of Computer Science, Hubei University of Technology, Wuhan, China, Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China School of Computer Science, Central China Normal University, Wuhan, China National Language Resources Monitoring and Research Center for Network Media, Central China Normal University, Wuhan, China
Abstract:
Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks, which aim to obtain relevant moments within videos and highlight scores of each video clip. Recently, several methods have been devoted to building DETR-based networks to solve both MR and HD jointly. These methods simply add two separate task heads after multi-modal feature extraction and feature interaction, achieving good performance. Nevertheless, these approaches underutilize the reciprocal relationship between two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement is designed to eliminate query-irrelevant information from visual features for modal interaction. Finally, a task cooperation module is constructed to refine the retrieval pipeline and the highlight score prediction process by utilizing the reciprocity between MR and HD. Comprehensive experiments on QVHighlights, Charades-STA and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods. Codes are available at https://github.com/mingyao1120/TR-DETR.



Paperid:554
Authors:Meiqi Sun, Zhonghan Zhao, Wenhao Chai, Hanjun Luo, Shidong Cao, Yanting Zhang, Jenq-Neng Hwang, Gaoang Wang
Zhejiang University-University of Illinois Urbana Champaign Institute, Zhejiang University, College of Computer Science and Technology, Zhejiang University, Electrical and Computer Engineering Department, University of Washington, Zhejiang University-University of Illinois Urbana Champaign Institute, Zhejiang University, Zhejiang University-University of Illinois Urbana Champaign Institute, Zhejiang University, Department of Computer Science and Technology, Donghua University, Electrical and Computer Engineering Department, University of Washington, Zhejiang University-University of Illinois Urbana Champaign Institute, Zhejiang University College of Computer Science and Technology, Zhejiang University Shanghai Artificial Intelligence Laboratory
Abstract:
Animal visual perception is an important technique for automatically monitoring animal health, understanding animal behaviors, and assisting animal-related research. However, it is challenging to design a deep learning-based perception model that can freely adapt to different animals across various perception tasks, due to the varying poses of a large diversity of animals, lacking data on rare species, and the semantic inconsistency of different tasks. We introduce UniAP, a novel Universal Animal Perception model that leverages few-shot learning to enable cross-species perception among various visual tasks. Our proposed model takes support images and labels as prompt guidance for a query image. Images and labels are processed through a Transformer-based encoder and a lightweight label encoder, respectively. Then a matching module is designed for aggregating information between prompt guidance and the query image, followed by a multi-head label decoder to generate outputs for various tasks. By capitalizing on the shared visual characteristics among different animals and tasks, UniAP enables the transfer of knowledge from well-studied species to those with limited labeled data or even unseen species. We demonstrate the effectiveness of UniAP through comprehensive experiments in pose estimation, segmentation, and classification tasks on diverse animal species, showcasing its ability to generalize and adapt to new classes with minimal labeled examples.



Paperid:555
Authors:Shoukun Sun, Min Xian, Fei Xu, Luca Capriotti, Tiankai Yao
University of Idaho, University of Idaho, Idaho National Laboratory, Idaho National Laboratory, Idaho National Laboratory
Abstract:
Click-based interactive segmentation aims to extract the object of interest from an image with the guidance of user clicks. Recent work has achieved great overall performance by employing feedback from the output. However, in most state-of-the-art approaches, 1) the inference stage involves inflexible heuristic rules and requires a separate refinement model, and 2) the number of user clicks and model performance cannot be balanced. To address these challenges, we propose a click-based and mask-guided interactive image segmentation framework containing three novel components: Cascade-Forward Refinement (CFR), Iterative Click Loss (ICL), and SUEM image augmentation. The CFR offers a unified inference framework to generate segmentation results in a coarse-to-fine manner. The proposed ICL allows model training to improve segmentation and reduce user interactions simultaneously. The proposed SUEM augmentation is a comprehensive way to create large and diverse training sets for interactive image segmentation. Extensive experiments demonstrate the state-of-the-art performance of the proposed approach on five public datasets. Remarkably, our model reduces the number of clicks required to surpass an IoU of 0.95 by 33.2% and 15.5% compared with the previous state-of-the-art approach on the Berkeley and DAVIS sets, respectively.



Paperid:556
Authors:Xinyu Sun, Zhikun Zhao, Lili Wei, Congyan Lang, Mingxuan Cai, Longfei Han, Juan Wang, Bing Li, Yuxuan Guo
Key Laboratory of Big Data & Artificial Intelligence in Transportation (Ministry of Education), School of Computer and Information Technology, Beijing Jiaotong University Institute of Automation, Chinese Academy of Sciences, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Ministry of Education), School of Computer and Information Technology, Beijing Jiaotong University Institute of Automation, Chinese Academy of Sciences, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Ministry of Education), School of Computer and Information Technology, Beijing Jiaotong University, Key Laboratory of Big Data & Artificial Intelligence in Transportation (Ministry of Education), School of Computer and Information Technology, Beijing Jiaotong University, Shanghai Jiaotong University, Beijing Technology and Business University, Institute of Automation, Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences PeopleAI Inc. Beijing, China, Shenzhen Heytap Technology Corp., Ltd
Abstract:
Hardware image signal processing (ISP), aiming at converting RAW inputs to RGB images, consists of a series of processing blocks, each with multiple parameters. Traditionally, ISP parameters are manually tuned in isolation by imaging experts according to application-specific quality and performance metrics, which is time-consuming and biased towards human perception due to complex interaction with the output image. Since the relationship between any single parameter's variation and the output performance metric is a complex, non-linear function, optimizing such a large number of ISP parameters is challenging. To address this challenge, we propose a novel Sequential ISP parameter optimization model, called the RL-SeqISP model, which utilizes deep reinforcement learning to jointly optimize all ISP parameters for a variety of imaging applications. Concretely, inspired by the sequential tuning process of human experts, the proposed model can progressively enhance image quality by seamlessly integrating information from both the image feature space and the parameter space. Furthermore, a dynamic parameter optimization module is introduced to avoid ISP parameters getting stuck in local optima, which more effectively guarantees the optimality of the parameters resulting from the sequential learning strategy. These merits of the RL-SeqISP model as well as its high efficiency are substantiated by comprehensive experiments on a wide range of downstream tasks, including two visual analysis tasks (instance segmentation and object detection) and image quality assessment (IQA), as compared with representative methods both quantitatively and qualitatively. In particular, even using only 10% of the training data, our model outperforms other SOTA methods by an average of 7% mAP on the two visual analysis tasks.



Paperid:557
Authors:Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Lin Sun, Zhongyi Shui, Yunlong Zhang, Honglin Li, Lin Yang
College of Computer Science and Technology, Zhejiang University, China Research Center for Industries of the Future and School of Engineering, Westlake University, China, Research Center for Industries of the Future and School of Engineering, Westlake University, China, Research Center for Industries of the Future and School of Engineering, Westlake University, China, Department of Computer Science and Engineering, The Ohio State University, USA, School of Computer and Computing Science, Hangzhou City University, China, College of Computer Science and Technology, Zhejiang University, China Research Center for Industries of the Future and School of Engineering, Westlake University, China, College of Computer Science and Technology, Zhejiang University, China Research Center for Industries of the Future and School of Engineering, Westlake University, China, College of Computer Science and Technology, Zhejiang University, China Research Center for Industries of the Future and School of Engineering, Westlake University, China, Research Center for Industries of the Future and School of Engineering, Westlake University, China
Abstract:
As advances in large language models (LLMs) and multimodal techniques continue to mature, the development of general-purpose multimodal large language models (MLLMs) has surged, offering significant applications in interpreting natural images. However, the field of pathology has largely remained untapped, particularly in gathering high-quality data and designing comprehensive model frameworks. To bridge the gap in pathology MLLMs, we present PathAsst, a multimodal generative foundation AI assistant to revolutionize diagnostic and predictive analytics in pathology. The development of PathAsst involves three pivotal steps: data acquisition, CLIP model adaptation, and the training of PathAsst's multimodal generative capabilities. Firstly, we collect over 207K high-quality pathology image-text pairs from authoritative sources. Leveraging the advanced power of ChatGPT, we generate over 180K instruction-following samples. Furthermore, we devise additional instruction-following data specifically tailored for invoking eight pathology-specific sub-models we prepared, allowing PathAsst to effectively collaborate with these models, enhancing its diagnostic ability. Secondly, by leveraging the collected data, we construct PathCLIP, a pathology-dedicated CLIP, to enhance PathAsst's capabilities in interpreting pathology images. Finally, we integrate PathCLIP with Vicuna-13b and utilize pathology-specific instruction-tuning data to enhance the multimodal generation capacity of PathAsst and bolster its synergistic interactions with sub-models. The experimental results of PathAsst show the potential of harnessing AI-powered generative foundation models to improve pathology diagnosis and treatment processes. We open-source our dataset, as well as a comprehensive toolkit for extensive pathology data collection and preprocessing, at https://github.com/superjamessyx/Generative-Foundation-AI-Assistant-for-Pathology.



Paperid:558
Authors:Zhaoxu Sun, Yuze Xuan, Fang Liu, Yang Xiang
Xiaobing.ai, Xiaobing.ai, State Key Laboratory of Media Convergence and Communication, Communication University of China, Xiaobing.ai
Abstract:
Although deep generative models have greatly improved one-shot video-driven talking head generation, few studies address fine-grained controllable facial expression editing, which is crucial for practical applications. Existing methods rely on a fixed set of predefined discrete emotion labels or simply copy expressions from input videos. This is limiting as expressions are complex, and methods using only emotion labels cannot generate fine-grained, accurate or mixed expressions. Generating talking head videos with precise expressions is also difficult using 3D model-based approaches, as 3DMM only models facial movements and tends to produce deviations. In this paper, we propose a novel framework enabling fine-grained facial expression editing in talking face generation. Our goal is to achieve expression control by manipulating the intensities of individual facial Action Units (AUs) or groups. First, compared with existing methods which decouple the face into pose and expression, we propose a disentanglement scheme to isolate three components of the human face, namely appearance, pose, and expression. Second, we propose to use input AUs to control muscle group intensities in the generated face, and integrate the AU features with the disentangled expression latent code. Finally, we present a self-supervised training strategy with well-designed constraints. Experiments show our method achieves fine-grained expression control, produces high-quality talking head videos and outperforms baseline methods.



Paperid:559
Authors:Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, Yunchao Wei
Institute of Information Science, Beijing Jiaotong University, China Beijing Key Laboratory of Advanced Information Science and Network Technology, Institute of Information Science, Beijing Jiaotong University, China Beijing Key Laboratory of Advanced Information Science and Network Technology, Institute of Information Science, Beijing Jiaotong University, China Beijing Key Laboratory of Advanced Information Science and Network Technology, School of Information Science and Engineering, Yanshan University, China Hebei Key Laboratory of Information Transmission and Signal Processing, Center for Frontier AI Research, IHPC, A*STAR, Singapore, Institute of Information Science, Beijing Jiaotong University, China Beijing Key Laboratory of Advanced Information Science and Network Technology
Abstract:
This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images despite limited training data. Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries. However, the rapid advancements in synthesis technology have led to specific artifacts for each generation model. Consequently, these detectors have exhibited a lack of proficiency in learning the frequency domain and tend to overfit to the artifacts present in the training data, leading to suboptimal performance on unseen sources. To address this issue, we introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors. Our method forces the detector to continuously focus on high-frequency information, exploiting high-frequency representations of features across spatial and channel dimensions. Additionally, we incorporate a straightforward frequency domain learning module to learn source-agnostic features. It involves convolutional layers applied to both the phase spectrum and amplitude spectrum between the Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (iFFT). Extensive experimentation involving 17 GANs demonstrates the effectiveness of our proposed method, showcasing state-of-the-art performance (+9.8%) while requiring fewer parameters. The code is available at https://github.com/chuangchuangtan/FreqNet-DeepfakeDetection.
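As a rough illustration of the frequency-domain idea described in this abstract (this is not the authors' FreqNet module; the cutoff fraction and tensor shapes are hypothetical), a high-frequency emphasis step built from FFT/iFFT might look like the following sketch:

```python
# Minimal sketch: suppress the low-frequency band of a feature map so that the
# high-frequency content dominates (illustrative only, not the FreqNet code).
import torch
import torch.fft

def high_frequency_component(x: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    # x: (B, C, H, W); `cutoff` is a hypothetical fraction of the spatial extent.
    B, C, H, W = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    mask = torch.ones(H, W, device=x.device)
    ch, cw = int(H * cutoff), int(W * cutoff)
    mask[H // 2 - ch:H // 2 + ch, W // 2 - cw:W // 2 + cw] = 0.0  # zero out low frequencies
    out = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)), norm="ortho")
    return out.real

print(high_frequency_component(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 3, 32, 32])
```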



Paperid:560
Authors:Hao Tan, Jun Li, Yizhuang Zhou, Jun Wan, Zhen Lei, Xiangyu Zhang
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, Megvii Technology, MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China CAIR, HKISI, Chinese Academy of Sciences, Hong Kong, China, Megvii Technology
Abstract:
Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable generalization capabilities to downstream tasks. However, existing prompt-tuning-based frameworks need to parallelize learnable textual inputs for all categories, suffering from massive GPU memory consumption when there is a large number of categories in the target dataset. Moreover, previous works require including category names within prompts, exhibiting subpar performance when dealing with ambiguous category names. To address these shortcomings, we propose Compound Text-Guided Prompt Tuning (TGP-T) that significantly reduces resource demand while achieving superior performance. We introduce text supervision to the optimization of prompts, which enables two benefits: 1) releasing the model reliance on the pre-defined category names during inference, thereby enabling more flexible prompt generation; 2) reducing the number of inputs to the text encoder, which decreases GPU memory consumption significantly. Specifically, we find that compound text supervision, i.e., category-wise and content-wise, is highly effective, since the two provide inter-class separability and capture intra-class variations, respectively. Moreover, we condition the prompt generation on visual features through a module called Bonder, which facilitates the alignment between prompts and visual features. Extensive experiments on few-shot recognition and domain generalization demonstrate that TGP-T achieves superior performance with consistently lower training costs. It reduces GPU memory usage by 93% and attains a 2.5% performance gain on 16-shot ImageNet. The code is available at https://github.com/EricTan7/TGP-T.



Paperid:561
Authors:Lei Tan, Jiaer Xia, Wenfeng Liu, Pingyang Dai, Yongjian Wu, Liujuan Cao
Xiamen University, Xiamen University, Xiamen University, Xiamen University, Tencent Technology (Shanghai) Co.,Ltd, Xiamen University
Abstract:
While generic person re-identification has made remarkable progress in recent years, these methods are designed under the assumption that the entire body of the person is available. This assumption brings about a significant performance degradation when suffering from occlusion caused by various obstacles in real-world applications. To address this issue, data-driven strategies have emerged to enhance the model's robustness to occlusion. Following the random erasing paradigm, these strategies typically employ randomly generated noise to supersede randomly selected image regions to simulate obstacles. However, the random strategy is not sensitive to location and content, meaning it cannot mimic real-world occlusion cases in application scenarios. To overcome this limitation and fully exploit the real scene information in datasets, this paper proposes a more intuitive and effective data-driven strategy named Saliency-Guided Patch Transfer (SPT). Combined with the vision transformer, SPT divides person instances and background obstacles using salient patch selection. By transferring person instances to different background obstacles, SPT can easily generate photo-realistic occluded samples. Furthermore, we propose an occlusion-aware Intersection over Union (OIoU) with mask-rolling to filter the more suitable combination and a class-ignoring strategy to achieve more stable processing. Extensive experimental evaluations conducted on occluded and holistic person re-identification benchmarks demonstrate that SPT provides a significant performance gain among different ViT-based ReID algorithms on occluded ReID.



Paperid:562
Authors:Shuai Tan, Bin Ji, Ye Pan
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiaotong University
Abstract:
Although automatically animating audio-driven talking heads has recently received growing interest, previous efforts have mainly concentrated on achieving lip synchronization with the audio, neglecting two crucial elements for generating expressive videos: emotion style and art style. In this paper, we present an innovative audio-driven talking face generation method called Style2Talker. It involves two stylized stages, namely Style-E and Style-A, which integrate text-controlled emotion style and picture-controlled art style into the final output. In order to prepare the scarce emotional text descriptions corresponding to the videos, we propose a labor-free paradigm that employs large-scale pretrained models to automatically annotate emotional text labels for existing audio-visual datasets. Incorporating the synthetic emotion texts, the Style-E stage utilizes a large-scale CLIP model to extract emotion representations, which are combined with the audio, serving as the condition for an efficient latent diffusion model designed to produce emotional motion coefficients of a 3DMM model. Moving on to the Style-A stage, we develop a coefficient-driven motion generator and an art-specific style path embedded in the well-known StyleGAN. This allows us to synthesize high-resolution artistically stylized talking head videos using the generated emotional motion coefficients and an art style source picture. Moreover, to better preserve image details and avoid artifacts, we provide StyleGAN with the multi-scale content features extracted from the identity image and refine its intermediate feature maps by the designed content encoder and refinement network, respectively. Extensive experimental results demonstrate our method outperforms existing state-of-the-art methods in terms of audio-lip synchronization and performance of both emotion style and art style.



Paperid:563
Authors:Shuai Tan, Bin Ji, Yu Ding, Ye Pan
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Netease Fuxi AI Lab, Shanghai Jiao Tong University
Abstract:
Generating stylized talking heads with diverse head motions is crucial for achieving natural-looking videos but remains challenging. Previous works either adopt a regressive method to capture the speaking style, resulting in a coarse style that is averaged across all training data, or employ a universal network to synthesize videos with different styles, which causes suboptimal performance. To address these, we propose a novel dynamic-weight method, namely Say Anything with Any Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook. Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction. This discrete prior, along with the generative model, enhances the precision and robustness when extracting the speaking styles of the given style clips. By utilizing the extracted style, a residual architecture comprising a canonical branch and style-specific branch is employed to predict the mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we steer clear of employing a universal network by exploring an elaborate HyperStyle to produce the style-specific weights offset for the style branch. Furthermore, we construct a pose generator and a pose codebook to store the quantized pose representation, allowing us to sample diverse head motions aligned with the audio and the extracted style. Experiments demonstrate that our approach surpasses state-of-the-art methods in terms of both lip-synchronization and stylized expression. Besides, we extend our SAAS to the video-driven style editing field and achieve satisfactory performance as well.



Paperid:564
Authors:Zhaorui Tan, Xi Yang, Kaizhu Huang
Department of Intelligent Science, Xi’an Jiaotong-Liverpool University Department of Computer Science, University of Liverpool, Department of Intelligent Science, Xi’an Jiaotong-Liverpool University, Data Science Research Center, Duke Kunshan University
Abstract:
Data augmentation has recently been leveraged as an effective regularizer in various vision-language deep neural networks. However, in text-to-image synthesis (T2Isyn), current augmentation wisdom still suffers from the semantic mismatch between augmented paired data. Even worse, semantic collapse may occur when generated images are less semantically constrained. In this paper, we develop a novel Semantic-aware Data Augmentation (SADA) framework dedicated to T2Isyn. In particular, we propose to augment texts in the semantic space via an Implicit Textual Semantic Preserving Augmentation, in conjunction with a specifically designed Image Semantic Regularization Loss as Generated Image Semantic Conservation, to cope well with semantic mismatch and collapse. As one major contribution, we theoretically show that Implicit Textual Semantic Preserving Augmentation can certify better text-image consistency, while the Image Semantic Regularization Loss, by regularizing the semantics of generated images, avoids semantic collapse and enhances image quality. Extensive experiments validate that SADA enhances text-image consistency and improves image quality significantly in T2Isyn models across various backbones. Notably, incorporating SADA during the tuning process of Stable Diffusion models also yields performance improvements.



Paperid:565
Authors:Bowen Tang, Jing Zhang, Long Yan, Qian Yu, Lu Sheng, Dong Xu
Beihang University, Beihang University, Beihang University, Beihang University, Beihang University, The University of Hong Kong
Abstract:
Deep learning models have the ability to extract rich knowledge from large-scale datasets. However, the sharing of data has become increasingly challenging due to concerns regarding data copyright and privacy. Consequently, this hampers the effective transfer of knowledge from existing data to novel downstream tasks and concepts. Zero-shot learning (ZSL) approaches aim to recognize new classes by transferring semantic knowledge learned from base classes. However, traditional generative ZSL methods often require access to real images from base classes and rely on manually annotated attributes, which presents challenges in terms of data restrictions and model scalability. To this end, this paper tackles a challenging and practical problem dubbed data-free zero-shot learning (DFZSL), where only a CLIP-based classifier pre-trained on base-class data is available for zero-shot classification. Specifically, we propose a generic framework for DFZSL, which consists of three main components. Firstly, to recover the virtual features of the base data, we model the CLIP features of base-class images as samples from a von Mises-Fisher (vMF) distribution based on the pre-trained classifier. Secondly, we leverage the text features of CLIP as low-cost semantic information and propose a feature-language prompt tuning (FLPT) method to further align the virtual image features and textual features. Thirdly, we train a conditional generative model using the well-aligned virtual image features and corresponding semantic text features, enabling the generation of new-class features and achieving better zero-shot generalization. Our framework has been evaluated on five commonly used benchmarks for generalized ZSL, as well as 11 benchmarks for base-to-new ZSL. The results demonstrate the superiority and effectiveness of our approach. Our code is available at https://github.com/ylong4/DFZSL.



Paperid:566
Authors:Chuanbo Tang, Xihua Sheng, Zhuoyuan Li, Haotian Zhang, Li Li, Dong Liu
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China
Abstract:
Video compression relies heavily on exploiting the temporal redundancy between video frames, which is usually achieved by estimating and using the motion information. The motion information is represented as optical flows in most of the existing deep video compression networks. Indeed, these networks often adopt pretrained optical flow estimation networks for motion estimation. The optical flows, however, may be less suitable for video compression due to the following two factors. First, the optical flow estimation networks were trained to perform inter-frame prediction as accurately as possible, but the optical flows themselves may cost too many bits to encode. Second, the optical flow estimation networks were trained on synthetic data, and may not generalize well enough to real-world videos. We address the twofold limitations by enhancing the optical flows in two stages: offline and online. In the offline stage, we fine-tune a trained optical flow estimation network with the motion information provided by a traditional (non-deep) video compression scheme, e.g. H.266/VVC, as we believe the motion information of H.266/VVC achieves a better rate-distortion trade-off. In the online stage, we further optimize the latent features of the optical flows with a gradient descent-based algorithm for the video to be compressed, so as to enhance the adaptivity of the optical flows. We conduct experiments on two state-of-the-art deep video compression schemes, DCVC and DCVC-DC. Experimental results demonstrate that the proposed offline and online enhancement together achieves on average 13.4% bitrate saving for DCVC and 4.1% bitrate saving for DCVC-DC on the tested videos, without increasing the model or computational complexity of the decoder side.
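The online stage described above amounts to a per-video optimization of the motion latent by gradient descent. A minimal sketch of that loop follows, assuming placeholder callables for the codec's flow decoder, distortion term, and bit-rate estimate (all names are hypothetical, not the DCVC interfaces):

```python
# Illustrative sketch only: per-video refinement of an optical-flow latent by
# gradient descent on a rate-distortion objective (not the authors' code).
import torch

def refine_flow_latent(latent, decode_flow, distortion_cost, rate_cost,
                       lmbda=0.01, steps=50, lr=1e-2):
    """latent: initial motion latent from the encoder.
    decode_flow / distortion_cost / rate_cost are hypothetical callables for the
    codec's decoder, warping-plus-residual distortion, and bit-rate estimate."""
    latent = latent.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        flow = decode_flow(latent)
        loss = distortion_cost(flow) + lmbda * rate_cost(latent)  # R-D trade-off
        loss.backward()
        opt.step()
    return latent.detach()
```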



Paperid:567
Authors:Keke Tang, Xu He, Weilong Peng, Jianpeng Wu, Yawen Shi, Daizong Liu, Pan Zhou, Wenping Wang, Zhihong Tian
Cyberspace Institute of Advanced Technology, Guangzhou University, Cyberspace Institute of Advanced Technology, Guangzhou University, School of Computer Science and Cyber Engineering, Guangzhou University, Cyberspace Institute of Advanced Technology, Guangzhou University, Cyberspace Institute of Advanced Technology, Guangzhou University, Wangxuan Institute of Computer Technology, Peking University, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Department of Computer Science and Engineering, Texas A&M University, Cyberspace Institute of Advanced Technology, Guangzhou University
Abstract:
Adversarial attacks on 3D point clouds often exhibit unsatisfactory imperceptibility, which primarily stems from the disregard for manifold-aware distortion, i.e., distortion of the underlying 2-manifold surfaces. In this paper, we develop novel manifold constraints to reduce such distortion, aiming to enhance the imperceptibility of adversarial attacks on 3D point clouds. Specifically, we construct a bijective manifold mapping between point clouds and a simple parameter shape using an invertible auto-encoder. Consequently, manifold-aware distortion during attacks can be captured within the parameter space. By enforcing manifold constraints that preserve local properties of the parameter shape, manifold-aware distortion is effectively mitigated, ultimately leading to enhanced imperceptibility. Extensive experiments demonstrate that integrating manifold constraints into conventional adversarial attack solutions yields superior imperceptibility, outperforming the state-of-the-art methods.



Paperid:568
Authors:Long Tang, Dengpan Ye, Yunna Lv, Chuanxi Chen, Yunming Zhang
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education , School of Cyber Science and Engineering ,Wuhan University, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education , School of Cyber Science and Engineering ,Wuhan University, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education , School of Cyber Science and Engineering ,Wuhan University, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education , School of Cyber Science and Engineering ,Wuhan University, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education , School of Cyber Science and Engineering ,Wuhan University
Abstract:
Deep Hashing (DH)-based image retrieval has been widely applied to face-matching systems due to its accuracy and efficiency. However, this convenience comes with an increased risk of privacy leakage. DH models inherit the vulnerability to adversarial attacks, which can be used to prevent the retrieval of private images. Existing adversarial attacks against DH typically target a single image or a specific class of images, lacking universal adversarial perturbation for the entire hash dataset. In this paper, we propose the first universal transferable adversarial perturbation against DH-based facial image retrieval: a single perturbation can protect all images. Specifically, we explore the relationship between clusters learned by different DH models and define the optimization objective of the universal perturbation as moving away from the overall hash center. To mitigate the challenge of single-objective optimization, we randomly obtain sub-cluster centers and further propose sub-task-based meta-learning to aid in overall optimization. We test our method with popular facial datasets and DH models, indicating impressive cross-image, -identity, -model, and -scheme universal anti-retrieval performance. Compared to state-of-the-art methods, our performance is competitive in white-box settings and exhibits significant improvements of 10%-70% in transferability in all black-box settings.
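To convey the "move away from the hash center" objective in code, the following is a deliberately simplified sketch of optimizing one shared perturbation against a surrogate hashing model (my own simplification; the sub-cluster centers and meta-learning of the paper are omitted, and all names are hypothetical):

```python
# Conceptual sketch: learn one universal perturbation that pushes continuous hash
# codes away from a precomputed hash center (not the paper's full algorithm).
import torch

def learn_universal_perturbation(hash_model, images, center, eps=8/255,
                                 steps=100, lr=1e-2):
    """hash_model maps images to codes in [-1, 1]^K (tanh outputs assumed);
    `center` is the overall hash center of the images to be protected."""
    delta = torch.zeros_like(images[0]).requires_grad_(True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        codes = hash_model((images + delta).clamp(0, 1))
        loss = -(codes - center).abs().mean()   # maximize distance to the center
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)             # keep the perturbation imperceptible
    return delta.detach()
```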



Paperid:569
Authors:Peng Tang, Zhiqiang Xu, Chunlai Zhou, Pengfei Wei, Peng Han, Xin Cao, Tobias Lasser
MBZUAI Technical University of Munich, MBZUAI, Renmin University of China, AI Lab, Bytedance, University of Electronic Science and Technology of China, University of New South Wales, Technical University of Munich
Abstract:
Defocus blur, due to its spatially varying sizes and shapes, is hard to remove. Existing methods are either unable to effectively handle irregular defocus blur or fail to generalize well on other datasets. In this work, we propose a divide-and-conquer approach to tackling this issue, which gives rise to a novel end-to-end deep learning method, called prior-and-prediction inverse kernel transformer (P2IKT), for single image defocus deblurring. Since most defocus blur can be approximated as Gaussian blur or its variants, we construct an inverse Gaussian kernel module in our method to enhance its generalization ability. At the same time, an inverse kernel prediction module is introduced in order to flexibly address the irregular blur that cannot be approximated by Gaussian blur. We further design a scale-recurrent transformer, which estimates mixing coefficients for adaptively combining the results from the two modules and runs the scale-recurrent "coarse-to-fine" procedure for progressive defocus deblurring. Extensive experimental results demonstrate that our P2IKT outperforms previous methods in terms of PSNR on multiple defocus deblurring datasets.
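For background on why an "inverse Gaussian kernel" is well defined, the classical frequency-domain (Wiener-style) inverse of a Gaussian blur can be written in a few lines; this is textbook deconvolution for orientation only, not the P2IKT module:

```python
# Background sketch: regularized inverse of a Gaussian blur kernel in the
# frequency domain (classical Wiener deconvolution, illustrative only).
import numpy as np

def gaussian_kernel(size=21, sigma=2.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def wiener_deblur(blurred, kernel, snr=1e-2):
    H = np.fft.fft2(kernel, s=blurred.shape)
    G = np.fft.fft2(blurred)
    W = np.conj(H) / (np.abs(H) ** 2 + snr)     # regularized inverse kernel
    return np.real(np.fft.ifft2(W * G))

img = np.random.rand(64, 64)
k = gaussian_kernel()
blurred = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(k, s=img.shape)))
print(wiener_deblur(blurred, k).shape)          # (64, 64)
```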



Paperid:570
Authors:Qi Tang, Yao Zhao, Meiqin Liu, Jian Jin, Chao Yao
Institute of Information Science, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China, Institute of Information Science, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China, Institute of Information Science, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China, Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University, Singapore, School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China
Abstract:
As a critical clue of video super-resolution (VSR), inter-frame alignment significantly impacts overall performance. However, accurate pixel-level alignment is a challenging task due to the intricate motion interweaving in the video. In response to this issue, we introduce a novel paradigm for VSR named Semantic Lens, predicated on semantic priors drawn from degraded videos. Specifically, video is modeled as instances, events, and scenes via a Semantic Extractor. Those semantics assist the Pixel Enhancer in understanding the recovered contents and generating more realistic visual results. The distilled global semantics embody the scene information of each frame, while the instance-specific semantics assemble the spatial-temporal contexts related to each instance. Furthermore, we devise a Semantics-Powered Attention Cross-Embedding (SPACE) block to bridge the pixel-level features with semantic knowledge, composed of a Global Perspective Shifter (GPS) and an Instance-Specific Semantic Embedding Encoder (ISEE). Concretely, the GPS module generates pairs of affine transformation parameters for pixel-level feature modulation conditioned on global semantics. After that, the ISEE module harnesses the attention mechanism to align the adjacent frames in the instance-centric semantic space. In addition, we incorporate a simple yet effective pre-alignment module to alleviate the difficulty of model training. Extensive experiments demonstrate the superiority of our model over existing state-of-the-art VSR methods.
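The GPS step of predicting affine parameters from a global semantic vector resembles FiLM-style feature modulation. A minimal sketch of that pattern follows (dimensions and names are hypothetical, not the actual Semantic Lens module):

```python
# FiLM-style sketch: modulate pixel-level features with affine parameters
# predicted from a global semantic vector (illustrative only).
import torch
import torch.nn as nn

class GlobalSemanticModulation(nn.Module):
    def __init__(self, sem_dim=512, feat_channels=64):
        super().__init__()
        self.to_affine = nn.Linear(sem_dim, 2 * feat_channels)  # predicts (gamma, beta)

    def forward(self, feat, sem):
        # feat: (B, C, H, W) pixel features; sem: (B, sem_dim) global semantics
        gamma, beta = self.to_affine(sem).chunk(2, dim=-1)
        gamma = gamma[..., None, None]
        beta = beta[..., None, None]
        return (1 + gamma) * feat + beta

mod = GlobalSemanticModulation()
print(mod(torch.randn(2, 64, 16, 16), torch.randn(2, 512)).shape)  # torch.Size([2, 64, 16, 16])
```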



Paperid:571
Authors:Shengji Tang, Peng Ye, Baopu Li, Weihao Lin, Tao Chen, Tong He, Chong Yu, Wanli Ouyang
Fudan University, Fudan University, Independent Researcher, Fudan University, Fudan University, Shanghai AI lab, Fudan University, Shanghai AI lab
Abstract:
Recent research interprets residual networks from a new perspective, as an implicit ensemble model. From this view, previous methods such as stochastic depth and stimulative training have further improved the performance of the residual network by sampling and training its subnets. However, they both use the same supervision for all subnets of different capacities and neglect the valuable knowledge generated by subnets during training. In this manuscript, we mitigate the significant knowledge distillation gap caused by using the same kind of supervision and advocate leveraging the subnets to provide diverse knowledge. Based on this motivation, we propose a group-knowledge-based training framework for boosting the performance of residual networks. Specifically, we implicitly divide all subnets into hierarchical groups by subnet-in-subnet sampling, aggregate the knowledge of different subnets in each group during training, and exploit upper-level group knowledge to supervise lower-level subnet groups. Meanwhile, we also develop a subnet sampling strategy that naturally samples larger subnets, which are found to be more helpful than smaller subnets in boosting performance for hierarchical groups. Compared with typical subnet training and other methods, our method achieves the best efficiency and performance trade-offs on multiple datasets and network structures. The code is at https://github.com/tsj-001/AAAI24-GKT.



Paperid:572
Authors:Yiwen Tang, Ray Zhang, Zoey Guo, Xianzheng Ma, Bin Zhao, Zhigang Wang, Dong Wang, Xuelong Li
Northwestern Polytechnical University Shanghai AI Laboratory, Shanghai AI Laboratory, Shanghai AI Laboratory, Shanghai AI Laboratory, Northwestern Polytechnical University Shanghai AI Laboratory, Shanghai AI Laboratory, Shanghai AI Laboratory, Northwestern Polytechnical University Shanghai AI Laboratory
Abstract:
The popularity of pretrained large models has revolutionized downstream tasks across diverse fields, such as language, vision, and multi-modality. To minimize the adaption cost for downstream tasks, many Parameter-Efficient Fine-Tuning (PEFT) techniques are proposed for language and 2D image pre-trained models. However, the specialized PEFT method for 3D pre-trained models is still under-explored. To this end, we introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters. Specifically, for a pre-trained 3D model, we freeze most of its parameters, and only tune the newly added PEFT modules on downstream tasks, which consist of a Point-prior Prompt and a Geometry-aware Adapter. The Point-prior Prompt adopts a set of learnable prompt tokens, for which we propose to construct a memory bank with domain-specific knowledge, and utilize a parameter-free attention to enhance the prompt tokens. The Geometry-aware Adapter aims to aggregate point cloud features within spatial neighborhoods to capture fine-grained geometric information through local interactions. Extensive experiments indicate that our Point-PEFT can achieve better performance than the full fine-tuning on various downstream tasks, while using only 5% of the trainable parameters, demonstrating the efficiency and effectiveness of our approach. Code is released at https://github.com/Ivan-Tang-3D/Point-PEFT.
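As general background on the PEFT recipe this abstract builds on, a generic bottleneck adapter with a frozen backbone can be sketched as follows (illustrative only; the actual Point-prior Prompt and Geometry-aware Adapter are described in the paper and released code, and the dimensions here are hypothetical):

```python
# Generic PEFT-style sketch: a residual bottleneck adapter plus backbone freezing,
# so only a small number of parameters are trainable (not the Point-PEFT modules).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim=384, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):                              # x: (B, N, dim) token features
        return x + self.up(self.act(self.down(x)))     # residual adapter update

def freeze_backbone(backbone: nn.Module):
    for p in backbone.parameters():
        p.requires_grad_(False)                        # only adapters/prompts stay trainable

adapter = BottleneckAdapter()
print(adapter(torch.randn(2, 128, 384)).shape)         # torch.Size([2, 128, 384])
```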



Paperid:573
Authors:Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Yue Hu, Qi Wu
Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, School of Cyberspace Science and Technology, Beijing Institute of Technology, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering,Chinese Academy of Sciences, University of Adelaide
Abstract:
Different from the Composed Image Retrieval task that requires expensive labels for training task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent that could be related to domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to learn a more accurate image representation that has adaptive attention to the reference image for various manipulation descriptions. In this paper, we propose a novel context-dependent mapping network, named Context-I2W, for adaptively converting description-relevant image information into a pseudo-word token composed with the description for accurate ZS-CIR. Specifically, an Intent View Selector first dynamically learns a rotation rule to map the same image to a task-specific manipulation view. Then a Visual Target Extractor further captures local information covering the main targets in ZS-CIR tasks under the guidance of multiple learnable queries. The two complementary modules work together to map an image to a context-dependent pseudo-word token without extra supervision. Our model shows strong generalization ability on four ZS-CIR tasks, including domain conversion, object composition, object manipulation, and attribute manipulation. It obtains consistent and significant performance boosts ranging from 1.88% to 3.60% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at https://anonymous.4open.science/r/Context-I2W-4224/.



Paperid:574
Authors:Zhangyong Tang, Tianyang Xu, Xiaojun Wu, Xue-Feng Zhu, Josef Kittler
School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, PR. China, School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, PR. China, School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, PR. China, School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, PR. China, Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford,GU2 7XH, UK
Abstract:
Generative models (GMs) have received increasing research interest for their remarkable capacity to achieve comprehensive understanding. However, their potential application in the domain of multi-modal tracking has remained unexplored. In this context, we seek to uncover the potential of harnessing generative techniques to address the critical challenge of information fusion in multi-modal tracking. In this paper, we delve into two prominent GM techniques, namely, Conditional Generative Adversarial Networks (CGANs) and Diffusion Models (DMs). Different from the standard fusion process where the features from each modality are directly fed into the fusion block, we combine these multi-modal features with random noise in the GM framework, effectively transforming the original training samples into harder instances. This design excels at extracting discriminative clues from the features, enhancing the ultimate tracking performance. Based on this, we conduct extensive experiments across two multi-modal tracking tasks, three baseline methods, and four challenging benchmarks. The experimental results demonstrate that the proposed generative-based fusion mechanism achieves state-of-the-art performance by setting new records on GTOT, LasHeR and RGBD1K. Code will be available at https://github.com/Zhangyong-Tang/GMMT.



Paperid:575
Authors:Xinhao Tao, Junyan Cao, Yan Hong, Li Niu
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Ant Group, Shanghai Jiao Tong University
Abstract:
Image composition refers to inserting a foreground object into a background image to obtain a composite image. In this work, we focus on generating plausible shadows for the inserted foreground object to make the composite image more realistic. To supplement the existing small-scale dataset, we create a large-scale dataset called RdSOBA with rendering techniques. Moreover, we design a two-stage network named DMASNet with decomposed mask prediction and attentive shadow filling. Specifically, in the first stage, we decompose shadow mask prediction into box prediction and shape prediction. In the second stage, we attend to reference background shadow pixels to fill the foreground shadow. Abundant experiments prove that our DMASNet achieves better visual effects and generalizes well to real composite images.



Paperid:576
Authors:Kaibin Tian, Yanhua Cheng, Yi Liu, Xinglin Hou, Quan Chen, Han Li
Kuaishou Technology, Kuaishou Technology, Kuaishou Technology, Kuaishou Technology, Kuaishou Technology, Kuaishou Technology
Abstract:
In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, those methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture in the retrieval phase. This solution ingeniously balances the coarse and fine granularity of retrieval content. Moreover, it also strikes a harmonious equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then re-ranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate its efficiency and effectiveness. Notably, our method achieves comparable performance with the current state-of-the-art methods while being nearly 50 times faster.
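The generic two-stage idea of recalling with coarse embeddings and re-ranking with a finer score can be sketched in a few lines (function names and shapes are hypothetical, not the authors' implementation):

```python
# Sketch of coarse-to-fine retrieval: fast dot-product recall of top-k candidates,
# then re-ranking only those candidates with an expensive fine-grained scorer.
import torch

def two_stage_retrieval(text_emb, coarse_video_embs, fine_score_fn, k=50):
    """text_emb: (D,), coarse_video_embs: (N, D) pre-normalized; fine_score_fn
    takes candidate indices and returns fine-grained scores for re-ranking."""
    coarse_scores = coarse_video_embs @ text_emb          # cheap recall over all videos
    _, topk_idx = coarse_scores.topk(k)
    fine_scores = fine_score_fn(topk_idx)                  # expensive, only on top-k
    order = fine_scores.argsort(descending=True)
    return topk_idx[order]

videos = torch.nn.functional.normalize(torch.randn(1000, 256), dim=-1)
query = torch.nn.functional.normalize(torch.randn(256), dim=-1)
print(two_stage_retrieval(query, videos, lambda idx: torch.rand(len(idx)))[:5])
```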



Paperid:577
Authors:Wentao Tian, Zheng Wang, Yuqian Fu, Jingjing Chen, Lechao Cheng
Fudan University, Zhejiang University of Technology, Fudan University, Fudan University, Zhejiang Lab
Abstract:
A comprehensive understanding of videos is inseparable from describing the action with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and relationships that shape the nature of the action, resulting in a superficial understanding of the action. Motivated by this, we introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets. OVRE focuses on pairwise relations that take part in the action and describes these relation triplets with natural languages. Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a cross-modal mapping model to generate relation triplets as a sequence. Finally, we benchmark existing cross-modal generation models on the new task of OVRE. Our code and dataset are available at https://github.com/Iriya99/OVRE.



Paperid:578
Authors:Yanling Tian, Di Chen, Yunan Liu, Jian Yang, Shanshan Zhang
Nanjing University of Science and Technology, Nanjing University of Science and Technology, Nanjing University of Science and Technology Dalian Maritime University, Nanjing University of Science and Technology, Nanjing University of Science and Technology
Abstract:
Large-scale pre-training has proven to be an effective method for improving performance across different tasks. Current person search methods use ImageNet pre-trained models for feature extraction, yet this is not an optimal solution due to the gap between the pre-training task and the person search task (as a downstream task). Therefore, in this paper, we focus on pre-training for person search, which involves detecting and re-identifying individuals simultaneously. Although labeled data for person search is scarce, datasets for the two sub-tasks, person detection and re-identification, are relatively abundant. To this end, we propose a hybrid pre-training framework specifically designed for person search using sub-task data only. It consists of a hybrid learning paradigm that handles data with different kinds of supervision, and an intra-task alignment module that alleviates domain discrepancy under limited resources. To the best of our knowledge, this is the first work that investigates how to support full-task pre-training using sub-task data. Extensive experiments demonstrate that our pre-trained model can achieve significant improvements across diverse protocols, such as person search method, fine-tuning data, pre-training data and model backbone. For example, our model improves ResNet50-based NAE by a relative 10.3% in mAP. Our code and pre-trained models are released for plug-and-play usage by the person search community (https://github.com/personsearch/PretrainPS).



Paperid:579
Authors:Kun Tong, Chengze Jiang, Jie Gui, Yuan Cao
Southeast University, Nanjing, China, Southeast University, Nanjing, China, Southeast University, Nanjing, China Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education, China Purple Mountain Laboratories, China, Ocean University of China, China
Abstract:
Adversarial training (AT) is an effective defense method against gradient-based attacks to enhance the robustness of neural networks. Among them, single-step AT has emerged as a hot topic due to its simplicity and efficiency, requiring only one gradient propagation in generating adversarial examples. Nonetheless, the problem of catastrophic overfitting (CO) that causes training collapse remains poorly understood, and there exists a gap between the robust accuracy achieved through single- and multi-step AT. In this paper, we present a surprising finding that the taxonomy of adversarial examples reveals the truth of CO. Based on this conclusion, we propose taxonomy-driven fast adversarial training (TDAT), which jointly optimizes the learning objective, loss function, and initialization method, and can thereby be regarded as a new paradigm of single-step AT. Compared with other fast AT methods, TDAT can boost the robustness of neural networks, alleviate the influence of misclassified examples, and prevent CO during the training process while requiring almost no additional computational and memory resources. Our method achieves robust accuracy improvements of 1.59%, 1.62%, 0.71%, and 1.26% on the CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet-100 datasets against the projected gradient descent (PGD-10) attack with a perturbation budget of 8/255. Furthermore, our proposed method also achieves state-of-the-art robust accuracy against other attacks. Code is available at https://github.com/bookman233/TDAT.
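For orientation, the vanilla single-step (FGSM-style) adversarial training step that TDAT builds upon can be sketched as below; this is only the standard baseline recipe, not TDAT's modified objective, loss, or initialization, and the step sizes are conventional defaults rather than values from the paper:

```python
# Background sketch: one plain FGSM-style single-step adversarial training step.
import torch
import torch.nn.functional as F

def fgsm_at_step(model, x, y, optimizer, eps=8/255, alpha=10/255):
    # random-start perturbation, one gradient step to craft the adversarial example
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
    grad = torch.autograd.grad(loss, delta)[0]
    delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    # train the model on the perturbed batch
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```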



Paperid:580
Authors:Xin Tong, Shi Peng, Yufei Guo, Xuhui Huang
Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC
Abstract:
In this paper, we propose a novel transformer-based end-to-end real-time vanishing point detection method, which is named Vanishing Point TRansformer (VPTR). The proposed method can directly regress the locations of vanishing points from given images. To achieve this goal, we pose vanishing point detection as a point object detection task on the Gaussian hemisphere with region division. Considering that low-level features always provide more geometric information, which can contribute to accurate vanishing point prediction, we propose a clear architecture where vanishing point queries in the decoder directly gather multi-level features from the CNN backbone with deformable attention in VPTR. Our method does not rely on line detection or the Manhattan world assumption, which makes it more flexible to use. VPTR runs at an inference speed of 140 FPS on one NVIDIA 3090 card. Experimental results on synthetic and real-world datasets demonstrate that our method can be used in both natural and structural scenes, and is superior to other state-of-the-art methods in the balance of accuracy and efficiency.



Paperid:581
Authors:Siddharth Tourani, Muhammad Haris Khan, Carsten Rother, Bogdan Savchynskyy
Computer Vision and Learning Lab, IWR, Heidelberg University MBZUAI, MBZUAI, Computer Vision and Learning Lab, IWR, Heidelberg University, Computer Vision and Learning Lab, IWR, Heidelberg University
Abstract:
We contribute to the sparsely populated area of unsupervised deep graph matching with application to keypoint matching in images. Contrary to the standard supervised approach, our method does not require ground truth correspondences between keypoint pairs. Instead, it is self-supervised by enforcing consistency of matchings between images of the same object category. As the matching and the consistency loss are discrete, their derivatives cannot be straightforwardly used for learning. We address this issue in a principled way by building our method upon the recent results on black-box differentiation of combinatorial solvers. This makes our method exceptionally flexible, as it is compatible with arbitrary network architectures and combinatorial solvers. Our experimental evaluation suggests that our technique sets a new state-of-the-art for unsupervised graph matching.



Paperid:582
Authors:Esteve Valls Mascaró, Hyemin Ahn, Dongheui Lee
Technische Universität Wien (TU Wien), Vienna, Austria, Ulsan National Institute of Science and Technology, Technische Universität Wien (TU Wien), Vienna, Austria German Aerospace Center (DLR),
Abstract:
The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model obtains comparable or better performance than the state-of-the-art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset while achieving state-of-the-art results in motion in-betweening on the LaFAN1 dataset for long transition periods.



Paperid:583
Authors:Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol
LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France Inria, ENS, CNRS, PSL Research University, France, Inria, ENS, CNRS, PSL Research University, France, Inria, ENS, CNRS, PSL Research University, France, LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France
Abstract:
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.



Paperid:584
Authors:Thanh Vu, Baochen Sun, Bodi Yuan, Alex Ngai, Yueqi Li, Jan-Michael Frahm
University of North Carolina at Chapel Hill Mineral, Mineral, Mineral, Mineral, Mineral, University of North Carolina at Chapel Hill
Abstract:
The success of data mixing augmentations in image classification tasks has been well received. However, these techniques cannot be readily applied to object detection due to challenges such as spatial misalignment, foreground/background distinction, and plurality of instances. To tackle these issues, we first introduce a novel conceptual framework called Supervision Interpolation (SI), which offers a fresh perspective on interpolation-based augmentations by relaxing and generalizing Mixup. Based on SI, we propose LossMix, a simple yet versatile and effective regularization that enhances the performance and robustness of object detectors and more. Our key insight is that we can effectively regularize the training on mixed data by interpolating their loss errors instead of ground truth labels. Empirical results on the PASCAL VOC and MS COCO datasets demonstrate that LossMix can consistently outperform state-of-the-art methods widely adopted for detection. Furthermore, by jointly leveraging LossMix with unsupervised domain adaptation, we successfully improve existing approaches and set a new state of the art for cross-domain object detection.
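To make the "interpolate loss errors instead of labels" insight concrete, here is a deliberately simplified classification-form sketch (the paper instantiates the idea for detection, so this is for intuition only and the names are hypothetical):

```python
# Simplified sketch of the core idea: mix the inputs, then interpolate the losses
# computed against the two original targets rather than interpolating the labels.
import torch
import torch.nn.functional as F

def lossmix_style_step(model, x_a, y_a, x_b, y_b, alpha=1.0):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_a + (1 - lam) * x_b                    # mixed inputs
    logits = model(x_mix)
    # interpolate the per-target losses instead of the ground-truth labels
    loss = lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)
    return loss
```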



Paperid:585
Authors:Chase Walker, Sumit Jha, Kenny Chen, Rickard Ewetz
University of Central Florida, Florida International University, Lockheed Martin, University of Central Florida
Abstract:
Attribution algorithms are frequently employed to explain the decisions of neural network models. Integrated Gradients (IG) is an influential attribution method due to its strong axiomatic foundation. The algorithm is based on integrating the gradients along a path from a reference image to the input image. Unfortunately, it can be observed that gradients computed from regions where the output logit changes minimally along the path provide poor explanations for the model decision, which is called the saturation effect problem. In this paper, we propose an attribution algorithm called integrated decision gradients (IDG). The algorithm focuses on integrating gradients from the region of the path where the model makes its decision, i.e., the portion of the path where the output logit rapidly transitions from zero to its final value. This is practically realized by scaling each gradient by the derivative of the output logit with respect to the path. The algorithm thereby provides a principled solution to the saturation problem. Additionally, we minimize the errors within the Riemann sum approximation of the path integral by utilizing non-uniform subdivisions determined by adaptive sampling. In the evaluation on ImageNet, it is demonstrated that IDG outperforms IG, Left-IG, Guided IG, and adversarial gradient integration both qualitatively and quantitatively using standard insertion and deletion metrics across three common models.
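The logit-derivative weighting can be sketched as follows (a simplified illustration with uniform path samples; the paper uses adaptive, non-uniform subdivisions and its exact normalization may differ):

```python
import torch

def integrated_decision_gradients(model, x, baseline, target, steps=64):
    """Weight each path gradient by the local rate of change of the target
    logit along the path, so saturated regions contribute little."""
    alphas = torch.linspace(0.0, 1.0, steps)
    grads, logits = [], []
    for a in alphas:
        point = (baseline + a * (x - baseline)).detach().requires_grad_(True)
        logit = model(point.unsqueeze(0))[0, target]
        grads.append(torch.autograd.grad(logit, point)[0])
        logits.append(logit.detach())
    grads = torch.stack(grads)                              # (steps, *x.shape)
    logits = torch.stack(logits)                            # (steps,)
    dlogit = torch.gradient(logits, spacing=(alphas,))[0]   # d logit / d alpha
    weights = dlogit / dlogit.abs().sum().clamp(min=1e-8)   # normalised weights
    weighted = (weights.view(-1, *([1] * x.dim())) * grads).sum(dim=0)
    return (x - baseline) * weighted                        # attribution map
```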



Paperid:586
Authors:Angtian Wang, Yuanlu Xu, Nikolaos Sarafianos, Robert Maier, Edmond Boyer, Alan Yuille, Tony Tung
Johns Hopkins University, Meta Reality Labs Research, Meta Reality Labs Research, Meta Reality Labs Research, Meta Reality Labs Research, Johns Hopkins University, Meta Reality Labs Research
Abstract:
Neural reconstruction and rendering strategies have demonstrated state-of-the-art performances due, in part, to their ability to preserve high-level shape details. Existing approaches, however, either represent objects as implicit surface functions or neural volumes, and still struggle to recover shapes with heterogeneous materials, in particular human skin, hair or clothes. To this end, we present a new hybrid implicit surface representation to model human shapes. This representation is composed of two surface layers that represent opaque and translucent regions on the clothed human body. We segment different regions automatically using visual cues and learn to reconstruct two signed distance functions (SDFs). We perform surface-based rendering on opaque regions (e.g., body, face, clothes) to preserve high-fidelity surface normals and volume rendering on translucent regions (e.g., hair). Experiments demonstrate that our approach obtains state-of-the-art results on 3D human reconstructions, and also shows competitive performance on other objects.



Paperid:587
Authors:Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He
Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, SenseTime Research, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory The Chinese University of Hong Kong, Sun Yat-sen University, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory
Abstract:
The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, such as LLaVA, relies on language-only GPT-4 to generate data, which requires pre-annotated image captions and detection bounding boxes and struggles with understanding image details. A practical solution to this problem would be to utilize the available multimodal large language models to generate instruction data for vision-language tasks. However, it is worth noting that the currently accessible MLLMs are not as powerful as their LLM counterparts, as they tend to produce inadequate responses and generate false information. To address this issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework that enables multimodal large language models to generate instruction-tuning data and progressively enhance its quality on-the-fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct any inaccuracies in data produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality based on various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively enhances the benchmark performance. The models, datasets, and code are available at https://opendatalab.github.io/VIGC



Paperid:588
Authors:Chenyang Wang, Junjun Jiang, Kui Jiang, Xianming Liu
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Abstract:
Capturing human faces at night or in dimly lit environments has become common practice, and the resulting images suffer from complex low-light and low-resolution degradations. However, the existing face super-resolution (FSR) technologies and derived cascaded schemes are inadequate to recover credible textures. In this paper, we propose a novel approach that decomposes the restoration task into face structural fidelity maintaining and texture consistency learning. The former aims to enhance the quality of face images while improving the structural fidelity, while the latter focuses on eliminating perturbations and artifacts caused by low-light degradation and reconstruction. Based on this, we develop a novel low-light low-resolution face super-resolution framework. Our method consists of two steps: an illumination correction face super-resolution network (IC-FSRNet) for lighting the face and recovering the structural information, and a detail enhancement model (DENet) for improving facial details, thus making them more visually appealing and easier to analyze. As the relighted regions could provide complementary information to boost face super-resolution and vice versa, we introduce mutual learning to harness the informative components from relighted regions and reconstruction, and achieve iterative refinement. In addition, DENet, equipped with a diffusion probabilistic model, is built to further improve face image visual quality. Experiments demonstrate that the proposed joint optimization framework achieves significant improvements in reconstruction quality and perceptual quality over existing two-stage sequential solutions. Code is available at https://github.com/wcy-cs/IC-FSRDENet.



Paperid:589
Authors:Cong Wang, Jinshan Pan, Wanyu Lin, Jiangxin Dong, Wei Wang, Xiao-Ming Wu
The Hong Kong Polytechnic University, Nanjing University of Science and Technology, The Hong Kong Polytechnic University, Nanjing University of Science and Technology, Dalian University of Technology, The Hong Kong Polytechnic University
Abstract:
This work presents an effective depth-consistency Self-Prompt Transformer, termed SelfPromer, for image dehazing. It is motivated by an observation that the estimated depths of an image with haze residuals and its clear counterpart vary. Enforcing the depth consistency of dehazed images with clear ones, therefore, is essential for dehazing. For this purpose, we develop a prompt based on the features of depth differences between the hazy input images and corresponding clear counterparts that can guide dehazing models for better restoration. Specifically, we first apply deep features extracted from the input images to the depth difference features for generating the prompt that contains the haze residual information in the input. Then we propose a prompt embedding module that is designed to perceive the haze residuals, by linearly adding the prompt to the deep features. Further, we develop an effective prompt attention module to pay more attention to haze residuals for better removal. By incorporating the prompt, prompt embedding, and prompt attention into an encoder-decoder network based on VQGAN, we can achieve better perception quality. As the depths of clear images are not available at inference, and the dehazed images with one-time feed-forward execution may still contain a portion of haze residuals, we propose a new continuous self-prompt inference that can iteratively correct the dehazing model towards better haze-free image generation. Extensive experiments show that our SelfPromer performs favorably against the state-of-the-art approaches on both synthetic and real-world datasets in terms of perception metrics including NIQE, PI, and PIQE. The source codes will be made available at https://github.com/supersupercong/SelfPromer.



Paperid:590
Authors:Cong Wang, Jinshan Pan, Wei Wang, Gang Fu, Siyuan Liang, Mengzhu Wang, Xiao-Ming Wu, Jun Liu
The Hong Kong Polytechnic University, Nanjing University of Science and Technology, Dalian University of Technology, The Hong Kong Polytechnic University, National University of Singapore, Hebei University of Technology, The Hong Kong Polytechnic University, Singapore University of Technology and Design
Abstract:
This paper proposes UHDformer, a general Transformer for Ultra-High-Definition (UHD) image restoration. UHDformer contains two learning spaces: (a) learning in high-resolution space and (b) learning in low-resolution space. The former learns multi-level high-resolution features, fuses low- and high-resolution features, and reconstructs the residual images, while the latter explores more representative features learned from the high-resolution ones to facilitate better restoration. To better improve feature representation in low-resolution space, we propose to build feature transformation from the high-resolution space to the low-resolution one. To that end, we propose two new modules: the Dual-path Correlation Matching Transformation module (DualCMT) and the Adaptive Channel Modulator (ACM). The DualCMT selects the top C/r (r is greater than or equal to 1 and controls the squeezing level) correlation channels from the max-pooling/mean-pooling high-resolution features to replace low-resolution ones in Transformers, which can effectively squeeze useless content to improve the feature representation in low-resolution space and facilitate better recovery. The ACM is exploited to adaptively modulate multi-level high-resolution features, enabling it to provide more useful features to low-resolution space for better learning. Experimental results show that our UHDformer reduces model size by about ninety-seven percent compared with most state-of-the-art methods while significantly improving performance under different training sets on 3 UHD image restoration tasks, including low-light image enhancement, image dehazing, and image deblurring. The source codes will be made available at https://github.com/supersupercong/UHDformer.
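A rough sketch of the kind of top-C/r channel selection the DualCMT describes might look as follows (a hypothetical simplification; the actual module operates inside Transformer blocks and its correlation measure may differ):

```python
import torch
import torch.nn.functional as F

def select_topk_channels(hr_feat, lr_feat, r=2):
    """Pick the C/r pooled high-resolution channels that correlate best with
    the low-resolution representation and use them to replace channels there."""
    b, c, h, w = lr_feat.shape
    k = max(c // r, 1)
    pooled = torch.cat([F.adaptive_max_pool2d(hr_feat, (h, w)),
                        F.adaptive_avg_pool2d(hr_feat, (h, w))], dim=1)   # (b, 2c, h, w)
    lr_vec = lr_feat.mean(dim=1, keepdim=True).flatten(2)                 # (b, 1, h*w)
    corr = F.cosine_similarity(pooled.flatten(2), lr_vec, dim=2)          # (b, 2c)
    idx = corr.topk(k, dim=1).indices                                     # top-k channels
    chosen = torch.gather(pooled, 1, idx[:, :, None, None].expand(-1, -1, h, w))
    out = lr_feat.clone()
    out[:, :k] = chosen            # replace the first k low-resolution channels
    return out
```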



Paperid:591
Authors:Fei Wang, Dan Guo, Kun Li, Meng Wang
School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology Institute of Artificial Intelligence, Hefei Comprehensive National Science Center Anhui Zhonghuitong Technology Co., Ltd, School of Computer Science and Information Engineering, Hefei University of Technology, School of Computer Science and Information Engineering, Hefei University of Technology Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract:
Video Motion Magnification (VMM) aims to break the resolution limit of human visual perception capability and reveal the imperceptible minor motion that contains valuable information in the macroscopic domain. However, challenges arise in this task due to photon noise inevitably introduced by photographic devices and spatial inconsistency in amplification, leading to flickering artifacts in static fields and motion blur and distortion in dynamic fields in the video. Existing methods focus on explicit motion modeling without emphasizing prioritized denoising during the motion magnification process. This paper proposes a novel dynamic filtering strategy to achieve static-dynamic field adaptive denoising. Specifically, based on Eulerian theory, we separate texture and shape to extract motion representation through inter-frame shape differences, expecting to leverage these subdivided features to solve this task finely. Then, we introduce a novel dynamic filter that eliminates noise cues and preserves critical features in the motion magnification and amplification generation phases. Overall, our unified framework, EulerMormer, is a pioneering effort to equip learning-based VMM with a Transformer. The core of the dynamic filter lies in a global dynamic sparse cross-covariance attention mechanism that explicitly removes noise while preserving vital information, coupled with a multi-scale dual-path gating mechanism that selectively regulates the dependence on different frequency features to reduce spatial attenuation and complement motion boundaries. Extensive experiments demonstrate that EulerMormer achieves more robust video motion magnification from the Eulerian perspective, significantly outperforming state-of-the-art methods. The source code is available at https://github.com/VUT-HFUT/EulerMormer.



Paperid:592
Authors:Fengxiang Wang, Wanrong Huang, Shaowu Yang, Qi Fan, Long Lan
National University of Defense Technology, College of Computer Science and Technology, National University of Defense Technology, National University of Defense Technology, The Hong Kong University of Science and Technology, National University of Defense Technology
Abstract:
Prompt tuning provides a low-cost way of adapting vision-language models (VLMs) to various downstream vision tasks without updating the huge pre-trained parameters. Dispensing with the conventional manual crafting of prompts, the recent prompt tuning method of Context Optimization (CoOp) introduces adaptable vectors as text prompts. Nevertheless, several previous works point out that the CoOp-based approaches are easy to overfit to the base classes and hard to generalize to novel classes. In this paper, we reckon that prompt tuning works well only on the base classes because of the limited capacity of the adaptable vectors. The scale of the pre-trained model is hundreds of times that of the adaptable vector, so the learned vector has a very limited ability to absorb the knowledge of novel classes. To minimize this excessive overfitting of textual knowledge on the base classes, we view prompt tuning as learning to learn (LoL) and learn the prompt in the manner of meta-learning: dividing the base classes into many different subclasses during training can fully exert the limited capacity of prompt tuning and thus transfer its power to recognizing the novel classes. To be specific, we initially perform fine-tuning on the base classes based on the CoOp method for pre-trained CLIP. Subsequently, predicated on the fine-tuned CLIP model, we carry out further fine-tuning in an N-way K-shot manner from the perspective of meta-learning on the base classes. We finally apply the learned textual vector and VLM to unseen classes. Extensive experiments on benchmark datasets validate the efficacy of our meta-learning-informed prompt tuning, affirming its role as a robust optimization strategy for VLMs.



Paperid:593
Authors:Guanjie Wang, Zehua Ma, Chang Liu, Xi Yang, Han Fang, Weiming Zhang, Nenghai Yu
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, National University of Singapore, University of Science and Technology of China, University of Science and Technology of China
Abstract:
In recent years, with the popularity of social media applications, massive digital images are available online, which brings great convenience to image recreation. However, the use of unauthorized image materials in multi-source composite images is still inadequately regulated, which may cause significant loss and discouragement to the copyright owners of the source image materials. Ideally, deep watermarking techniques could provide a solution for protecting these copyrights based on their encoder-noise-decoder training strategy. Yet existing image watermarking schemes, which are mostly designed for single images, cannot well address the copyright protection requirements in this scenario, since the multi-source image composing process commonly includes distortions that are not well investigated in previous methods, e.g., the extreme downsizing. To meet such demands, we propose MuST, a multi-source tracing robust watermarking scheme, whose architecture includes a multi-source image detector and minimum external rectangle operation for multiple watermark resynchronization and extraction. Furthermore, we constructed an image material dataset covering common image categories and designed the simulation model of the multi-source image composing process as the noise layer. Experiments demonstrate the excellent performance of MuST in tracing sources of image materials from the composite images compared with SOTA watermarking methods, which could maintain the extraction accuracy above 98% to trace the sources of at least 3 different image materials while keeping the average PSNR of watermarked image materials higher than 42.51 dB. We released our code on https://github.com/MrCrims/MuST



Paperid:594
Authors:Haixin Wang, Jianlong Chang, Yihang Zhai, Xiao Luo, Jinan Sun, Zhouchen Lin, Qi Tian
National Engineering Research Center for Software Engineering, Peking University, Huawei Cloud & AI, National Engineering Research Center for Software Engineering, Peking University, School of Mathematical Sciences, Peking University, National Engineering Research Center for Software Engineering, Peking University, National Key Lab of General AI, School of Intelligence Science and Technology, Peking University Peng Cheng Laboratory, Huawei Cloud & AI
Abstract:
Despite recent promising performances across a range of vision tasks, vision Transformers still have an issue of high computational costs. Recently, vision prompt learning has provided an economical solution to this problem without fine-tuning the whole large-scale model. However, the efficiency and effectiveness of existing models are still far from satisfactory due to the parameter cost of extensive prompt blocks and tricky prompt framework designs. In this paper, we propose a light-weight prompt framework named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep implicit models with stable low memory costs for various complex tasks. In particular, we merely insert two equilibrium implicit layers at the two ends of the pre-trained backbone with parameters frozen. Moreover, according to the lottery hypothesis, we further prune the parameters to relieve the computation burden in implicit layers. Various experiments have validated that our LION obtains promising performances on a wide range of datasets. Most importantly, LION reduces up to 11.5% of training parameter numbers while obtaining higher performance than the state-of-the-art VPT, especially under challenging scenes. Furthermore, we find that our proposed LION has an excellent generalization performance, making it an easy way to boost transfer learning in the future.



Paperid:595
Authors:Hao Wang, Qiang Song, Ruofeng Yin, Rui Ma
School of Artificial Intelligence, Jilin University, China-Japan Union Hospital, Jilin University, China-Japan Union Hospital, Jilin University, School of Artificial Intelligence, Jilin University Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China
Abstract:
Spinal curvature estimation is important to the diagnosis and treatment of scoliosis. Existing methods face several issues such as the need for expensive annotations on the vertebral landmarks and being sensitive to the image quality. It is challenging to achieve robust estimation and obtain interpretable results, especially for low-quality images which are blurry and hazy. In this paper, we propose B-Spine, a novel deep learning pipeline to learn B-spline curve representation of the spine and estimate the Cobb angles for spinal curvature estimation from low-quality X-ray images. Given a low-quality input, a novel SegRefine network which employs the unpaired image-to-image translation is proposed to generate a high-quality spine mask from the initial segmentation result. Next, a novel mask-based B-spline prediction model is proposed to predict the B-spline curve for the spine centerline. Finally, the Cobb angles are estimated by a hybrid approach which combines the curve slope analysis and a curve-based regression model. We conduct quantitative and qualitative comparisons with the representative and SOTA learning-based methods on the public AASCE2019 dataset and our newly proposed JLU-CJUH dataset which contains more challenging low-quality images. The superior performance on both datasets shows our method can achieve both robustness and interpretability for spinal curvature estimation.



Paperid:596
Authors:Hao Wang, Fang Liu, Licheng Jiao, Jiahao Wang, Zehua Hao, Shuo Li, Lingling Li, Puhua Chen, Xu Liu
Xidian University, Xidian University, Xidian University, Xidian University, Xidian University, Xidian University, Xidian University, Xidian University, Xidian University
Abstract:
Pre-trained vision-language (V-L) models such as CLIP have demonstrated impressive zero-shot performance in many downstream tasks. Since adapting contrastive video-text pair methods like CLIP to video tasks is limited by their high cost and scale, recent approaches focus on efficiently transferring the image-based CLIP to the video domain. A major finding is that fine-tuning the pre-trained model to achieve strong fully supervised performance leads to low zero-shot, few-shot, and base-to-novel generalization. Instead, freezing the backbone network to maintain generalization ability weakens fully supervised performance. Moreover, no single prompt tuning branch consistently performs optimally. In this work, we propose a multimodal prompt learning scheme that balances supervised and generalized performance. Our prompting approach contains three sections: 1) independent prompts on both the vision and text branches to learn the language and visual contexts; 2) inter-modal prompt mapping to ensure mutual synergy; 3) reducing the discrepancy between the hand-crafted prompt (a video of a person doing [CLS]) and the learnable prompt, to alleviate forgetting about essential video scenarios. Extensive validation on fully supervised, zero-shot, few-shot, and base-to-novel generalization settings for video recognition indicates that the proposed approach achieves competitive performance at lower compute cost.



Paperid:597
Authors:Haoan Wang, Shilong Jia, Tieyong Zeng, Guixu Zhang, Zhi Li
East China Normal University, East China Normal University, The Chinese University of Hong Kong, East China Normal University, East China Normal University
Abstract:
In recent advancements concerning Domain Adaptive Object Detection (DAOD), unsupervised domain adaptation techniques have proven instrumental. These methods enable enhanced detection capabilities within unlabeled target domains by mitigating distribution differences between source and target domains. A subset of DAOD methods employs disentangled learning to segregate Domain-Specific Representations (DSR) and Domain-Invariant Representations (DIR), with ultimate predictions relying on the latter. Current practices in disentanglement, however, often lead to DIR containing residual domain-specific information. To address this, we introduce the Multi-level Disentanglement Module (MDM) that progressively disentangles DIR, enhancing comprehensive disentanglement. Additionally, our proposed Cyclic Disentanglement Module (CDM) facilitates DSR separation. To refine the process further, we employ the Categorical Features Disentanglement Module (CFDM) to isolate DIR and DSR, coupled with category alignment across scales for improved source-target domain alignment. Given its practical suitability, our model is constructed upon the foundational framework of the Single Shot MultiBox Detector (SSD), which is a one-stage object detection approach. Experimental validation highlights the effectiveness of our method, demonstrating its state-of-the-art performance across three benchmark datasets.



Paperid:598
Authors:Haoxiang Wang, Tao Yu, Tianwei Yang, Hui Qiao, Qionghai Dai
Department of Automation & BNRist, Tsinghua University, Department of Automation & BNRist, Tsinghua University, Department of Automation & BNRist, Tsinghua University, Department of Automation & BNRist, Tsinghua University Shanghai Artificial Intelligence Laboratory, Department of Automation & BNRist, Tsinghua University
Abstract:
We explore the generalization of implicit representations to the physical simulation task. Traditional time-dependent partial differential equation (PDE) solvers for physical simulation often adopt a grid or mesh for spatial discretization, which is memory-consuming at high resolution and lacks adaptivity. Many implicit representations, such as the local extreme machine or Siren, have been proposed, but they are still too compact and suffer from limited accuracy in handling local details and long convergence times. We contribute a neural simulation framework based on a multi-resolution hash grid representation to introduce hierarchical consideration of global and local information simultaneously. Furthermore, we propose two key strategies: 1) a numerical gradient method for computing high-order derivatives with boundary conditions; 2) a range-analysis sampling method for fast neural geometry boundary sampling with dynamic topologies. Our method shows much higher accuracy and strong flexibility for various simulation problems, e.g., large elastic deformations, complex fluid dynamics, and multi-scale phenomena, which remain challenging for existing neural physical solvers.
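As an illustration of how high-order derivatives can be obtained numerically from an implicit field, a minimal central-difference sketch is shown below (the paper's exact stencil and treatment of boundary conditions may differ):

```python
import torch

def numerical_laplacian(field, x, eps=1e-3):
    """Estimate the Laplacian of an implicit field with a three-point
    central-difference stencil along each coordinate axis.
    field: callable mapping (N, D) coordinates to (N,) values."""
    lap = torch.zeros(x.shape[0])
    for d in range(x.shape[1]):
        offset = torch.zeros_like(x)
        offset[:, d] = eps
        lap = lap + (field(x + offset) - 2.0 * field(x) + field(x - offset)) / eps ** 2
    return lap
```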



Paperid:599
Authors:Hebaixu Wang, Meiqi Gong, Xiaoguang Mei, Hao Zhang, Jiayi Ma
Wuhan University, Wuhan University, Wuhan University, Wuhan University, Wuhan University
Abstract:
Existing deep pan-sharpening methods lack the learning of complementary information between PAN and MS modalities in the intermediate layers, and exhibit low interpretability due to their black-box designs. To this end, an interpretable deep unfolded network with intrinsic supervision for pan-sharpening is proposed. Building upon the observation degradation process, it formulates the pan-sharpening task as a variational model minimization with a spatial consistency prior and a spectral projection prior. The former prior requires a joint component decomposition of PAN and MS images to extract intrinsic features. By being supervised in the intermediate layers, it can selectively provide high-frequency information for spatial enhancement. The latter prior constrains the intensity correlation between MS and PAN images derived from physical observations, so as to improve spectral fidelity. To further enhance the transparency of the network design, we develop an iterative solution algorithm following half-quadratic splitting to unfold the deep model. It rigorously adheres to the variational model, significantly enhancing the interpretability behind the network design and efficiently alternating the optimization of the network. Extensive experiments demonstrate the advantages of our method compared to the state-of-the-art, showcasing its remarkable generalization capability to real-world scenes. Our code is publicly available at https://github.com/Baixuzx7/DISPNet.



Paperid:600
Authors:Hexiang Wang, Fengqi Liu, Qianyu Zhou, Ran Yi, Xin Tan, Lizhuang Ma
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, East China Normal University, Shanghai Jiao Tong University East China Normal University
Abstract:
Image animation aims to bring static images to life according to driving videos and create engaging visual content that can be used for various purposes such as animation, entertainment, and education. Recent unsupervised methods utilize affine and thin-plate spline transformations based on keypoints to transfer the motion in driving frames to the source image. However, limited by the expressive power of the transformations used, these methods always produce poor results when the gap between the motion in the driving frame and the source image is large. To address this issue, we propose to model motion from the source image to the driving frame in highly-expressive diffeomorphism spaces. Firstly, we introduce Continuous Piecewise-Affine based (CPAB) transformation to model the motion and present a well-designed inference algorithm to generate CPAB transformation from control keypoints. Secondly, we propose a SAM-guided keypoint semantic loss to further constrain the keypoint extraction process and improve the semantic consistency between the corresponding keypoints on the source and driving images. Finally, we design a structure alignment loss to align the structure-related features extracted from driving and generated images, thus helping the generator generate results that are more consistent with the driving action. Extensive experiments on four datasets demonstrate the effectiveness of our method against state-of-the-art competitors quantitatively and qualitatively. Code will be publicly available at: https://github.com/DevilPG/AAAI2024-CPABMM.



Paperid:601
Authors:Hongyu Wang, Xiaotao Liu, Yifan Li, Meng Sun, Dian Yuan, Jing Liu
Guangzhou Institute of Technology, Xidian University, Guangzhou, China, Guangzhou Institute of Technology, Xidian University, Guangzhou, China, Guangzhou Institute of Technology, Xidian University, Guangzhou, China, Guangzhou Institute of Technology, Xidian University, Guangzhou, China, Guangzhou Institute of Technology, Xidian University, Guangzhou, China, Guangzhou Institute of Technology, Xidian University, Guangzhou, China
Abstract:
RGBT tracking has been widely used in various fields such as robotics, surveillance processing, and autonomous driving. Existing RGBT trackers fully explore the spatial information between the template and the search region and locate the target based on the appearance matching results. However, these RGBT trackers have very limited exploitation of temporal information, either ignoring temporal information or exploiting it through online sampling and training. The former struggles to cope with object state changes, while the latter neglects the correlation between spatial and temporal information. To alleviate these limitations, we propose a novel Temporal Adaptive RGBT Tracking framework, named TATrack. TATrack has a spatio-temporal two-stream structure and captures temporal information through an online updated template, where the two-stream structure refers to the multi-modal feature extraction and cross-modal interaction for the initial template and the online updated template, respectively. TATrack comprehensively exploits spatio-temporal information and multi-modal information for target localization. In addition, we design a spatio-temporal interaction (STI) mechanism that bridges the two branches and enables cross-modal interaction to span longer time scales. Extensive experiments on three popular RGBT tracking benchmarks show that our method achieves state-of-the-art performance while running at real-time speed.



Paperid:602
Authors:Jiahao Wang, Caixia Yan, Weizhan Zhang, Huan Liu, Hao Sun, Qinghua Zheng
School of Computer Science and Technology, MOEKLINNS Laboratory, Xi'an Jiaotong University, School of Computer Science and Technology, MOEKLINNS Laboratory, Xi'an Jiaotong University, School of Computer Science and Technology, MOEKLINNS Laboratory, Xi'an Jiaotong University, School of Computer Science and Technology, MOEKLINNS Laboratory, Xi'an Jiaotong University, China Telecom Artificial Intelligence Technology Co.Ltd, School of Computer Science and Technology, MOEKLINNS Laboratory, Xi'an Jiaotong University
Abstract:
Zero-shot object detection (ZSD) aims to localize and classify unseen objects without access to their training annotations. As a prevailing solution to ZSD, generation-based methods synthesize unseen visual features by taking seen features as reference and class semantic embeddings as guideline. Although previous works continuously improve the synthesis quality, they fail to consider the scale-varying nature of unseen objects. The generation process is performed over a single scale of object features and thus lacks scale diversity among synthesized features. In this paper, we reveal the scale-varying challenge in ZSD and propose a Scale-Aware Unseen Imagineer (SAUI) to lead the way to a novel scale-aware ZSD paradigm. To obtain multi-scale features of seen-class objects, we design a specialized coarse-to-fine extractor to capture features through multiple scale-views. To generate unseen features scale by scale, we innovate a Series-GAN synthesizer along with three scale-aware contrastive components to imagine separable, diverse and robust scale-wise unseen features. Extensive experiments on the PASCAL VOC, COCO and DIOR datasets demonstrate SAUI's better performance in different scenarios, especially for scale-varying and small objects. Notably, SAUI achieves new state-of-the-art performance on COCO and DIOR.



Paperid:603
Authors:Jiangang Wang, Yuning Cui, Yawen Li, Wenqi Ren, Xiaochun Cao
Shenzhen Campus of Sun Yat-sen University, Technical University of Munich, Beijing University of Posts and Telecommunications, Shenzhen Campus of Sun Yat-sen University, Shenzhen Campus of Sun Yat-sen University
Abstract:
With the rapid development of virtual reality, omnidirectional images (ODIs) have attracted much attention from both the industrial community and academia. However, due to storage and transmission limitations, the resolution of current ODIs is often insufficient to provide an immersive virtual reality experience. Previous approaches address this issue using conventional 2D super-resolution techniques on equirectangular projection without exploiting the unique geometric properties of ODIs. In particular, the equirectangular projection (ERP) provides a complete field-of-view but introduces significant distortion, while the cubemap projection (CMP) can reduce distortion yet has a limited field-of-view. In this paper, we present a novel Bi-Projection Omnidirectional Image Super-Resolution (BPOSR) network to take advantage of the geometric properties of the above two projections. Then, we design two tailored attention methods for these projections: Horizontal Striped Transformer Block (HSTB) for ERP and Perspective Shift Transformer Block (PSTB) for CMP. Furthermore, we propose a fusion module to make these projections complement each other. Extensive experiments demonstrate that BPOSR achieves state-of-the-art performance on omnidirectional image super-resolution. The code is available at https://github.com/W-JG/BPOSR.



Paperid:604
Authors:Jing Wang, Jiangyun Li, Chen Chen, Yisi Zhang, Haoran Shen, Tianxiang Zhang
School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, Beijing, China, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, Beijing, China, Center for Research in Computer Vision, University of Central Florida, Orlando, USA, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, Beijing, China, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, Beijing, China, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, Beijing, China
Abstract:
Few-Shot Segmentation (FSS) aims to accomplish the novel class segmentation task with a few annotated images. Current FSS research based on meta-learning focuses on designing a complex interaction mechanism between the query and support features. However, unlike humans who can rapidly learn new things from limited samples, the existing approach relies solely on fixed feature matching to tackle new tasks, lacking adaptability. In this paper, we propose a novel framework based on the adapter mechanism, namely Adaptive FSS, which can efficiently adapt the existing FSS model to the novel classes. In detail, we design the Prototype Adaptive Module (PAM), which utilizes accurate category information provided by the support set to derive class prototypes, enhancing class-specific information in the multi-stage representation. In addition, our approach is compatible with diverse FSS methods with different backbones by simply inserting PAM between the layers of the encoder. Experiments demonstrate that our method effectively improves the performance of the FSS models (e.g., MSANet, HDMNet, FPTrans, and DCAMA) and achieves new state-of-the-art (SOTA) results (i.e., 72.4% and 79.1% mIoU on PASCAL-5i 1-shot and 5-shot settings, 52.7% and 60.0% mIoU on COCO-20i 1-shot and 5-shot settings). Our code is available at https://github.com/jingw193/AdaptiveFSS.
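Class prototypes of the kind PAM builds on are commonly derived by masked average pooling over support features; a minimal sketch follows (the exact PAM design, including its multi-stage adaptation, is not reproduced here):

```python
import torch

def class_prototype(support_feat, support_mask):
    """Masked average pooling: average support features over the pixels of
    the annotated class to obtain one prototype vector per support image.
    support_feat: (B, C, H, W); support_mask: (B, 1, H, W), binary."""
    masked = support_feat * support_mask
    area = support_mask.sum(dim=(2, 3)).clamp(min=1e-6)
    return masked.sum(dim=(2, 3)) / area      # (B, C) prototypes
```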



Paperid:605
Authors:Jun Wang, Ying Cui, Dongyan Guo, Junxia Li, Qingshan Liu, Chunhua Shen
Zhejiang University of Technology, Zhejiang University of Technology, Zhejiang University of Technology, Nanjing University of Information Science & Technology, Nanjing University of Information Science & Technology, Zhejiang University
Abstract:
Point cloud completion, which refers to completing 3D shapes from partial 3D point clouds, is a fundamental problem for 3D point cloud analysis tasks. Benefiting from the development of deep neural networks, research on point cloud completion has made great progress in recent years. However, the explicit local region partition, such as kNNs, involved in existing methods makes them sensitive to the density distribution of point clouds. Moreover, it provides limited receptive fields that prevent capturing long-range context information. To solve these problems, we leverage the cross-attention and self-attention mechanisms to design a novel neural network for point cloud completion with implicit local region partition. Two basic units, Geometric Details Perception (GDP) and Self-Feature Augment (SFA), are proposed to establish the structural relationships directly among points in a simple yet effective way via the attention mechanism. Then, based on GDP and SFA, we construct a new framework with the popular encoder-decoder architecture for point cloud completion. The proposed framework, namely PointAttN, is simple, neat and effective, and can precisely capture the structural information of 3D shapes and predict complete point clouds with detailed geometry. Experimental results demonstrate that our PointAttN outperforms state-of-the-art methods on multiple challenging benchmarks. Code is available at: https://github.com/ohhhyeahhh/PointAttN



Paperid:606
Authors:Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, Yanfei Zhong
Wuhan University, Stanford University, Wuhan University, Wuhan University, Wuhan University
Abstract:
Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects and comprehensive reasoning. Based on city planning needs, we develop a multimodal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis. The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded. As objects are the basis for complex relational reasoning, we propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way. To preserve refined spatial locations and semantics, SOBA leverages a segmentation network for object semantics generation. The object-guided attention aggregates object interior features via pseudo masks, and bidirectional cross-attention further models object external relations hierarchically. To optimize object counting, we propose a numerical difference loss that dynamically adds difference penalties, unifying the classification and regression tasks. Experimental results show that SOBA outperforms both advanced general and remote sensing methods. We believe this dataset and framework provide a strong benchmark for Earth vision's complex analysis. The project page is at https://Junjue-Wang.github.io/homepage/EarthVQA.
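One plausible reading of the numerical difference loss, a classification term plus a penalty on the numeric distance between predicted and true counts, can be sketched as follows (a hedged interpretation; the paper's dynamic weighting of the penalty is not reproduced):

```python
import torch
import torch.nn.functional as F

def numerical_difference_loss(count_logits, target_counts, beta=0.1):
    """Unify counting-as-classification with a regression-style penalty that
    grows with the numeric gap between the predicted and true counts.
    count_logits: (N, max_count + 1); target_counts: (N,) integer counts."""
    ce = F.cross_entropy(count_logits, target_counts)
    pred = count_logits.argmax(dim=1).float()
    gap = (pred - target_counts.float()).abs().mean()   # difference penalty
    return ce + beta * gap
```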



Paperid:607
Authors:Kewei Wang, Yizheng Wu, Zhiyu Pan, Xingyi Li, Ke Xian, Zhe Wang, Zhiguo Cao, Guosheng Lin
Key Laboratory of Image Processing and Intelligent Control, Ministry of Education School of Artificial Intelligence and Automation, Huazhong University of Science and Technology S-Lab, Nanyang Technological University, Key Laboratory of Image Processing and Intelligent Control, Ministry of Education School of Artificial Intelligence and Automation, Huazhong University of Science and Technology S-Lab, Nanyang Technological University, Key Laboratory of Image Processing and Intelligent Control, Ministry of Education School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Key Laboratory of Image Processing and Intelligent Control, Ministry of Education School of Artificial Intelligence and Automation, Huazhong University of Science and Technology S-Lab, Nanyang Technological University, S-Lab, Nanyang Technological University, SenseTime Research, Key Laboratory of Image Processing and Intelligent Control, Ministry of Education School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, S-Lab, Nanyang Technological University
Abstract:
Class-agnostic motion prediction methods aim to comprehend motion within open-world scenarios, holding significance for autonomous driving systems. However, training a high-performance model in a fully-supervised manner always requires substantial amounts of manually annotated data, which can be both expensive and time-consuming to obtain. To address this challenge, our study explores the potential of semi-supervised learning (SSL) for class-agnostic motion prediction. Our SSL framework adopts a consistency-based self-training paradigm, enabling the model to learn from unlabeled data by generating pseudo labels through test-time inference. To improve the quality of pseudo labels, we propose a novel motion selection and re-generation module. This module effectively selects reliable pseudo labels and re-generates unreliable ones. Furthermore, we propose two data augmentation strategies: temporal sampling and BEVMix. These strategies facilitate consistency regularization in SSL. Experiments conducted on nuScenes demonstrate that our SSL method can surpass the self-supervised approach by a large margin by utilizing only a tiny fraction of labeled data. Furthermore, our method exhibits comparable performance to weakly and some fully supervised methods. These results highlight the ability of our method to strike a favorable balance between annotation costs and performance. Code will be available at https://github.com/kwwcv/SSMP.
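The consistency-based self-training paradigm can be sketched at a high level (hypothetical interfaces for the teacher and student networks; the motion selection and re-generation module and BEVMix are omitted):

```python
import torch

def self_training_step(student, teacher, unlabeled_bev, conf_thresh=0.9):
    """Generate pseudo motion labels from the teacher's confident predictions
    and supervise the student on a perturbed copy of the same BEV input."""
    with torch.no_grad():
        pseudo_motion, conf = teacher(unlabeled_bev)     # test-time inference
    reliable = (conf > conf_thresh).float()              # keep confident cells
    perturbed = unlabeled_bev + 0.01 * torch.randn_like(unlabeled_bev)
    pred_motion, _ = student(perturbed)
    diff = (pred_motion - pseudo_motion).abs() * reliable
    return diff.sum() / reliable.sum().clamp(min=1.0)    # masked L1 loss
```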



Paperid:608
Authors:Keyao Wang, Guosheng Zhang, Haixiao Yue, Ajian Liu, Gang Zhang, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang
Baidu Inc., Baidu Inc., Baidu Inc., University of Chinese Academy of Sciences, Baidu Inc., Baidu Inc., Baidu Inc., Baidu Inc., Baidu Inc.
Abstract:
Previous face Presentation Attack Detection (PAD) methods aim to improve the effectiveness of cross-domain tasks. However, in real-world scenarios, the original training data of the pre-trained model is not available due to data privacy or other reasons. Under these constraints, general methods that fine-tune on single-target-domain data may lose previously learned knowledge, leading to a catastrophic forgetting problem. To address these issues, we propose a multi-domain incremental learning (MDIL) method for PAD, which not only learns knowledge well from the new domain but also stably maintains the performance of previous domains. Specifically, we propose an adaptive domain-specific experts (ADE) framework based on the vision transformer to preserve the discriminability of previous domains. Furthermore, an asymmetric classifier is designed to keep the output distribution of different classifiers consistent, thereby improving the generalization ability. Extensive experiments show that our proposed method achieves state-of-the-art performance compared to prior incremental learning methods. Excitingly, under more stringent settings, our method approximates or even outperforms the DA/DG-based methods.



Paperid:609
Authors:Kun Wang, Zhiqiang Yan, Huang Tian, Zhenyu Zhang, Xiang Li, Jun Li, Jian Yang
Nanjing University of Science and Technology, Nanjing University of Science and Technology, Nanjing University of Science and Technology, Nanjing University, Suzhou Campus, Nankai University, Nanjing University of Science and Technology, Nanjing University of Science and Technology
Abstract:
Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF, a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistency-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality.



Paperid:610
Authors:Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, Yong Liu
Zhejiang University, Zhejiang University, Youtu Lab,Tencent, Zhejiang University, Zhejiang University, Technical University of Munich, State Grid Corporation of China, Baidu Inc, Zhejiang University
Abstract:
Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient Fine-Tuning (PEFT), has attracted substantial attention in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of compromising the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals, including the original contrastive learning head, a cross-modal classification head, a cross-modal masked language modeling head, and a visual classification head. This multi-task decoder adeptly satisfies the need for strong supervised performance within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.



Paperid:611
Authors:Miaohui Wang, Runnan Huang, Hengjin Dong, Di Lin, Yun Song, Wuyuan Xie
Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, College of Intelligence and Computing, Tianjin University, Tianjin 300072, School of Computer & Communication Engineering, Changsha University of Science & Technology, Changsha 410004, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060
Abstract:
LiDAR sensors are widely used in autonomous driving, and the growing storage and transmission demands have made LiDAR point cloud compression (LPCC) a hot research topic. To address the challenges posed by the large scale and uneven distribution (spatial and categorical) of LiDAR point data, this paper presents a new multimodal-driven scalable LPCC framework. For the large-scale challenge, we decouple the original LiDAR data into multi-layer point subsets, and compress and transmit each layer separately, so as to meet the reconstruction quality requirements of different scenarios. For the uneven-distribution challenge, we extract, align, and fuse heterologous feature representations, including the point modality with position information, the depth modality with spatial distance information, and the segmentation modality with category information. Extensive experimental results on the benchmark SemanticKITTI database validate that our method outperforms 14 recent representative LPCC methods.



Paperid:612
Authors:Ning Wang, Jiajun Deng, Mingbo Jia
Huawei Inc., University of Adelaide, Australian Institute for Machine Learning, Huawei Inc.
Abstract:
We show that visual grounding and image captioning, which perform as two mutually inverse processes, can be bridged for collaborative training through careful design. By consolidating this idea, we introduce CyCo, a cyclic-consistent learning framework to ameliorate the independent training pipelines of visual grounding and image captioning. The proposed framework (1) allows the semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; (3) yields a general captioning model that can describe arbitrary image regions. Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one also exhibits competitive performance compared to the fully supervised counterparts. Our image captioning model has the capability to freely describe image regions and meanwhile shows impressive performance on prevalent captioning benchmarks.



Paperid:613
Authors:Ruichen Wang, Zekang Chen, Chen Chen, Jian Ma, Haonan Lu, Xiaodong Lin
OPPO Research Institute, South China University of Technology, OPPO Research Institute, OPPO Research Institute, OPPO Research Institute, Rutgers University
Abstract:
Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, they fail to semantically align the generated images with the prompts due to their limited compositional capabilities, leading to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, depending on the predicted boxes, a unique mask control is applied to the cross- and self-attention maps. Our approach produces a more semantically accurate synthesis by constraining the attention regions of each token in the prompt to the image. In addition, the proposed method is straightforward and effective and can be readily integrated into existing cross-attention-based T2I generators. We compare our approach to competing methods and demonstrate that it can faithfully convey the semantics of the original text to the generated content and achieve high availability as a ready-to-use plugin. Please refer to https://github.com/OPPO-Mente-Lab/attention-mask-control.
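How predicted boxes can constrain a token's cross-attention map might look roughly like this (a hedged sketch with hypothetical helpers; the paper's exact treatment of the cross- and self-attention maps may differ):

```python
import torch

def box_masks(boxes, h, w):
    """Turn per-token boxes (normalised x1, y1, x2, y2) into binary spatial
    masks of shape (num_tokens, h, w); tokens without an entity box should
    be given an all-ones mask by the caller."""
    ys = torch.linspace(0, 1, h).view(h, 1).expand(h, w)
    xs = torch.linspace(0, 1, w).view(1, w).expand(h, w)
    masks = [((xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)).float()
             for x1, y1, x2, y2 in boxes]
    return torch.stack(masks)

def mask_cross_attention(attn, masks):
    """Restrict each token's attention to its box region: attn is a
    post-softmax cross-attention map of shape (h*w, num_tokens)."""
    attn = attn * masks.flatten(1).t()                     # zero outside boxes
    return attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
```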



Paperid:614
Authors:Ruikui Wang, Yuanfang Guo, Yunhong Wang
State Key Laboratory of Software Development Environment, Beihang University, China School of Computer Science and Engineering, Beihang University, China, State Key Laboratory of Software Development Environment, Beihang University, China School of Computer Science and Engineering, Beihang University, China, School of Computer Science and Engineering, Beihang University, China
Abstract:
In practical black-box attack scenarios, most of the existing transfer-based attacks employ pretrained models (e.g., ResNet50) as the substitute models. Unfortunately, these substitute models are not always appropriate for transfer-based attacks. Firstly, these models are usually trained on a large-scale annotated dataset, which is extremely expensive and time-consuming to construct. Secondly, the primary goal of these models is to perform a specific task, such as image classification, which is not developed for adversarial attacks. To tackle the above issues, i.e., high cost and overfitting to task-specific models, we propose an Affordable and Generalizable Substitute (AGS) training framework tailored for transfer-based adversarial attacks. Specifically, we train the substitute model from scratch by our proposed adversary-centric contrastive learning. This learning mechanism introduces another sample with slight adversarial perturbations as an additional positive view of the input image, and then encourages the adversarial view and two benign views to interact comprehensively with each other. To further boost the generalizability of the substitute model, we propose adversarial invariant learning to keep the representations of adversarial examples invariant under augmentations of various strengths. Our AGS model can be trained solely with unlabeled and out-of-domain data and avoids overfitting to any task-specific models, because of its inherently self-supervised nature. Extensive experiments demonstrate that our AGS achieves comparable or superior performance compared to substitute models pretrained on the complete ImageNet training set, when executing attacks across a diverse range of target models, including ViTs, robustly trained models, and object detection and segmentation models. Our source codes are available at https://github.com/lwmming/AGS.



Paperid:615
Authors:Ruilu Wang, Yang Xue, Lianwen Jin
South China University of Technology, South China University of Technology, South China University of Technology
Abstract:
Document Image Enhancement (DIE) remains challenging due to the prevalence of multiple degradations in document images captured by cameras. In this paper, we address an interesting question: can the performance of pretrained models and downstream DIE models be improved if they are bootstrapped using different degradation types of the same semantic samples and their high-dimensional features with ambiguous inter-class distance? To this end, we propose an effective contrastive learning paradigm for DIE: a Document image enhancement framework with Normalization and Latent Contrast (DocNLC). While existing DIE methods focus on eliminating one type of degradation, DocNLC considers the relationship between different types of degradation while utilizing both direct and latent contrasts to constrain content consistency, thus achieving a unified treatment of multiple types of degradation. Specifically, we devise a latent contrastive learning module to enforce explicit decorrelation of the normalized representations of different degradation types and to minimize the redundancy between them. Comprehensive experiments show that our method outperforms state-of-the-art DIE models in both pre-training and fine-tuning stages on four publicly available independent datasets. In addition, we discuss the potential benefits of DocNLC for downstream tasks. Our code is released at https://github.com/RylonW/DocNLC



Paperid:616
Authors:Ruofan Wang, Rui-Wei Zhao, Xiaobo Zhang, Rui Feng
School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, Academy for Engineering and Technology, Fudan University, Shanghai, Children’s Hospital of Fudan University, National Children’s Medical Center, Shanghai, China, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai Academy for Engineering and Technology, Fudan University, Shanghai Children’s Hospital of Fudan University, National Children’s Medical Center, Shanghai, China Shanghai Collaborative Innovation Center of Intelligent Visual Computing
Abstract:
Object detection in open-world scenarios poses a formidable challenge for models intended for real-world deployment. Advanced closed-set object detectors achieve impressive performance under the closed-set setting, but often produce overconfident mispredictions on unknown objects due to the lack of supervision. In this paper, we propose a novel Evidential Object Detector (EOD) to formulate the Open Set Object Detection (OSOD) problem from the perspective of Evidential Deep Learning (EDL) theory, which quantifies classification uncertainty by placing a Dirichlet prior over the categorical distribution parameters. The task-specific customized evidential framework, equipped with a meticulously designed model architecture and loss function, effectively bridges the gap between EDL theory and detection tasks. Moreover, we utilize contrastive learning as an implicit means of evidential regularization and to encourage class separation in the latent space. In addition, we innovatively model the background uncertainty to further improve the unknown discovery ability. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms existing ones.



Paperid:617
Authors:Shijing Wang, Yaping Huang
Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, China, Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, China
Abstract:
Uncertainty in gaze estimation manifests in two aspects: 1) low-quality images caused by occlusion, blurriness, inconsistent eye movements, or even non-face images; 2) incorrect labels resulting from the misalignment between the labeled and actual gaze points during the annotation process. Allowing these uncertainties to participate in training hinders the improvement of gaze estimation. To tackle these challenges, in this paper, we propose an effective solution, named Suppressing Uncertainty in Gaze Estimation (SUGE), which introduces a novel triplet-label consistency measurement to estimate and reduce the uncertainties. Specifically, for each training sample, we estimate a novel ``neighboring label'' calculated by a linearly weighted projection from the neighbors to capture the similarity relationship between image features and their corresponding labels, which can be combined with the predicted pseudo label and the ground-truth label for uncertainty estimation. By modeling such triplet-label consistency, we can largely reduce the negative effects of unqualified images and wrong labels through our designed sample weighting and label correction strategies. Experimental results on the gaze estimation benchmarks indicate that our proposed SUGE achieves state-of-the-art performance.
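A hedged sketch of the ``neighboring label'' idea (illustrative, not the paper's exact formulation): each sample's neighboring label is a similarity-weighted combination of the labels of its nearest neighbors in feature space; the choice of k and the softmax weighting below are assumptions.

```python
import torch
import torch.nn.functional as F

def neighboring_label(features, labels, k=5):
    """
    features: (N, D) image features of the training batch or bank.
    labels  : (N, 2) annotated gaze labels (e.g. yaw, pitch).
    For each sample, the neighboring label is a similarity-weighted average of
    the labels of its k nearest neighbors in feature space (excluding itself),
    which can then be compared with the pseudo and ground-truth labels to form
    a triplet-label consistency measure.
    """
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t()                          # (N, N) cosine similarity
    sim.fill_diagonal_(float('-inf'))                # exclude the sample itself
    topk_sim, topk_idx = sim.topk(k, dim=1)          # (N, k)
    weights = topk_sim.softmax(dim=1)                # linear weights per neighbor
    return (weights.unsqueeze(-1) * labels[topk_idx]).sum(dim=1)   # (N, 2)
```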



Paperid:618
Authors:Shuo Wang, Zhihao Wu, Xiaobo Hu, Jinwen Wang, Youfang Lin, Kai Lv
School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China National Key Laboratory of Air-based Information Perception and Fusion, Luoyang, China, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China National Key Laboratory of Air-based Information Perception and Fusion, Luoyang, China
Abstract:
In visual Reinforcement Learning (RL), the challenge of generalization to new environments is paramount. This study pioneers a theoretical analysis of visual RL generalization, establishing an upper bound on the generalization objective that encompasses policy divergence and Bellman error components. Motivated by this analysis, we propose maintaining cross-domain consistency for each policy in the policy space, which can reduce the divergence of the learned policy during testing. In practice, we introduce the Truncated Return Prediction (TRP) task, which promotes cross-domain policy consistency by predicting truncated returns of historical trajectories. Moreover, we also propose a Transformer-based predictor for this auxiliary task. Extensive experiments on the DeepMind Control Suite and Robotic Manipulation tasks demonstrate that TRP achieves state-of-the-art generalization performance. We further demonstrate that TRP outperforms previous methods in terms of sample efficiency during training.
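An illustrative sketch of the TRP auxiliary objective under simple assumptions (per-trajectory tensors, a generic predictor module, MSE regression); the authors' exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def truncated_returns(rewards, horizon):
    """rewards: (T,). Returns targets where target[t] = sum of rewards[t : t+horizon]."""
    T = rewards.size(0)
    targets = torch.zeros_like(rewards)
    for t in range(T):
        targets[t] = rewards[t:t + horizon].sum()
    return targets

def trp_auxiliary_loss(predictor, history_embeddings, rewards, horizon=5):
    """
    predictor          : any module mapping a (T, D) history encoding to (T,) predictions.
    history_embeddings : (T, D) per-step encodings of the observed trajectory.
    Regressing the truncated return from each step's history is shared across
    visual domains, which encourages cross-domain policy consistency.
    """
    preds = predictor(history_embeddings).squeeze(-1)        # (T,)
    return F.mse_loss(preds, truncated_returns(rewards, horizon))
```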



Paperid:619
Authors:Tianqi Wang, Sukmin Kim, Ji Wenxuan, Enze Xie, Chongjian Ge, Junsong Chen, Zhenguo Li, Ping Luo
The University of Hong Kong, The University of Hong Kong, The University of Hong Kong, Huawei Noah's Ark Lab, The University of Hong Kong, Dalian University of Technology, Huawei Noah's Ark Lab, The University of Hong Kong
Abstract:
Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident dataset includes 57K annotated frames and 285K annotated samples, approximately 7 times more than the large-scale nuScenes dataset with 40K annotated samples. In addition, we propose a new task, end-to-end motion and accident prediction, which can be used to directly evaluate the accident prediction ability of different autonomous driving algorithms. Furthermore, for each scenario, we set four vehicles along with one infrastructure to record data, thus providing diverse viewpoints for accident scenarios and enabling V2X (vehicle-to-everything) research on perception and prediction tasks. Finally, we present a baseline V2X model named V2XFormer that demonstrates superior performance for motion and accident prediction and 3D object detection compared to the single-vehicle model.



Paperid:620
Authors:Weishuai Wang, Ting Lei, Qingchao Chen, Yang Liu
Wangxuan Institute of Computer Technology, Peking University, Wangxuan Institute of Computer Technology, Peking University, National Institute of Health Data Science, Peking University, Wangxuan Institute of Computer Technology, Peking University
Abstract:
The Novel Category Discovery problem aims to cluster an unlabeled set with the help of a labeled set consisting of disjoint but related classes. However, existing models treat class names as discrete one-hot labels and ignore the semantic understanding of these classes. In this paper, we propose a new setting named Semantic-guided Novel Category Discovery (SNCD), which requires the model to not only cluster the unlabeled images but also semantically recognize them based on a set of their class names. The first challenge we confront pertains to effectively leveraging the class names of unlabeled images, given the inherent gap between the visual and linguistic domains. To address this issue, we incorporate a semantic-aware recognition mechanism. This is achieved by constructing dynamic class-wise visual prototypes as well as a semantic similarity matrix that enables the projection of visual features into the semantic space. The second challenge originates from the granularity disparity between the classification and clustering tasks. To deal with this, we develop a semantic-aware clustering process to facilitate the exchange of knowledge between the two tasks. Through extensive experiments, we demonstrate the mutual benefits of the recognition and clustering tasks, which can be jointly optimized. Experimental results on multiple datasets confirm the effectiveness of our proposed method. Our code is available at https://github.com/wang-weishuai/Semantic-guided-NCD.



Paperid:621
Authors:Xiao Wang, Zongzhen Wu, Bo Jiang, Zhimin Bao, Lin Zhu, Guoqi Li, Yaowei Wang, Yonghong Tian
School of Computer Science and Technology, Anhui University, School of Computer Science and Technology, Anhui University, School of Computer Science and Technology, Anhui University, Tencent, Beijing Institute of Technology, University of Chinese Academy of Sciences Peng Cheng Laboratory, Peng Cheng Laboratory, Peking University Peng Cheng Laboratory
Abstract:
The main streams of human activity recognition (HAR) algorithms are developed based on RGB cameras, which usually suffer from illumination changes, fast motion, privacy concerns, and large energy consumption. Meanwhile, biologically inspired event cameras have attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, and low power. As the event camera is a newly arising sensor, there is not yet a realistic large-scale dataset for HAR. Considering its great practical value, in this paper, we propose a large-scale benchmark dataset to bridge this gap, termed HARDVS, which contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, which provide extensive baselines for future works to compare against. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event-stream-based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, then encodes and fuses the dual-view representations using Transformer networks. Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets fully validate the effectiveness of our model. Both the dataset and source code will be released at https://github.com/Event-AHU/HARDVS.



Paperid:622
Authors:Xiao Wang, Wentao Wu, Chenglong Li, Zhicheng Zhao, Zhe Chen, Yukai Shi, Jin Tang
Anhui University, Anhui University, Anhui University, Anhui University, La Trobe University, Guangdong University of Technology, Anhui University
Abstract:
Understanding vehicles in images is important for various applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the specific characteristics of vehicle perception in different tasks and might thus lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates structural information, including the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions, for effective masked vehicle appearance reconstruction. To be specific, we explicitly extract the sketch lines of vehicles as a form of spatial structure to guide vehicle reconstruction. The more comprehensive knowledge distilled from the CLIP big model based on the similarity between paired/unpaired vehicle image-text samples is further taken into consideration to help achieve a better understanding of vehicles. A large-scale dataset is built to pre-train our model, termed Autobot1M, which contains about 1M vehicle images and 12693 text descriptions. Extensive experiments on four vehicle-based downstream tasks fully validate the effectiveness of our VehicleMAE. The source code and pre-trained models will be released at https://github.com/Event-AHU/VehicleMAE.



Paperid:623
Authors:Xijun Wang, Anqi Liang, Junbang Liang, Ming Lin, Yu Lou, Shan Yang
University of Maryland, College Park, USA Amazon, USA, Amazon, USA, Amazon, USA, University of Maryland, College Park, USA Amazon, USA, Amazon, USA, Amazon, USA
Abstract:
Scene-aware Complementary Item Retrieval (CIR) is a challenging task that requires generating a set of compatible items across domains. Due to its subjectivity, it is difficult to set up a rigorous standard for both data collection and learning objectives. To address this challenging task, we propose a visual compatibility concept composed of similarity (resembling color, geometry, texture, etc.) and complementarity (different items, like a table vs. a chair, completing a group). Based on this notion, we propose a compatibility learning framework, a category-aware Flexible Bidirectional Transformer (FBT), for visual ``scene-based set compatibility reasoning'' with cross-domain visual similarity input and auto-regressive complementary item generation. The FBT consists of an encoder with flexible masking, a category prediction arm, and an auto-regressive visual embedding prediction arm. The inputs for FBT are cross-domain visual-similarity-invariant embeddings, making this framework quite generalizable. Furthermore, our proposed FBT model learns the inter-object compatibility from a large set of scene images in a self-supervised way. Compared with the SOTA methods, this approach achieves up to 5.3% and 9.6% improvement in FITB score and 22.3% and 31.8% SFID improvement on fashion and furniture, respectively.



Paperid:624
Authors:Xinshun Wang, Qiongjie Cui, Chen Chen, Mengyuan Liu
School of Intelligent Systems Engineering, Sun Yat-sen University National Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, Xiaohongshu Inc., Center for Research in Computer Vision, University of Central Florida, National Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
Abstract:
The past few years have witnessed the dominance of Graph Convolutional Networks (GCNs) over human motion prediction. Various styles of graph convolutions have been proposed, each meticulously designed and incorporated into a carefully-crafted network architecture. This paper breaks the limits of existing knowledge by proposing Universal Graph Convolution (UniGC), a novel graph convolution concept that re-conceptualizes different graph convolutions as its special cases. Leveraging UniGC at the network level, we propose GCNext, a novel GCN-building paradigm that dynamically determines the best-fitting graph convolutions both sample-wise and layer-wise. GCNext offers multiple use cases, including training a new GCN from scratch or refining a pre-existing GCN. Experiments on the Human3.6M, AMASS, and 3DPW datasets show that, by incorporating unique module-to-network designs, GCNext yields up to 9x lower computational cost than existing GCN methods, on top of achieving state-of-the-art performance. Our code is available at https://github.com/BradleyWang0416/GCNext.



Paperid:625
Authors:Yabing Wang, Fan Wang, Jianfeng Dong, Hao Luo
Zhejiang Gongshang University and Xi’an Jiaotong University and DAMO Academy, Alibaba Group, DAMO Academy, Alibaba Group, Zhejiang Gongshang University and Zhejiang Key Lab of E-Commerce, DAMO Academy, Alibaba Group and Hupan Lab, Zhejiang Province
Abstract:
Cross-lingual cross-modal retrieval has garnered increasing attention recently; it aims to achieve alignment between vision and a target language (V-T) without using any annotated V-T data pairs. Current methods employ machine translation (MT) to construct pseudo-parallel data pairs, which are then used to learn a multi-lingual and multi-modal embedding space that aligns visual and target-language representations. However, the large heterogeneous gap between vision and text, along with the noise present in target-language translations, poses significant challenges in effectively aligning their representations. To address these challenges, we propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and the target language using cross-lingual transfer. This approach allows us to fully leverage the merits of multi-lingual pre-trained models (e.g., mBERT) and the benefit of transferring within the same modality, i.e., a smaller gap, to provide reliable and comprehensive semantic correspondence (knowledge) for the cross-modal network. We evaluate our proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX. The results clearly demonstrate the effectiveness of our proposed method and its high potential for large-scale retrieval.



Paperid:626
Authors:Yang Wang, Tao Zhang
Jiangnan University, Jiangnan University
Abstract:
Recently, several lightweight methods have been proposed to implement single-image super-resolution (SISR) on resource-constrained devices. However, these methods primarily focus on simplifying network structures without fully utilizing shallow features. The fact remains that shallow features encompass crucial details for the super-resolution task, including edges, textures, and colors. Therefore, it is necessary to develop a novel architecture that can effectively integrate features from different levels and capitalize on their mutual complementarity. We first analyze the relationship between multi-stage features and the restoration task in a classic lightweight SR method. Based on these observations, we propose an Omni-Stage Feature Fusion (OSFF) architecture, which incorporates Original Image Stacked Initialisation, Shallow Feature Global Connection, and Multi-Receptive Field Dynamic Fusion. An Attention-Enhanced Feature Distillation module is also designed to enhance the model performance. Finally, leveraging these contributions, we construct an Omni-Stage Feature Fusion Network (OSFFNet). Through extensive experiments on various benchmark datasets, the proposed model outperforms state-of-the-art methods. Notably, it achieves a 0.26dB PSNR improvement over the second-best method for x2 SR on the Urban100 dataset.



Paperid:627
Authors:Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, Xi Li
Renmin University of China, Northwestern Polytechnical University, Renmin University of China, Wuhan University, Renmin University of China, Zhejiang University
Abstract:
Never having seen an object and heard its sound simultaneously, can a model still accurately localize the object's visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks, but under the demanding zero-shot and few-shot scenarios. To achieve this goal, different from existing approaches that mostly employ the encoder-fusion-decoder paradigm to decode localization information from the fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm, aiming to better handle the data scarcity and varying data distribution dilemmas with the help of abundant knowledge from pre-trained models. Specifically, we first propose to construct a Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects; meanwhile, the semantic gap between the visual and audio modalities is also encouraged to shrink. Then, we develop a Correlation Adapter (ColA) to keep training efforts minimal while maintaining adequate knowledge of the visual foundation model. With these means, extensive experiments demonstrate that this new paradigm outperforms other fusion-based methods in both the unseen-class and cross-dataset settings. We hope that our work can further promote the generalization study of Audio-Visual Localization and Segmentation in practical application scenarios. Project page: https://github.com/GeWu-Lab/Generalizable-Audio-Visual-Segmentation



Paperid:628
Authors:Yasi Wang, Hong Liu, Chao Zhang, Lu Xu, Qiang Wang
Samsung Research China - Beijing, Eindhoven University of Technology, Samsung Research China - Beijing, Samsung Research China - Beijing, Samsung Research China - Beijing
Abstract:
Homography estimation is a fundamental problem in computer vision. Previous works mainly focus on estimating either a single homography or multiple homographies based on a mesh-grid division of the image. In practical scenarios, a single homography is inadequate and often leads to a compromised result for multiple planes, while mesh-grid multi-homography damages the plane distribution of the scene and does not fully address the restrictions on using homography. In this work, we propose a novel semantics-guided multi-homography estimation framework, Mask-Homo, to provide an explicit solution to the multi-plane depth disparity problem. First, a pseudo plane mask generation module is designed to obtain multiple correlated regions that follow the plane distribution of the scene. Then, multiple local homography transformations, each of which aligns a correlated region precisely, are predicted, and the corresponding warped images are fused to obtain the final result. Furthermore, a new metric, Mask-PSNR, is proposed for a more comprehensive evaluation of alignment. Extensive experiments are conducted to verify the effectiveness of the proposed method. Our code is available at https://github.com/SAITPublic/MaskHomo.



Paperid:629
Authors:Yi Wang, Jiaze Wang, Jinpeng Li, Zixu Zhao, Guangyong Chen, Anfeng Liu, Pheng Ann Heng
Central South University, The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, Zhejiang Lab, Central South University, The Chinese University of Hong Kong
Abstract:
Data augmentation is an effective regularization strategy for mitigating overfitting in deep neural networks, and it plays a crucial role in 3D vision tasks, where the point cloud data is relatively limited. While mixing-based augmentation has shown promise for point clouds, previous methods mix point clouds either at the block level or at the point level, which has constrained their ability to strike a balance between generating diverse training samples and preserving the local characteristics of point clouds. The significance of each part of a point cloud has not been fully considered, as not all parts contribute equally to the classification task, and some parts may contain unimportant or redundant information. To overcome these challenges, we propose PointPatchMix, a novel approach that mixes point clouds at the patch level and integrates a patch scoring module to generate content-based targets for mixed point clouds. Our approach preserves local features at the patch level, while the patch scoring module assigns targets based on the content-based significance score from a pre-trained teacher model. We evaluate PointPatchMix on two benchmark datasets, ModelNet40 and ScanObjectNN, and demonstrate significant improvements over various baselines on both synthetic and real-world datasets, as well as in few-shot settings. With Point-MAE as our baseline, our model surpasses previous methods by a significant margin. Furthermore, our approach shows strong generalization across various point cloud methods and enhances the robustness of the baseline model. Code is available at https://jiazewang.com/projects/pointpatchmix.html.
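A minimal sketch of patch-level mixing with score-weighted soft targets, assuming the point clouds are already grouped into patches and per-patch significance scores are available from a teacher; all names and the mixing ratio are illustrative, not the released PointPatchMix code.

```python
import torch

def point_patch_mix(patches_a, patches_b, scores_a, scores_b, label_a, label_b, ratio=0.5):
    """
    patches_a/b : (P, K, 3) point patches of two point clouds (P patches, K points each).
    scores_a/b  : (P,) per-patch significance scores (e.g. from a pre-trained teacher).
    label_a/b   : (C,) one-hot labels of the two samples.
    A random `ratio` of the patches from sample B replaces patches of sample A;
    the mixed target is weighted by the total score each source contributes,
    rather than by the raw patch count.
    """
    P = patches_a.size(0)
    idx_b = torch.randperm(P)[:int(P * ratio)]        # patches taken from sample B
    mixed = patches_a.clone()
    mixed[idx_b] = patches_b[idx_b]

    keep_mask = torch.ones(P, dtype=torch.bool)
    keep_mask[idx_b] = False
    w_a = scores_a[keep_mask].sum()
    w_b = scores_b[idx_b].sum()
    target = (w_a * label_a + w_b * label_b) / (w_a + w_b)
    return mixed, target
```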



Paperid:630
Authors:Yijie Wang, Mingjian Hong, Luwen Huangfu, Sheng Huang
Chongqing University, Chongqing University, Fowler College of Business, San Diego State University, Chongqing University
Abstract:
In the realm of Zero-Shot Learning (ZSL), we address biases in Generalized Zero-Shot Learning (GZSL) models, which favor seen data. To counter this, we introduce an end-to-end generative GZSL framework called D3GZSL. This framework treats seen and synthesized unseen data as in-distribution and out-of-distribution data, respectively, for a more balanced model. D3GZSL comprises two core modules: in-distribution dual space distillation (ID2SD) and out-of-distribution batch distillation (O2DBD). ID2SD aligns teacher-student outcomes in the embedding and label spaces, enhancing learning coherence. O2DBD introduces low-dimensional out-of-distribution representations per batch sample, capturing shared structures between seen and unseen categories. Our approach demonstrates its effectiveness across established GZSL benchmarks, seamlessly integrating into mainstream generative frameworks. Extensive experiments consistently showcase that D3GZSL elevates the performance of existing generative GZSL methods, underscoring its potential to refine zero-shot learning practices. The code is available at: https://github.com/PJBQ/D3GZSL.git



Paperid:631
Authors:Yinqiao Wang, Hao Xu, Pheng Ann Heng, Chi-Wing Fu
The Chinese University of Hong Kong (CUHK), The Chinese University of Hong Kong (CUHK), The Chinese University of Hong Kong (CUHK), The Chinese University of Hong Kong (CUHK)
Abstract:
Estimating 3D hand meshes from RGB images is a long-standing track in which occlusion is one of the most challenging problems. Existing attempts at this task often fail when occlusion dominates the image space. In this paper, we propose SiMA-Hand, aiming to boost the mesh reconstruction performance by Single-to-Multi-view Adaptation. First, we design a multi-view hand reconstructor to fuse information across multiple views by holistically adopting feature fusion at the image, joint, and vertex levels. Then, we introduce a single-view hand reconstructor equipped with SiMA. Though taking only one view as input at inference, the shape and orientation features in the single-view reconstructor can be enriched by learning non-occluded knowledge from the extra views at training, enhancing the reconstruction precision on the occluded regions. We conduct experiments on the Dex-YCB and HanCo benchmarks with challenging object- and self-caused occlusion cases, showing that SiMA-Hand consistently achieves superior performance over the state of the arts. Code will be released on https://github.com/JoyboyWang/SiMA-Hand Pytorch.



Paperid:632
Authors:Youhong Wang, Yunji Liang, Hao Xu, Shaohui Jiao, Hongkai Yu
Northwestern Polytechnical University Bytedance, Northwestern Polytechnical University, Bytedance, Bytedance, Cleveland State University
Abstract:
Recently, self-supervised monocular depth estimation has gained popularity, with numerous applications in autonomous driving and robotics. However, existing solutions primarily seek to estimate depth from immediate visual features and struggle to recover fine-grained scene details. In this paper, we introduce SQLdepth, a novel approach that can effectively learn fine-grained scene structure priors from ego-motion. In SQLdepth, we propose a novel Self Query Layer (SQL) to build a self-cost volume and infer depth from it, rather than inferring depth from feature maps. We show that the self-cost volume is an effective inductive bias for geometry learning, which implicitly models the single-frame scene geometry, with each slice of it indicating a relative distance map between points and objects in a latent space. Experimental results on KITTI and Cityscapes show that our method attains remarkable state-of-the-art performance, and showcases computational efficiency, reduced training complexity, and the ability to recover fine-grained scene details. Moreover, the self-matching-oriented relative distance querying in SQL improves the robustness and zero-shot generalization capability of SQLdepth. Code is available at https://github.com/hisfog/SfMNeXt-Impl.



Paperid:633
Authors:Yu Wang, Chao Tong
Sino-French Engineer School, Beihang University, School of Computer Science and Engineering, Beihang University State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Abstract:
3D Semantic Scene Completion (SSC) has emerged as a novel task in vision-based holistic 3D scene understanding. Its objective is to densely predict the occupancy and category of each voxel in a 3D scene based on input from either LiDAR or images. Currently, many transformer-based semantic scene completion frameworks employ simple yet popular cross-attention and self-attention mechanisms to integrate and infer dense geometric and semantic information of voxels. However, they overlook the distinctions among voxels in the scene, especially in outdoor scenarios where the horizontal direction contains more variations, and voxels located at object boundaries and within the interior of objects exhibit varying levels of positional significance. To address this issue, we propose a transformer-based SSC framework called H2GFormer that incorporates a horizontal-to-global approach. This framework takes into full consideration the variations of voxels in the horizontal direction and the characteristics of voxels on object boundaries. We introduce a horizontal window-to-global attention (W2G) module that effectively fuses semantic information by first diffusing it horizontally from reliably visible voxels and then propagating the semantic understanding to global voxels, ensuring a more reliable fusion of semantic-aware features. Moreover, an Internal-External Position Awareness Loss (IoE-PALoss) is utilized during network training to emphasize the critical positions within the transition regions between objects. The experiments conducted on the SemanticKITTI dataset demonstrate that H2GFormer exhibits superior performance in both geometric and semantic completion tasks. Our code is available at https://github.com/Ryanwy1/H2GFormer.



Paperid:634
Authors:Yu Wang, Junxian Mu, Pengfei Zhu, Qinghua Hu
Tianjin University, Tianjin University, Tianjin University, Tianjin University
Abstract:
Open set recognition (OSR) requires the model to classify samples that belong to closed sets while rejecting unknown samples during testing. Currently, generative models often perform better than discriminative models in OSR, but recent studies show that generative models may be computationally infeasible or unstable on complex tasks. In this paper, we provide insights into OSR and find that learning supplementary representations can theoretically reduce the open space risk. Based on the analysis, we propose a new model, namely Multi-Expert Diverse Attention Fusion (MEDAF), that learns diverse representations in a discriminative way. MEDAF consists of multiple experts that are learned with an attention diversity regularization term to ensure the attention maps are mutually different. The logits learned by each expert are adaptively fused and used to identify the unknowns through the score function. We show that the differences in attention maps can lead to diverse representations, so that the fused representations can well handle the open space. Extensive experiments are conducted on standard and large-scale OSR benchmarks. Results show that the proposed discriminative method can outperform existing generative models by up to 9.5% on AUROC and achieve new state-of-the-art performance with little computational cost. Our method can also seamlessly integrate with existing classification models. Code is available at https://github.com/Vanixxz/MEDAF.
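For illustration only, the attention diversity regularization and adaptive logit fusion described above might be sketched as follows; the map shapes, gating weights, and score function are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn.functional as F

def attention_diversity_loss(attention_maps):
    """
    attention_maps: (E, B, H, W) spatial attention maps from E experts.
    Penalizes pairwise cosine similarity between experts' maps so that each
    expert attends to different image regions (mutually different maps).
    """
    E, B = attention_maps.shape[:2]
    flat = F.normalize(attention_maps.reshape(E, B, -1), dim=-1)     # (E, B, HW)
    loss, pairs = 0.0, 0
    for i in range(E):
        for j in range(i + 1, E):
            loss = loss + (flat[i] * flat[j]).sum(dim=-1).mean()     # per-sample cosine sim
            pairs += 1
    return loss / pairs

def fuse_expert_logits(expert_logits, gate_weights):
    """
    expert_logits: (E, B, C); gate_weights: (B, E) softmax weights.
    Adaptively fuses per-expert logits; the maximum softmax probability of the
    fused logits can then serve as a known/unknown score.
    """
    return torch.einsum('ebc,be->bc', expert_logits, gate_weights)
```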



Paperid:635
Authors:Yu-Hsiang Wang, Jun-Wei Hsieh, Ping-Yang Chen, Ming-Ching Chang, Hung-Hin So, Xin Li
National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, University at Albany - SUNY, The Chinese University of Hong Kong, University at Albany - SUNY
Abstract:
Despite recent progress in Multiple Object Tracking (MOT), several obstacles such as occlusions, similar objects, and complex scenes remain an open challenge. Meanwhile, a systematic study of the cost-performance tradeoff for the popular tracking-by-detection paradigm is still lacking. This paper introduces SMILEtrack, an innovative object tracker that effectively addresses these challenges by integrating an efficient object detector with a Siamese network-based Similarity Learning Module (SLM). The technical contributions of SMILEtrack are twofold. First, we propose an SLM that calculates the appearance similarity between two objects, overcoming the limitations of feature descriptors in Separate Detection and Embedding (SDE) models. The SLM incorporates a Patch Self-Attention (PSA) block inspired by the vision Transformer, which generates reliable features for accurate similarity matching. Second, we develop a Similarity Matching Cascade (SMC) module with a novel GATE function for robust object matching across consecutive video frames, further enhancing MOT performance. Together, these innovations help SMILEtrack achieve an improved trade-off between cost (e.g., running speed) and performance (e.g., tracking accuracy) over several existing state-of-the-art benchmarks, including the popular BYTETrack method. SMILEtrack outperforms BYTETrack by 0.4-0.8 MOTA and 2.1-2.2 HOTA points on the MOT17 and MOT20 datasets. Code is available at http://github.com/pingyang1117/SMILEtrack_official.



Paperid:636
Authors:Yubin Wang, Xinyang Jiang, De Cheng, Dongsheng Li, Cairong Zhao
Tongji University, Microsoft Research Asia, Xidian University, Microsoft Research Asia, Tongji University
Abstract:
Prompt learning has become a prevalent strategy for adapting vision-language foundation models to downstream tasks. As large language models (LLMs) have emerged, recent studies have explored the use of category-related descriptions as input to enhance prompt effectiveness. Nevertheless, conventional descriptions lack structured information that effectively represents the interconnections among entities or attributes linked to a particular category. To address this limitation and prioritize harnessing structured knowledge, this paper advocates leveraging LLMs to build a graph for each description to model the entities and attributes describing the category, as well as their correlations. Pre-existing prompt tuning methods exhibit inadequacies in managing this structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), which enables simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts modeling overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Extensive experiments demonstrate that our HPT shows strong effectiveness and generalizes much better than existing SOTA methods. Our code is available at https://github.com/Vill-Lab/2024-AAAI-HPT.



Paperid:637
Authors:Yuhao Wang, Xuehu Liu, Pingping Zhang, Hu Lu, Zhengzheng Tu, Huchuan Lu
School of Future Technology, School of Artificial Intelligence, Dalian University of Technology, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, School of Future Technology, School of Artificial Intelligence, Dalian University of Technology, School of Computer Science and Communication Engineering, Jiangsu University, School of Computer Science and Technology, Anhui University, School of Future Technology, School of Artificial Intelligence, Dalian University of Technology
Abstract:
Multi-spectral object re-identification (ReID) aims to retrieve specific objects by leveraging complementary information from different image spectra. It delivers great advantages over traditional single-spectral ReID in complex visual environments. However, the significant distribution gap among different image spectra poses great challenges for effective multi-spectral feature representations. In addition, most current Transformer-based ReID methods only utilize the global feature of class tokens to achieve holistic retrieval, ignoring the local discriminative ones. To address the above issues, we step further to utilize all the tokens of Transformers and propose a cyclic token permutation framework for multi-spectral object ReID, dubbed TOP-ReID. More specifically, we first deploy a multi-stream deep network based on vision Transformers to preserve distinct information from different image spectra. Then, we propose a Token Permutation Module (TPM) for cyclic multi-spectral feature aggregation. It not only facilitates the spatial feature alignment across different image spectra, but also allows the class token of each spectrum to perceive the local details of other spectra. Meanwhile, we propose a Complementary Reconstruction Module (CRM), which introduces dense token-level reconstruction constraints to reduce the distribution gap across different image spectra. With the above modules, our proposed framework can generate more discriminative multi-spectral features for robust object ReID. Extensive experiments on three ReID benchmarks (i.e., RGBNT201, RGBNT100, and MSVR310) verify the effectiveness of our methods. The code is available at https://github.com/924973292/TOP-ReID.
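An illustrative sketch (assumed shapes, not the released TPM code) of a cyclic class-token permutation across spectra: each spectrum's class token is re-attached to the patch tokens of the next spectrum before further Transformer blocks, so it can perceive that spectrum's local details.

```python
import torch

def cyclic_token_permutation(tokens_per_spectrum):
    """
    tokens_per_spectrum: list of S tensors, each (B, 1 + N, D), where index 0 is
    the class token and the remaining N entries are patch tokens of one spectrum.
    Cyclically pairs each spectrum's class token with the patch tokens of the
    next spectrum, so every class token can attend to another spectrum's local
    tokens in the following Transformer blocks.
    """
    S = len(tokens_per_spectrum)
    permuted = []
    for s in range(S):
        cls_token = tokens_per_spectrum[s][:, :1]                 # (B, 1, D)
        next_patches = tokens_per_spectrum[(s + 1) % S][:, 1:]    # (B, N, D)
        permuted.append(torch.cat([cls_token, next_patches], dim=1))
    return permuted
```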



Paperid:638
Authors:Yuting Wang, Jinpeng Wang, Bin Chen, Ziyun Zeng, Shu-Tao Xia
Tsinghua Shenzhen International Graduate School, Tsinghua University Research Center of Artificial Intelligence, Peng Cheng Laboratory, Tsinghua Shenzhen International Graduate School, Tsinghua University Research Center of Artificial Intelligence, Peng Cheng Laboratory, Harbin Institute of Technology, Shenzhen Research Center of Artificial Intelligence, Peng Cheng Laboratory, Tsinghua Shenzhen International Graduate School, Tsinghua University Research Center of Artificial Intelligence, Peng Cheng Laboratory, Tsinghua Shenzhen International Graduate School, Tsinghua University Research Center of Artificial Intelligence, Peng Cheng Laboratory
Abstract:
Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and requires a large storage overhead. To solve the efficiency problem of PRVR methods, this paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly. During frame interactions, we incorporate Gaussian-Mixture-Model constraints to focus each frame on its adjacent frames instead of the whole video. The generated representations will then contain multi-scale clip information, achieving implicit clip modeling. In addition, PRVR methods ignore semantic differences between text queries relevant to the same video, leading to a sparse embedding space. We propose a query diverse loss to distinguish these text queries, making the embedding space more intensive and able to contain more semantic information. Extensive experiments on three large-scale video datasets (i.e., TVR, ActivityNet Captions, and Charades-STA) demonstrate the superiority and efficiency of GMMFormer.
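A hedged sketch of the Gaussian constraint idea: an additive bias proportional to the log of a Gaussian over frame distance can be added to the attention logits so each frame mainly attends to its neighbors, with several widths giving multi-scale clip information. The specific widths below are assumptions, not the authors' configuration.

```python
import torch

def gaussian_attention_bias(num_frames, sigmas=(1.0, 4.0, 16.0)):
    """
    Builds one (num_frames, num_frames) additive attention bias per Gaussian width.
    Adding -|i - j|^2 / (2 * sigma^2) to the attention logits makes frame i focus
    mostly on its neighbors; different sigmas yield multi-scale clip information.
    """
    pos = torch.arange(num_frames).float()
    dist2 = (pos[:, None] - pos[None, :]) ** 2          # squared frame distance
    return [-dist2 / (2.0 * s ** 2) for s in sigmas]    # list of (T, T) biases

# Usage sketch (illustrative): logits = q @ k.transpose(-2, -1) / d ** 0.5 + bias
```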



Paperid:639
Authors:Yuzheng Wang, Zhaoyu Chen, Dingkang Yang, Pinxue Guo, Kaixun Jiang, Wenqiang Zhang, Lizhe Qi
Fudan University, Fudan University, Fudan University, Fudan University, Fudan University, Fudan University, Fudan University
Abstract:
Adversarial Robustness Distillation (ARD) is a promising task to address the limited adversarial robustness of small-capacity models while reducing the expensive computational costs of Adversarial Training (AT). Despite their good robust performance, the existing ARD methods are still impractical to deploy in natural high-security scenes because these methods rely entirely on original or publicly available data with a similar distribution. In fact, these data are almost always private, specific, and distinctive for scenes that require high robustness. To tackle these issues, we propose a challenging but significant task called Data-Free Adversarial Robustness Distillation (DFARD), which aims to train small, easily deployable, robust models without relying on data. We demonstrate that the challenge lies in the lower upper bound of knowledge transfer information, making it crucial to mine and transfer knowledge more efficiently. Inspired by human education, we design a plug-and-play Interactive Temperature Adjustment (ITA) strategy to improve the efficiency of knowledge transfer and propose an Adaptive Generator Balance (AGB) module to retain more data information. Our method uses adaptive hyperparameters to avoid a large amount of parameter tuning and significantly outperforms combinations of existing techniques. Meanwhile, our method achieves stable and reliable performance on multiple benchmarks.



Paperid:640
Authors:Zengbin Wang, Saihui Hou, Man Zhang, Xu Liu, Chunshui Cao, Yongzhen Huang, Peipei Li, Shibiao Xu
Beijing University of Posts and Telecommunications, Beijing Normal University Watrix AI, Beijing University of Posts and Telecommunications, Watrix AI, Watrix AI, Beijing Normal University Watrix AI, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Gait recognition is a promising biometric method that aims to identify pedestrians from their unique walking patterns. The silhouette modality, renowned for its easy acquisition, simple structure, sparse representation, and convenient modeling, has been widely employed in controlled in-the-lab research. However, as gait recognition rapidly advances from in-the-lab to in-the-wild scenarios, various conditions raise significant challenges for the silhouette modality, including 1) unidentifiable low-quality silhouettes (abnormal segmentation, severe occlusion, or even non-human shapes), and 2) identifiable but challenging silhouettes (background noise, non-standard posture, slight occlusion). To address these challenges, we revisit the gait recognition pipeline and approach gait recognition from a quality perspective, namely QAGait. Specifically, we propose a series of cost-effective quality assessment strategies, including Maximal Connect Area and Template Match to eliminate background noise and unidentifiable silhouettes, and an Alignment strategy to handle non-standard postures. We also propose two quality-aware loss functions to integrate silhouette quality into optimization within the embedding space. Extensive experiments demonstrate that our QAGait can guarantee both gait reliability and performance enhancement. Furthermore, our quality assessment strategies can seamlessly integrate with existing gait datasets, showcasing our superiority. Code is available at https://github.com/wzb-bupt/QAGait.
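A minimal sketch of a Maximal Connect Area style quality check, assuming binary silhouette masks and a ratio threshold (both illustrative): frames whose largest connected foreground component is too small relative to all foreground pixels are discarded as unidentifiable.

```python
import numpy as np
from scipy import ndimage

def passes_maximal_connect_area(silhouette, min_ratio=0.7):
    """
    silhouette: (H, W) binary mask of one frame.
    Keeps the frame only if the largest connected foreground component covers at
    least `min_ratio` of all foreground pixels, filtering heavily fragmented or
    noisy segmentations.
    """
    foreground = silhouette > 0
    labeled, num = ndimage.label(foreground)
    if num == 0:
        return False
    sizes = ndimage.sum(foreground, labeled, index=range(1, num + 1))
    return sizes.max() / sizes.sum() >= min_ratio
```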



Paperid:641
Authors:Zhaoyang Wang, Dongyang Li, Mingyang Zhang, Hao Luo, Maoguo Gong
Ministry of Education, Key Laboratory of Collaborative Intelligence Systems, Xidian University DAMO Academy, Alibaba Group, 310023, Hangzhou, China, DAMO Academy, Alibaba Group, 310023, Hangzhou, China Hupan Lab, 310023, Hangzhou, China, Ministry of Education, Key Laboratory of Collaborative Intelligence Systems, Xidian University, DAMO Academy, Alibaba Group, 310023, Hangzhou, China Hupan Lab, 310023, Hangzhou, China, Ministry of Education, Key Laboratory of Collaborative Intelligence Systems, Xidian University
Abstract:
Existing hyperspectral image (HSI) super-resolution (SR) methods struggle to effectively capture the complex spectral-spatial relationships and low-level details, while diffusion models represent a promising generative model known for their exceptional performance in modeling complex relations and learning high- and low-level visual features. The direct application of diffusion models to HSI SR is hampered by challenges such as difficulties in model convergence and protracted inference time. In this work, we introduce a novel Group-Autoencoder (GAE) framework that synergistically combines with the diffusion model to construct a highly effective HSI SR model (DMGASR). Our proposed GAE framework encodes high-dimensional HSI data into a low-dimensional latent space where the diffusion model works, thereby alleviating the difficulty of training the diffusion model while maintaining band correlation and considerably reducing inference time. Experimental results on both natural and remote sensing hyperspectral datasets demonstrate that the proposed method is superior to other state-of-the-art methods both visually and metrically.



Paperid:642
Authors:Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, Ram Rajagopal
Stanford University, Stanford University, Stanford University, Stanford University, Stanford University
Abstract:
Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision-language models (VLMs). A key reason is that the large-scale, semantically diverse image-text dataset required for developing VLMs is still absent for remote sensing images. Unlike natural images, remote sensing images and their associated text descriptions cannot be efficiently collected from the public Internet at scale. In this work, we bridge this gap by using geo-coordinates to automatically connect open, unlabeled remote sensing images with rich semantics covered in OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags. With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets. It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval. We hope this dataset can support the advancement of VLMs for various multi-modal tasks in remote sensing, such as open-vocabulary classification, retrieval, captioning, and text-to-image synthesis.



Paperid:643
Authors:Zhehao Wang, Xian Lin, Nannan Wu, Li Yu, Kwang-Ting Cheng, Zengqiang Yan
Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Hong Kong University of Science and Technology, Huazhong University of Science and Technology
Abstract:
Despite its great potential in capturing long-range dependency, one rarely-explored issue of the transformer in medical image segmentation is attention collapse, which makes it often degenerate into a bypass module in CNN-Transformer hybrid architectures. This is because the high computational complexity of vision transformers requires extensive training data, while well-annotated medical image data is relatively limited, resulting in poor convergence. In this paper, we propose a plug-n-play transformer block with dynamic token merging, named DTMFormer, to avoid building long-range dependency on redundant and duplicated tokens and thus pursue better convergence. Specifically, DTMFormer consists of an attention-guided token merging (ATM) module to adaptively cluster tokens into fewer semantic tokens based on feature and dependency similarity, and a light token reconstruction module to fuse ordinary and semantic tokens. In this way, as self-attention in ATM is calculated on fewer tokens, DTMFormer has lower complexity and converges more easily. Extensive experiments on publicly available datasets demonstrate the effectiveness of DTMFormer as a plug-n-play module for simultaneous complexity reduction and performance improvement. We believe it will inspire future work on rethinking transformers in medical image segmentation. Code: https://github.com/iam-nacl/DTMFormer.
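An illustrative sketch (not the released ATM code) of attention-guided token merging: tokens that receive the most class-token attention serve as centers, and every token is merged into its most similar center by averaging; the number of semantic tokens is an assumption.

```python
import torch
import torch.nn.functional as F

def attention_guided_token_merging(tokens, cls_attention, num_semantic=64):
    """
    tokens        : (B, N, D) patch tokens.
    cls_attention : (B, N) attention each token receives from the class token.
    The most-attended tokens act as cluster centers; every token is assigned to
    its most similar center and merged by averaging, producing fewer semantic
    tokens on which self-attention is cheaper to compute.
    """
    B, N, D = tokens.shape
    centers_idx = cls_attention.topk(num_semantic, dim=1).indices                   # (B, S)
    centers = torch.gather(tokens, 1, centers_idx.unsqueeze(-1).expand(-1, -1, D))  # (B, S, D)
    sim = F.normalize(tokens, dim=-1) @ F.normalize(centers, dim=-1).transpose(1, 2)  # (B, N, S)
    assign = sim.argmax(dim=-1)                                                     # (B, N)

    merged = torch.zeros_like(centers)
    counts = torch.zeros(B, num_semantic, 1, device=tokens.device)
    merged.scatter_add_(1, assign.unsqueeze(-1).expand(-1, -1, D), tokens)
    counts.scatter_add_(1, assign.unsqueeze(-1), torch.ones(B, N, 1, device=tokens.device))
    return merged / counts.clamp(min=1)
```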



Paperid:644
Authors:Zhengxue Wang, Zhiqiang Yan, Jian Yang
PCA Lab, Nanjing University of Science and Technology, China, PCA Lab, Nanjing University of Science and Technology, China, PCA Lab, Nanjing University of Science and Technology, China
Abstract:
Depth super-resolution (DSR) aims to restore high-resolution (HR) depth from low-resolution (LR) depth, where an RGB image is often used to promote this task. Recent image-guided DSR approaches mainly focus on the spatial domain to rebuild depth structure. However, since the structure of LR depth is usually blurry, considering the spatial domain alone is not sufficient to acquire satisfactory results. In this paper, we propose a structure guided network (SGNet), a method that pays more attention to the gradient and frequency domains, both of which have the inherent ability to capture high-frequency structure. Specifically, we first introduce the gradient calibration module (GCM), which employs the accurate gradient prior of RGB to sharpen the LR depth structure. Then we present the Frequency Awareness Module (FAM), which recursively conducts multiple spectrum differencing blocks (SDB), each of which propagates the precise high-frequency components of RGB into the LR depth. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of our SGNet, reaching the state of the art. Codes and pre-trained models are available at https://github.com/yanzq95/SGNet.



Paperid:645
Authors:Zhicheng Wang, Liwen Xiao, Zhiguo Cao, Hao Lu
Huazhong Univ. of Sci.&Tech., Huazhong Univ. of Sci.&Tech., Huazhong Univ. of Sci.&Tech., Huazhong Univ. of Sci.&Tech.
Abstract:
Class-agnostic counting (CAC) aims to count objects of interest from a query image given a few exemplars. This task is typically addressed by extracting the features of the query image and the exemplars respectively and then matching their feature similarity, leading to an extract-then-match paradigm. In this work, we show that CAC can be simplified in an extract-and-match manner, particularly using a vision transformer (ViT) where feature extraction and similarity matching are executed simultaneously within the self-attention. We reveal the rationale of such simplification from a decoupled view of the self-attention. The resulting model, termed CACViT, simplifies the CAC pipeline into a single pretrained plain ViT. Further, to compensate for the loss of scale and order-of-magnitude information due to resizing and normalization in the plain ViT, we present two effective strategies for scale and magnitude embedding. Extensive experiments on the FSC147 and CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. Code will be available.
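A hedged sketch of the extract-and-match idea in a plain ViT: exemplar tokens and query-image tokens are concatenated into one sequence so that a single self-attention layer performs feature extraction and similarity matching jointly; the block below uses assumed dimensions and a standard PyTorch attention layer rather than the authors' model.

```python
import torch
import torch.nn as nn

class ExtractAndMatchBlock(nn.Module):
    """
    Query-image patch tokens and exemplar patch tokens are concatenated into one
    sequence, so a single self-attention layer both extracts features and matches
    query/exemplar similarities at the same time (dimensions are illustrative).
    """
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, exemplar_tokens):
        # query_tokens: (B, Nq, D), exemplar_tokens: (B, Ne, D)
        x = torch.cat([query_tokens, exemplar_tokens], dim=1)
        x = x + self.attn(self.norm(x), self.norm(x), self.norm(x))[0]
        return x[:, :query_tokens.size(1)], x[:, query_tokens.size(1):]
```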



Paperid:646
Authors:Zhihao Wang, Yulin Zhou, Ningyu Zhang, Xiaosong Yang, Jun Xiao, Zhao Wang
Zhejiang University Ningbo Innovation Center, Zhejiang University, Zhejiang University Ningbo Innovation Center, Zhejiang University, Zhejiang University, Bournemouth University, Zhejiang University, Ningbo Innovation Center, Zhejiang University
Abstract:
Human motion prediction consists of forecasting future body poses from historically observed sequences. It is a long-standing challenge due to the complex dynamics and uncertainty of motion. Existing methods focus on building complicated neural networks to model the motion dynamics, and in the current training pipeline the predicted results are required to be strictly similar to the training samples under an L2 loss. However, little attention has been paid to the uncertainty property, which is crucial to the prediction task. We argue that the recorded motion in the training data could be an observation of one possible future, rather than a predetermined result. In addition, existing works calculate the predicted error on each future frame equally during training, while recent work indicates that different frames could play different roles. In this work, a novel computationally efficient encoder-decoder model with uncertainty consideration is proposed, which can learn proper characteristics for future frames through a dynamic function. Experimental results on benchmark datasets demonstrate that our uncertainty-aware approach has clear advantages both quantitatively and qualitatively. Moreover, the proposed method can produce motion sequences of much better quality and avoids intractable shaking artefacts. We believe our work can provide a novel perspective on considering the uncertainty quality for the general motion prediction task and encourage further studies in this field. The code will be available at https://github.com/Motionpre/Adaptive-Salient-Loss-SAGGB.



Paperid:647
Authors:Zi Wang, Huaibo Huang, Aihua Zheng, Ran He
School of Computer Science and Technology, Anhui University, Hefei, China, MAIS & CRIPAC, CASIA, Beijing, China, Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Artificial Intelligence, Anhui University, Hefei, China, MAIS & CRIPAC, CASIA, Beijing, China
Abstract:
Multi-modal person re-identification (ReID) seeks to mitigate challenging lighting conditions by incorporating diverse modalities. Most existing multi-modal ReID methods concentrate on leveraging complementary multi-modal information via fusion or interaction. However, the relationships among heterogeneous modalities and the domain traits of unlabeled test data are rarely explored. In this paper, we propose a Heterogeneous Test-time Training (HTT) framework for multi-modal person ReID. We first propose a Cross-identity Inter-modal Margin (CIM) loss to amplify the differentiation among distinct identity samples. Moreover, we design a Multi-modal Test-time Training (MTT) strategy to enhance the generalization of the model by leveraging the relationships among the heterogeneous modalities and the information existing in the test data. Specifically, in the training stage, we utilize the CIM loss to further enlarge the distance between anchor and negative by forcing the inter-modal distance to maintain the margin, resulting in an enhancement of the discriminative capacity of the ultimate descriptor. Subsequently, since the test data contains characteristics of the target domain, we adopt the MTT strategy to optimize the network before inference by using self-supervised tasks designed based on relationships among modalities. Experimental results on the benchmark multi-modal ReID datasets RGBNT201, Market1501-MM, RGBN300, and RGBNT100 validate the effectiveness of the proposed method. The code can be found at https://github.com/ziwang1121/HTT.
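A triplet-style sketch of a cross-identity inter-modal margin under simple assumptions (normalized features from two modalities, hardest-negative mining within the batch); the paper's exact CIM loss may differ.

```python
import torch
import torch.nn.functional as F

def cross_identity_inter_modal_margin(feat_m1, feat_m2, ids, margin=0.3):
    """
    feat_m1, feat_m2 : (B, D) features of the same persons in two modalities.
    ids              : (B,) identity labels.
    For each anchor in modality 1, its own identity in modality 2 is pulled
    closer while the hardest different-identity sample in modality 2 is pushed
    away by at least `margin`.
    """
    dist = torch.cdist(F.normalize(feat_m1, dim=1), F.normalize(feat_m2, dim=1))  # (B, B)
    same_id = ids.unsqueeze(0) == ids.unsqueeze(1)
    pos = dist.masked_fill(~same_id, float('inf')).min(dim=1).values   # closest same identity
    neg = dist.masked_fill(same_id, float('inf')).min(dim=1).values    # hardest other identity
    return F.relu(pos - neg + margin).mean()
```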



Paperid:648
Authors:Zichen Wang, Bo Yang, Haonan Yue, Zhenghao Ma
Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University
Abstract:
Few-shot object detection (FSOD) aims at extending a generic detector for novel object detection with only a few training examples. It has attracted great attention recently due to its practical significance. Meta-learning has been demonstrated to be an effective paradigm for this task. In general, methods based on meta-learning employ an additional support branch to encode novel examples (a.k.a. support images) into class prototypes, which are then fused with the query branch to facilitate the model prediction. However, the class-level prototypes are difficult to precisely generate, and they also lack detailed information, leading to instability in performance. New methods are required to capture the distinctive local context for more robust novel object detection. To this end, we propose to distill the most representative support features into fine-grained prototypes. These prototypes are then assigned into query feature maps based on the matching results, modeling the detailed feature relations between two branches. This process is realized by our Fine-Grained Feature Aggregation (FFA) module. Moreover, in terms of high-level feature fusion, we propose a Balanced Class-Agnostic Sampling (B-CAS) strategy and a Non-Linear Fusion (NLF) module from different perspectives. They are complementary to each other and depict the high-level feature relations more effectively. Extensive experiments on PASCAL VOC and MS COCO benchmarks show that our method sets a new state-of-the-art performance in most settings. Our code is available at https://github.com/wangchen1801/FPD.



Paperid:649
Authors:Zifan Wang, Zhuorui Ye, Haoran Wu, Junyu Chen, Li Yi
Tsinghua University Shanghai Qi Zhi Institute, Tsinghua University Shanghai Qi Zhi Institute, Tsinghua University Shanghai Qi Zhi Institute, Tsinghua University Shanghai Qi Zhi Institute, Tsinghua University Shanghai Artificial Intelligence Laboratory Shanghai Qi Zhi Institute
Abstract:
We study a new problem of semantic complete scene forecasting (SCSF) in this work. Given a 4D dynamic point cloud sequence, our goal is to forecast the complete scene corresponding to the next future frame along with its semantic labels. To tackle this challenging problem, we properly model the synergetic relationship between future forecasting and semantic scene completion through a novel network named SCSFNet. SCSFNet leverages a hybrid geometric representation for high-resolution complete scene forecasting. To leverage multi-frame observation as well as the understanding of scene dynamics to ease the completion task, SCSFNet introduces an attention-based skip connection scheme. To ease the need to model occlusion variations and to better focus on the occluded part, SCSFNet utilizes auxiliary visibility grids to guide the forecasting task. To evaluate the effectiveness of SCSFNet, we conduct experiments on various benchmarks including two large-scale indoor benchmarks we contributed and the outdoor SemanticKITTI benchmark. Extensive experiments show SCSFNet outperforms baseline methods on multiple metrics by a large margin, and also prove the synergy between future forecasting and semantic scene completion. The project page with code is available at scsfnet.github.io.



Paperid:650
Authors:Dong Wei, Xiaoning Sun, Huaijiang Sun, Shengxiang Hu, Bin Li, Weiqing Li, Jianfeng Lu
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, Tianjin AiForward Science and Technology Co., Ltd., Tianjin, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
Abstract:
The emergence of text-driven motion synthesis techniques provides animators with great potential to create efficiently. However, in most cases, textual expressions only contain general and qualitative motion descriptions, while lacking fine depiction and sufficient intensity, leading to synthesized motions that are either (a) semantically compliant but uncontrollable over specific pose details, or (b) even deviating from the provided descriptions, presenting animators with undesired cases. In this paper, we propose DiffKFC, a conditional diffusion model for text-driven motion synthesis with KeyFrames Collaborated, enabling realistic generation with collaborative and efficient dual-level control: coarse guidance at the semantic level, with only a few keyframes for direct and fine-grained depiction down to the body posture level. Unlike existing inference-editing diffusion models that incorporate conditions without training, our conditional diffusion model is explicitly trained and can fully exploit correlations among texts, keyframes and the diffused target frames. To preserve the control capability of discrete and sparse keyframes, we customize dilated mask attention modules where only partial valid tokens participate in local-to-global attention, indicated by the dilated keyframe mask. Additionally, we develop a simple yet effective smoothness prior, which steers the generated frames towards seamless keyframe transitions at inference. Extensive experiments show that our model not only achieves state-of-the-art performance in terms of semantic fidelity, but more importantly, is able to satisfy animator requirements through fine-grained guidance without tedious labor.



Paperid:651
Authors:Jiajun Wei, Hongjian Zhan, Yue Lu, Xiao Tu, Bing Yin, Cong Liu, Umapada Pal
Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, iFLYTEK Research, iFLYTEK Research, Indian Statistical Institute, Kolkata
Abstract:
Scene text recognition is inherently a vision-language task. However, previous works have predominantly focused either on extracting more robust visual features or designing better language modeling. How to effectively and jointly model vision and language to mitigate heavy reliance on a single modality remains a problem. In this paper, aiming to enhance vision-language reasoning in scene text recognition, we present a balanced, unified and synchronized vision-language reasoning network (BUSNet). Firstly, revisiting the image as a language by balanced concatenation along the length dimension alleviates the issue of over-reliance on vision or language. Secondly, BUSNet learns an ensemble of unified external and internal vision-language models with shared weights by masked modality modeling (MMM). Thirdly, a novel vision-language reasoning module (VLRM) with synchronized vision-language decoding capacity is proposed. Additionally, BUSNet achieves improved performance through iterative reasoning, which utilizes the vision-language prediction as a new language input. Extensive experiments indicate that BUSNet achieves state-of-the-art performance on several mainstream benchmark datasets and more challenging datasets for both synthetic and real training data compared to recent outstanding methods. Code and dataset will be available at https://github.com/jjwei66/BUSNet.



Paperid:652
Authors:Jun Wei, S. Kevin Zhou, Shuguang Cui, Zhen Li
FNii, CUHK-Shenzhen, Shenzhen, China SSE, CUHK-Shenzhen, Shenzhen, China, School of Biomedical Engineering & Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, SSE, CUHK-Shenzhen, Shenzhen, China FNii, CUHK-Shenzhen, Shenzhen, China, SSE, CUHK-Shenzhen, Shenzhen, China FNii, CUHK-Shenzhen, Shenzhen, China
Abstract:
Point cloud salient object detection (PCSOD) is a newly proposed task in 3D dense segmentation. However, the acquisition of accurate 3D dense annotations comes at a high cost, severely limiting the progress of PCSOD. To address this issue, we propose the first weakly supervised PCSOD (named WeakPCSOD) model, which relies solely on cheap 3D bounding box annotations. In WeakPCSOD, we extract noise-free supervision from coarse 3D bounding boxes while mitigating shape biases inherent in box annotations. To achieve this, we introduce a novel mask-to-box (M2B) transformation and a color consistency (CC) loss. The M2B transformation, from a shape perspective, disentangles predictions from labels, enabling the extraction of noiseless supervision from labels while preserving object shapes independently of the box bias. From an appearance perspective, we further introduce the CC loss to provide dense supervision, which mitigates the non-unique predictions stemming from weak supervision and substantially reduces prediction variability. Furthermore, we employ a self-training (ST) strategy to enhance performance by utilizing high-confidence pseudo labels. Notably, the M2B transformation, CC loss, and ST strategy can be seamlessly integrated into any model and incur no additional computational cost at inference. Extensive experiments demonstrate the effectiveness of our WeakPCSOD model, even comparable to fully supervised models utilizing dense annotations.
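A minimal sketch of how a color-consistency term could provide dense supervision is shown below; the Gaussian affinity kernel and random pair sampling are assumptions for illustration, not the paper's exact CC loss.

```python
import torch

def color_consistency_loss(colors, logits, sigma=0.1, n_pairs=4096):
    """Sketch of a color-consistency regularizer for point-cloud saliency.

    colors: (N, 3) point RGB values in [0, 1]; logits: (N,) or (N, 1) raw
    saliency scores. Randomly sampled point pairs with similar colors are
    encouraged to receive similar saliency predictions.
    """
    n = colors.shape[0]
    idx = torch.randint(n, (2, n_pairs), device=colors.device)
    i, j = idx[0], idx[1]
    affinity = torch.exp(-((colors[i] - colors[j]) ** 2).sum(-1) / (2 * sigma ** 2))
    p = torch.sigmoid(logits).reshape(-1)
    return (affinity * (p[i] - p[j]) ** 2).mean()
```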



Paperid:653
Authors:Xue Wen, Lianxin Xie, Le Jiang, Tianyi Chen, Si Wu, Cheng Liu, Hau-San Wong
South China University of Technology, South China University of Technology, South China University of Technology, South China University of Technology, South China University of Technology, Shantou University, City University of Hong Kong
Abstract:
Face retouching aims to beautify a face image while preserving the image content as much as possible. It is a promising yet challenging task to remove face imperfections and fill with normal skin. Generic image enhancement methods are hampered by the lack of imperfection localization, which often results in incomplete removal of blemishes at large scales. To address this issue, we propose a transformer-based approach, RetouchFormer, which simultaneously identifies imperfections and synthesizes realistic content in the corresponding regions. Specifically, we learn a latent dictionary to capture the clean face priors, and predict the imperfection regions via a reconstruction-oriented localization module. Also based on this, we can realize face retouching by explicitly suppressing imperfections in our selective self-attention computation, such that local content will be synthesized from normal skin. On the other hand, multi-scale feature tokens lead to increased flexibility in dealing with the imperfections at various scales. The design elements bring greater effectiveness and efficiency. In extensive experiments, RetouchFormer outperforms advanced face retouching methods and synthesizes clean face images with high fidelity.



Paperid:654
Authors:Weixi Weng, Chun Yuan
Tsinghua Shenzhen International Graduate School, Tsinghua University, Tsinghua Shenzhen International Graduate School, Tsinghua University
Abstract:
Unsupervised domain adaptation object detection (UDAOD) research on the Detection Transformer (DETR) mainly focuses on feature alignment, and existing methods can be divided into two kinds, each of which has its unresolved issues. One-stage feature alignment methods can easily lead to performance fluctuation and training stagnation. The two-stage feature alignment method based on the mean teacher comprises a pretraining stage followed by a self-training stage, each facing problems in obtaining a reliable pretrained model and achieving consistent performance gains. The methods mentioned above have not yet explored how to utilize a third related domain, such as a target-like domain, to assist adaptation. To address these issues, we propose a two-stage framework named MTM, i.e. Mean Teacher-DETR with Masked Feature Alignment. In the pretraining stage, we utilize labeled target-like images produced by image style transfer to avoid performance fluctuation. In the self-training stage, we leverage unlabeled target images via pseudo labels based on the mean teacher and propose a module called Object Queries Knowledge Transfer (OQKT) to ensure consistent performance gains of the student model. Most importantly, we propose masked feature alignment methods including Masked Domain Query-based Feature Alignment (MDQFA) and Masked Token-wise Feature Alignment (MTWFA) to alleviate domain shift in a more robust way, which not only prevent training stagnation and lead to a robust pretrained model in the pretraining stage, but also enhance the model's target performance in the self-training stage. Experiments on three challenging scenarios and a theoretical analysis verify the effectiveness of MTM.



Paperid:655
Authors:Tom Nuno Wolf, Fabian Bongratz, Anne-Marie Rickmann, Sebastian Pölsterl, Christian Wachinger
Technical University of Munich Ludwig Maximilians University Munich Munich Center for Machine Learning (MCML), Technical University of Munich Ludwig Maximilians University Munich Munich Center for Machine Learning (MCML), Technical University of Munich Ludwig Maximilians University Munich, Ludwig Maximilians University Munich, Technical University of Munich Ludwig Maximilians University Munich Munich Center for Machine Learning (MCML)
Abstract:
Explaining predictions of black-box neural networks is crucial when applied to decision-critical tasks. Thus, attribution maps are commonly used to identify important image regions, despite prior work showing that humans prefer explanations based on similar examples. To this end, ProtoPNet learns a set of class-representative feature vectors (prototypes) for case-based reasoning. During inference, similarities of latent features to prototypes are linearly classified to form predictions and attribution maps are provided to explain the similarity. In this work, we evaluate whether architectures for case-based reasoning fulfill established axioms required for faithful explanations using the example of ProtoPNet. We show that such architectures allow the extraction of faithful explanations. However, we prove that the attribution maps used to explain the similarities violate the axioms. We propose a new procedure to extract explanations for trained ProtoPNets, named ProtoPFaith. Conceptually, these explanations are Shapley values, calculated on the similarity scores of each prototype. They allow us to faithfully answer which prototypes are present in an unseen image and quantify each pixel’s contribution to that presence, thereby complying with all axioms. The theoretical violations of ProtoPNet manifest in our experiments on three datasets (CUB-200-2011, Stanford Dogs, RSNA) and five architectures (ConvNet, ResNet, ResNet50, WideResNet50, ResNeXt50). Our experiments show a qualitative difference between the explanations given by ProtoPNet and ProtoPFaith. Additionally, we quantify the explanations with the Area Over the Perturbation Curve, on which ProtoPFaith outperforms ProtoPNet on all experiments by a factor >10^3.
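The idea of computing Shapley values on prototype similarity scores can be illustrated with a simple Monte Carlo estimator; the zero baseline, sampling budget, and similarity function below are assumptions, and the paper's estimator may differ.

```python
import torch

def shapley_prototype_contributions(x, prototype, similarity_fn, baseline=None,
                                    n_samples=100, seed=0):
    """Monte Carlo Shapley estimate of per-feature contributions to a
    prototype similarity score (illustrative sketch only).

    Features of `x` are switched on in random orders starting from `baseline`,
    and each feature's average marginal effect on similarity_fn(z, prototype)
    is recorded.
    """
    g = torch.Generator().manual_seed(seed)
    x_flat = x.flatten()
    base = torch.zeros_like(x_flat) if baseline is None else baseline.flatten()
    d = x_flat.numel()
    phi = torch.zeros(d)
    for _ in range(n_samples):
        order = torch.randperm(d, generator=g).tolist()
        z = base.clone()
        prev = similarity_fn(z.view_as(x), prototype)
        for i in order:
            z[i] = x_flat[i]
            cur = similarity_fn(z.view_as(x), prototype)
            phi[i] += (cur - prev).item()
            prev = cur
    return (phi / n_samples).view_as(x)

# usage sketch with a negative-distance similarity:
# phi = shapley_prototype_contributions(latent, proto, lambda a, b: -torch.dist(a, b))
```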



Paperid:656
Authors:Ancong Wu, Wei-Shi Zheng
Sun Yat-sen University, China, Sun Yat-sen University, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China Guangdong Key Laboratory of Information Security Technology, China
Abstract:
Unsupervised disentangled representation learning aims to recover semantically meaningful factors from real-world data without supervision, which is significant for model generalization and interpretability. Current methods mainly rely on assumptions of independence or informativeness of factors, regardless of interpretability. Intuitively, visually interpretable concepts better align with human-defined factors. However, exploiting visual interpretability as inductive bias is still under-explored. Inspired by the observation that most explanatory image factors can be represented by ``content + mask'', we propose a content-mask factorization network (CMFNet) to decompose an image into different groups of content codes and masks, which are further combined as content masks to represent different visual concepts. To ensure informativeness of the representations, the CMFNet is jointly learned with a generator conditioned on the content masks for reconstructing the input image. The conditional generator employs a diffusion model to leverage its robust distribution modeling capability. Our model is called the Factorized Diffusion Autoencoder (FDAE). To enhance disentanglement of visual concepts, we propose a content decorrelation loss and a mask entropy loss to decorrelate content masks in latent space and spatial space, respectively. Experiments on Shapes3d, MPI3D and Cars3d show that our method achieves advanced performance and can generate visually interpretable concept-specific masks. Source code and supplementary materials are available at https://github.com/wuancong/FDAE.



Paperid:657
Authors:Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji, Xiaoshuai Sun
Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University
Abstract:
In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions. However, this conventional paradigm encounters significant challenges, most notably in terms of the generation of lackluster initial proposals and a pronounced deceleration in inference speed. Recognizing these limitations, we introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) that is enriched by dependency-driven insights. One of the keystones of our model is the Superpoint-Text Matching (STM) mechanism. Unlike traditional methods that navigate through instance proposals, STM directly correlates linguistic indications with their respective superpoints, clusters of semantically related points. This architectural decision empowers our model to efficiently harness cross-modal semantic relationships, primarily leveraging densely annotated superpoint-text pairs, as opposed to the more sparse instance-text pairs. In pursuit of enhancing the role of text in guiding the segmentation process, we further incorporate the Dependency-Driven Interaction (DDI) module to deepen the network's semantic comprehension of referring expressions. Using the dependency trees as a beacon, this module discerns the intricate relationships between primary terms and their associated descriptors in expressions, thereby elevating both the localization and segmentation capacities. Comprehensive experiments on the ScanRefer benchmark reveal that our model not only sets new performance standards, registering an mIoU gain of 11.7 points, but also achieves a staggering enhancement in inference speed, surpassing traditional methods by 95.7 times. The code and models are available at https://github.com/sosppxo/3D-STMN.



Paperid:658
Authors:Cong Wu, Xiao-Jun Wu, Josef Kittler, Tianyang Xu, Sara Ahmed, Muhammad Awais, Zhenhua Feng
Jiangnan University, University of Surrey, Jiangnan University, University of Surrey, Jiangnan University, University of Surrey, University of Surrey, University of Surrey
Abstract:
Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode the skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of representation. Instead, this paper introduces a novel contrastive learning framework, namely Spatiotemporal Clues Disentanglement Network (SCD-Net). Specifically, we integrate the decoupling module with a feature extractor to derive explicit clues from spatial and temporal domains respectively. As for the training of SCD-Net, with a constructed global anchor, we encourage the interaction between the anchor and extracted clues. Further, we propose a new masking strategy with structural constraints to strengthen the contextual associations, leveraging the latest development from masked image modelling into the proposed SCD-Net. We conduct extensive evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning. The experimental results demonstrate the effectiveness of our method, which outperforms the existing state-of-the-art (SOTA) approaches significantly. Our code and supplementary material can be found at https://github.com/cong-wu/SCD-Net.



Paperid:659
Authors:Fan Wu, Jinling Gao, Lanqing Hong, Xinbing Wang, Chenghu Zhou, Nanyang Ye
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Huawei Noah's Ark Lab, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
In this paper, we focus on a realistic yet challenging task, Single Domain Generalization Object Detection (S-DGOD), where only one source domain's data can be used for training object detectors, which nonetheless have to generalize to multiple distinct target domains. In S-DGOD, both high-capacity fitting and generalization abilities are needed due to the task's complexity. Differentiable Neural Architecture Search (NAS) is known for its high capacity for complex data fitting, and we propose to leverage Differentiable NAS to solve S-DGOD. However, it may confront severe over-fitting issues due to the feature imbalance phenomenon, where parameters optimized by gradient descent are biased to learn from the easy-to-learn features, which are usually non-causal and spuriously correlated to ground truth labels, such as the features of background in object detection data. Consequently, this leads to serious performance degradation, especially in generalizing to unseen target domains with huge domain gaps between the source domain and target domains. To address this issue, we propose the Generalizable loss (G-loss), which is an OoD-aware objective, preventing NAS from over-fitting by using gradient descent to optimize parameters not only on a subset of easy-to-learn features but also the remaining predictive features for generalization, and the overall framework is named G-NAS. Experimental results on the S-DGOD urban-scene datasets demonstrate that the proposed G-NAS achieves SOTA performance compared to baseline methods. Codes are available at https://github.com/wufan-cse/G-NAS.



Paperid:660
Authors:Fuzhi Wu, Jiasong Wu, Youyong Kong, Chunfeng Yang, Guanyu Yang, Huazhong Shu, Guy Carrault, Lotfi Senhadji
Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University) Laboratoire Traitement du Signal et de l'Image (Univ Rennes) Centre de Recherche en Information Biomédicale Sino-français (CRIBs) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing (Southeast University), Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University) Centre de Recherche en Information Biomédicale Sino-français (CRIBs) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing (Southeast University), Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University) Centre de Recherche en Information Biomédicale Sino-français (CRIBs) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing (Southeast University), Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University) Centre de Recherche en Information Biomédicale Sino-français (CRIBs) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing (Southeast University), Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University) Centre de Recherche en Information Biomédicale Sino-français (CRIBs) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing (Southeast University), Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University) Centre de Recherche en Information Biomédicale Sino-français (CRIBs) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing (Southeast University), Laboratoire Traitement du Signal et de l'Image (Univ Rennes) Centre de Recherche en Information Biomédicale Sino-français (CRIBs) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing (Southeast University), Laboratoire Traitement du Signal et de l'Image (Univ Rennes) Centre de Recherche en Information Biomédicale Sino-français (CRIBs) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing (Southeast University)
Abstract:
Deep learning and Convolutional Neural Networks (CNNs) have driven major transformations in diverse research areas. However, their limitations in handling low-frequency information present obstacles in certain tasks like interpreting global structures or managing smooth transition images. Despite the promising performance of transformer structures in numerous tasks, their intricate optimization complexities highlight the persistent need for refined CNN enhancements using limited resources. Responding to these complexities, we introduce a novel framework, the Multiscale Low-Frequency Memory (MLFM) Network, with the goal to harness the full potential of CNNs while keeping their complexity unchanged. The MLFM efficiently preserves low-frequency information, enhancing performance in targeted computer vision tasks. Central to our MLFM is the Low-Frequency Memory Unit (LFMU), which stores various low-frequency data and forms a parallel channel to the core network. A key advantage of MLFM is its seamless compatibility with various prevalent networks, requiring no alterations to their original core structure. Testing on ImageNet demonstrated substantial accuracy improvements in multiple 2D CNNs, including ResNet, MobileNet, EfficientNet, and ConvNeXt. Furthermore, we showcase MLFM's versatility beyond traditional image classification by successfully integrating it into image-to-image translation tasks, specifically in semantic segmentation networks like FCN and U-Net. In conclusion, our work signifies a pivotal stride in the journey of optimizing the efficacy and efficiency of CNNs with limited resources. This research builds upon the existing CNN foundations and paves the way for future advancements in computer vision. Our codes are available at https://github.com/AlphaWuSeu/MLFM.



Paperid:661
Authors:Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu
Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology
Abstract:
Contrastive learning has emerged as a prevailing paradigm for high-level vision tasks, which, by introducing properly negative samples, has also been exploited for low-level vision tasks to achieve a compact optimization space to account for their ill-posed nature. However, existing methods rely on manually predefined and task-oriented negatives, which often exhibit pronounced task-specific biases. To address this challenge, our paper introduces an innovative method termed 'learning from history', which dynamically generates negative samples from the target model itself. Our approach, named Model Contrastive Learning for Image Restoration (MCLIR), rejuvenates latency models as negative models, making it compatible with diverse image restoration tasks. We propose the Self-Prior guided Negative loss (SPN) to enable it. This approach significantly enhances existing models when retrained with the proposed model contrastive paradigm. The results show significant improvements in image restoration across various tasks and architectures. For example, models retrained with SPN outperform the original FFANet and DehazeFormer by 3.41 and 0.57 dB on the RESIDE indoor dataset for image dehazing. Similarly, they achieve notable improvements of 0.47 dB on SPA-Data over IDT for image deraining and 0.12 dB on Manga109 for a 4x scale super-resolution over lightweight SwinIR, respectively. Code and retrained models are available at https://github.com/Aitical/MCLIR.
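One plausible reading of the SPN idea is a contrastive ratio that pulls the restored image toward the ground truth while pushing it away from the output of a frozen historical copy of the model; the ratio form below is an assumption, not necessarily the paper's exact loss.

```python
import torch.nn.functional as F

def spn_loss(restored, target, negative, eps=1e-8):
    """Sketch of a self-prior guided negative (contrastive) loss for restoration.

    `negative` is the output of a frozen historical copy of the target model
    (the "latency model" used as the negative). The restored image is pulled
    toward the ground truth and pushed away from the negative prediction.
    """
    pos = F.l1_loss(restored, target)
    neg = F.l1_loss(restored, negative.detach())
    return pos / (neg + eps)
```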



Paperid:662
Authors:Guanyao Wu, Hongming Fu, Jinyuan Liu, Long Ma, Xin Fan, Risheng Liu
Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology
Abstract:
Multi-exposure image fusion (MEF) has emerged as a prominent solution to address the limitations of digital imaging in representing varied exposure levels. Despite its advancements, the field grapples with challenges, notably the reliance on manual designs for network structures and loss functions, and the constraints of utilizing simulated reference images as ground truths. Consequently, current methodologies often suffer from color distortions and exposure artifacts, further complicating the quest for authentic image representation. In addressing these challenges, this paper presents a Hybrid-Supervised Dual-Search approach for MEF, dubbed HSDS-MEF, which introduces a bi-level optimization search scheme for automatic design of both network structures and loss functions. More specifically, we harness a unique dual-search mechanism rooted in a novel weighted structure refinement architecture search. Besides, a hybrid supervised contrast constraint seamlessly guides and integrates with the searching process, facilitating a more adaptive and comprehensive search for optimal loss functions. We realize the state-of-the-art performance in comparison to various competitive schemes, yielding a 10.61% and 4.38% improvement in Visual Information Fidelity (VIF) for general and no-reference scenarios, respectively, while providing results with high contrast, rich details and colors. The code is available at https://github.com/RollingPlain/HSDS_MEF.



Paperid:663
Authors:Haihang Wu, Wei Wang, Tamasha Malepathirana, Damith Senanayake, Denny Oetomo, Saman Halgamuge
Department of Mechanical Engineering, The University of Melbourne, Department of Mechanical Engineering, The University of Melbourne, Department of Mechanical Engineering, The University of Melbourne, Department of Mechanical Engineering, The University of Melbourne, Department of Mechanical Engineering, The University of Melbourne, Department of Mechanical Engineering, The University of Melbourne
Abstract:
Neural growth is the process of growing a small neural network to a large network and has been utilized to accelerate the training of deep neural networks. One crucial aspect of neural growth is determining the optimal growth timing. However, few studies investigate this systematically. Our study reveals that neural growth inherently exhibits a regularization effect, whose intensity is influenced by the chosen policy for growth timing. While this regularization effect may mitigate the overfitting risk of the model, it may lead to a notable accuracy drop when the model underfits. Yet, current approaches have not addressed this issue due to their lack of consideration of the regularization effect from neural growth. Motivated by these findings, we propose an under/over-fitting risk-aware growth timing policy, which automatically adjusts the growth timing informed by the level of potential under/overfitting risks to address both risks. Comprehensive experiments conducted using CIFAR-10/100 and ImageNet datasets show that the proposed policy achieves accuracy improvements of up to 1.3% in models prone to underfitting while achieving similar accuracies in models suffering from overfitting compared to the existing methods.



Paperid:664
Authors:Haoyuan Wu, Xinyun Zhang, Peng Xu, Peiyu Liao, Xufeng Yao, Bei Yu
The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong
Abstract:
Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks. In light of the rapidly increasing size of pre-trained VLMs, parameter-efficient transfer learning (PETL) has garnered attention as a viable alternative to full fine-tuning. One such approach is the adapter, which introduces a few trainable parameters into the pre-trained models while preserving the original parameters during adaptation. In this paper, we present a novel modeling framework that recasts adapter tuning after attention as a graph message passing process on attention graphs, where the projected query and value features and attention matrix constitute the node features and the graph adjacency matrix, respectively. Within this framework, tuning adapters in VLMs necessitates handling heterophilic graphs, owing to the disparity between the projected query and value space. To address this challenge, we propose a new adapter architecture, p-adapter, which employs p-Laplacian message passing in Graph Neural Networks (GNNs). Specifically, the attention weights are re-normalized based on the features, and the features are then aggregated using the calibrated attention matrix, enabling the dynamic exploitation of information with varying frequencies in the heterophilic attention graphs. We conduct extensive experiments on different pre-trained VLMs and multi-modal tasks, including visual question answering, visual entailment, and image captioning. The experimental results validate our method's significant superiority over other PETL methods. Our code is available at https://github.com/wuhy68/p-Adapter/.
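The re-normalization step can be sketched as p-Laplacian-style message passing over a single attention head; the simplified form below (dense pairwise distances, row-wise re-normalization) is an illustration rather than the paper's adapter design.

```python
import torch

def p_laplacian_aggregate(attn, values, p=1.5, eps=1e-6):
    """Sketch of p-Laplacian-style re-normalization of an attention graph.

    attn:   (N, N) non-negative attention weights (graph adjacency)
    values: (N, D) projected value features (node features)
    Edges are re-weighted by the pairwise feature-difference norm raised to
    (p - 2), re-normalized row-wise, then used to aggregate the values.
    """
    diff = torch.cdist(values, values) + eps          # (N, N) pairwise distances
    w = attn * diff.pow(p - 2)                        # p-Laplacian edge weights
    w = w / w.sum(dim=-1, keepdim=True).clamp_min(eps)
    return w @ values                                 # (N, D) aggregated features
```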



Paperid:665
Authors:Jiamin Wu, Xin Liu, Xiaotian Yin, Tianzhu Zhang, Yongdong Zhang
Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China, Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China
Abstract:
Cross-Domain Few-Shot Learning (CD-FSL) aims at recognizing samples in novel classes from unseen domains that are vastly different from training classes, with few labeled samples. However, the large domain gap between training and novel classes makes previous FSL methods perform poorly. To address this issue, we propose MetaPrompt, a Task-adaptive Prompted Transformer model for CD-FSL, by jointly exploiting prompt learning and the parameter generation framework. The proposed MetaPrompt enjoys several merits. First, a task-conditioned prompt generator is established upon attention mechanisms. It can flexibly produce a task-adaptive prompt with arbitrary length for unseen tasks, by selectively gathering task characteristics from the contextualized support embeddings. Second, the task-adaptive prompt is attached to Vision Transformer to facilitate fast task adaptation, steering the task-agnostic representation to incorporate task knowledge. To the best of our knowledge, this is the first work to exploit a prompt-based parameter generation mechanism for CD-FSL. Extensive experimental results on the Meta-Dataset benchmark demonstrate that our method achieves superior results against state-of-the-art methods.



Paperid:666
Authors:Jie Wu, Yuchao Feng, Honghui Xu, Chuanmeng Zhu, Jianwei Zheng
Zhejiang University of Technology, Zhejiang University of Technology, Zhejiang University of Technology, Zhejiang University, Zhejiang University of Technology
Abstract:
Image inpainting is flourishing with the progress of convolutional neural networks (CNNs) and transformers, revolutionizing practical applications such as abnormity removal and image editing. However, due to ever-mounting image resolutions and missing areas, the challenges of distorted long-range dependencies from cluttered background distributions and reduced reference information in the image domain inevitably arise, which further cause severe performance degradation. To address these challenges, we propose a novel large-portion image inpainting approach, namely the Structure-Guided Synergism Transformer (SyFormer), to rectify the discrepancies in feature representation and enrich the structural cues from limited reference. Specifically, we devise a dual-routing filtering module that employs a progressive filtering strategy to eliminate invalid noise interference and establish global-level texture correlations. Simultaneously, a structurally compact perception module maps an affinity matrix within the structural priors introduced by a structure-aware generator, assisting in matching and filling the corresponding patches of images with large damaged proportions. Moreover, we carefully assemble the aforementioned modules to achieve feature complementarity. Finally, a feature decoding alignment scheme is introduced in the decoding process, which meticulously achieves texture amalgamation across hierarchical features. Extensive experiments are conducted on two publicly available datasets, i.e., CelebA-HQ and Places2, to qualitatively and quantitatively demonstrate the superiority of our model over state-of-the-art methods.



Paperid:667
Authors:Junde Wu, Wei Ji, Huazhu Fu, Min Xu, Yueming Jin, Yanwu Xu
University of Oxford National University of Singapore Mohamed bin Zayed University of Artificial Intelligence Kids with Tokens, University of Alberta, Institute of High Performance Computing, A*STAR, Carnegie Mellon University Mohamed bin Zayed University of Artificial Intelligence, National University of Singapore, Singapore Eye Research Institute
Abstract:
The Diffusion Probabilistic Model (DPM) has recently gained popularity in the field of computer vision, thanks to its image generation applications, such as Imagen, Latent Diffusion Models, and Stable Diffusion, which have demonstrated impressive capabilities and sparked much discussion within the community. Recent investigations have further unveiled the utility of DPM in the domain of medical image analysis, as underscored by the commendable performance exhibited by medical image segmentation models across various tasks. Although these models were originally underpinned by a UNet architecture, there exists a potential avenue for enhancing their performance through the integration of vision transformer mechanisms. However, we discovered that simply combining these two models resulted in subpar performance. To effectively integrate these two cutting-edge techniques for medical image segmentation, we propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2. We verify its effectiveness on 20 medical image segmentation tasks with different image modalities. Through comprehensive evaluation, our approach demonstrates superiority over prior state-of-the-art (SOTA) methodologies. Code is released at https://github.com/KidsWithTokens/MedSegDiff.



Paperid:668
Authors:Junyi Wu, Yan Huang, Min Gao, Yuzhen Niu, Mingjing Yang, Zhipeng Gao, Jianqiang Zhao
AI Research Center, Xiamen Meiya Pico Information Company Ltd., Xiamen, China Xiamen Meiya Pico Information Security Research Institute Company Ltd., Xiamen, China College of Computer and Data Science, Fuzhou University, Fuzhou, China, Institute of Automation, Chinese Academy of Sciences, Beijing China, College of Physics and Information Engineering, Fuzhou University, Fuzhou, China, College of Computer and Data Science, Fuzhou University, Fuzhou, China, College of Physics and Information Engineering, Fuzhou University, Fuzhou, China, AI Research Center, Xiamen Meiya Pico Information Company Ltd., Xiamen, China Xiamen Meiya Pico Information Security Research Institute Company Ltd., Xiamen, China, AI Research Center, Xiamen Meiya Pico Information Company Ltd., Xiamen, China Xiamen Meiya Pico Information Security Research Institute Company Ltd., Xiamen, China
Abstract:
Pedestrian Attribute Recognition (PAR) involves identifying the attributes of individuals in person images. Existing PAR methods typically rely on CNNs as the backbone network to extract pedestrian features. However, CNNs process only one adjacent region at a time, leading to the loss of long-range inter-relations between different attribute-specific regions. To address this limitation, we leverage the Vision Transformer (ViT) instead of CNNs as the backbone for PAR, aiming to model long-range relations and extract more robust features. However, PAR suffers from an inherent attribute imbalance issue, causing ViT to naturally focus more on attributes that appear frequently in the training set and ignore some pedestrian attributes that appear less frequently. The native features extracted by ViT are not able to tolerate the imbalanced attribute distribution issue. To tackle this issue, we propose two novel components: the Selective Feature Activation Method (SFAM) and the Orthogonal Feature Activation Loss. SFAM smartly suppresses the more informative attribute-specific features, compelling the PAR model to capture discriminative features from regions that are easily overlooked. The proposed loss enforces an orthogonal constraint on the original feature extracted by ViT and the suppressed features from SFAM, promoting the complementarity of features in space. We conduct experiments on several benchmark PAR datasets, including PETA, PA100K, RAPv1, and RAPv2, demonstrating the effectiveness of our method. Specifically, our method outperforms existing state-of-the-art approaches, including GRL, IAA-Caps, ALM, and SSC, in terms of mA on the four datasets.
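The orthogonality constraint can be illustrated with a cosine-similarity penalty between the backbone features and the SFAM-suppressed features; the exact form of the paper's loss may differ.

```python
import torch.nn.functional as F

def orthogonal_feature_loss(original_feat, suppressed_feat):
    """Sketch of an orthogonality constraint between two feature branches.

    original_feat:   (B, D) features extracted by the ViT backbone
    suppressed_feat: (B, D) features after suppressing the most informative
                     attribute-specific activations (the SFAM output)
    Penalizing the absolute cosine similarity pushes the two feature sets
    to encode complementary attribute cues.
    """
    a = F.normalize(original_feat, dim=-1)
    b = F.normalize(suppressed_feat, dim=-1)
    return (a * b).sum(dim=-1).abs().mean()
```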



Paperid:669
Authors:Ke Wu, Kaizhao Zhang, Mingzhe Gao, Jieru Zhao, Zhongxue Gan, Wenchao Ding
Fudan University, Harbin Institute of Technology, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Fudan University, Fudan University
Abstract:
Online dense mapping of urban scenes is of paramount importance for scene understanding of autonomous navigation. Traditional online dense mapping methods fuse sensor measurements (vision, lidar, etc.) across time and space via explicit geometric correspondence. Recently, NeRF-based methods have proved the superiority of neural implicit representations by high-fidelity reconstruction of large-scale city scenes. However, it remains an open problem how to integrate powerful neural implicit representations into online dense mapping. Existing methods are restricted to constrained indoor environments and are too computationally expensive to meet online requirements. To this end, we propose Swift-Mapping, an online neural implicit dense mapping framework in urban scenes. We introduce a novel neural implicit octomap (NIO) structure that provides efficient neural representation for large and dynamic urban scenes while retaining online update capability. Based on that, we propose an online neural dense mapping framework that effectively manages and updates neural octree voxel features. Our approach achieves SOTA reconstruction accuracy while being more than 10x faster in reconstruction speed, demonstrating the superior performance of our method in both accuracy and efficiency.



Paperid:670
Authors:Longhuang Wu, Shangxuan Tian, Youxin Wang, Pengfei Xiong
Shopee Pte. Ltd., Shopee Pte. Ltd., Shopee Pte. Ltd., Shopee Pte. Ltd.
Abstract:
Existing methods for scene text detection can be divided into two paradigms: segmentation-based and anchor-based. While segmentation-based methods are well-suited for irregular shapes, they struggle with compact or overlapping layouts. Conversely, anchor-based approaches excel at complex layouts but struggle with irregular shapes. To strengthen their merits and overcome their respective demerits, we propose a Complementary Proposal Network (CPN) that seamlessly and parallelly integrates semantic and geometric information for superior performance. The CPN comprises two efficient networks for proposal generation: the Deformable Morphology Semantic Network, which generates semantic proposals employing an innovative deformable morphological operator, and the Balanced Region Proposal Network, which produces geometric proposals with pre-defined anchors. To further enhance the complementarity, we introduce an Interleaved Feature Attention module that enables semantic and geometric features to interact deeply before proposal generation. By leveraging both complementary proposals and features, CPN outperforms state-of-the-art approaches with significant margins under comparable computation cost. Specifically, our approach achieves improvements of 3.6%, 1.3% and 1.0% on challenging benchmarks ICDAR19-ArT, IC15, and MSRA-TD500, respectively. Code for our method will be released.



Paperid:671
Authors:Mingrui Wu, Yuqi Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University, China
Abstract:
This work is oriented toward the task of open-set Human Object Interaction (HOI) detection. The challenge lies in identifying completely new, out-of-domain relationships, as opposed to in-domain ones which have seen improvements in zero-shot HOI detection. To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model for detecting novel relationships by integrating an open-set object detector with a Visual Language Model (VLM). We utilize a disentangled image-text contrastive learning metric for training and connect the bottom-up visual features to text embeddings through lightweight unary and pair-wise adapters. Our model can benefit from the open-set object detector and the VLM to detect novel action categories and combine actions with novel object categories. We further present the VG-HOI dataset, a comprehensive benchmark with over 17k HOI relationships for open-set scenarios. Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes.



Paperid:672
Authors:Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, Yanning Zhang
Northwestern Polytechnical University, Northwestern Polytechnical University, Singapore Management University, Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University
Abstract:
The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide range of image-level tasks, revealing remarkable ability for learning powerful visual representations with rich semantics. An open and worthwhile problem is efficiently adapting such a strong model to the video domain and designing a robust video anomaly detector. In this work, we propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) by leveraging the frozen CLIP model directly without any pre-training and fine-tuning process. Unlike current works that directly feed extracted features into the weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP and involves a dual branch. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of the dual branch, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pre-trained knowledge from CLIP to the WSVAD task. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features are released at https://github.com/nwpu-zxr/VadCLIP.



Paperid:673
Authors:Pengfei Wu, Le Wang, Sanping Zhou, Gang Hua, Changyin Sun
Xi’an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Wormpex AI Research, Anhui University
Abstract:
Video Person Re-Identification (Re-ID) is a task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module is used to aggregate frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID.
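A minimal sketch of score-guided temporal aggregation in the spirit of LTA is given below; the scorer architecture and softmax normalization are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class LearnableTemporalAggregation(nn.Module):
    """Minimal sketch of score-guided temporal aggregation for video Re-ID.

    Frame-level features (B, T, D) are weighted by a lightweight learnable
    scorer and summed into a single clip descriptor.
    """
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                    nn.Linear(dim // 4, 1))

    def forward(self, frame_feats):                               # (B, T, D)
        weights = torch.softmax(self.scorer(frame_feats), dim=1)  # (B, T, 1)
        return (weights * frame_feats).sum(dim=1)                 # (B, D)
```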



Paperid:674
Authors:Qiaoyun Wu, Quanxiao Zhang, Chunyu Tan, Yun Zhou, Changyin Sun
School of Artificial Intelligence, Anhui University Engineering Research Center of Autonomous Unmanned System Technology, Ministry of Education Anhui Provincial Engineering Research Center for Unmanned System and Intelligent Technology, School of Artificial Intelligence, Anhui University Engineering Research Center of Autonomous Unmanned System Technology, Ministry of Education Anhui Provincial Engineering Research Center for Unmanned System and Intelligent Technology, School of Artificial Intelligence, Anhui University Engineering Research Center of Autonomous Unmanned System Technology, Ministry of Education Anhui Provincial Engineering Research Center for Unmanned System and Intelligent Technology, School of Artificial Intelligence, Anhui University Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, School of Artificial Intelligence, Anhui University Engineering Research Center of Autonomous Unmanned System Technology, Ministry of Education Anhui Provincial Engineering Research Center for Unmanned System and Intelligent Technology
Abstract:
Spiking neural networks (SNNs) have revolutionized neural learning and are making remarkable strides in image analysis and robot control tasks with ultra-low power consumption advantages. Inspired by this success, we investigate the application of spiking neural networks to 3D point cloud processing. We present a point-to-spike residual learning network for point cloud classification, which operates on points with binary spikes rather than floating-point numbers. Specifically, we first design a spatial-aware kernel point spiking neuron to relate spiking generation to point position in 3D space. On this basis, we then design a 3D spiking residual block for effective feature learning based on spike sequences. By stacking the 3D spiking residual blocks, we build the point-to-spike residual classification network, which achieves low computation cost and low accuracy loss on two benchmark datasets, ModelNet40 and ScanObjectNN. Moreover, the classifier strikes a good balance between classification accuracy and biological characteristics, allowing us to explore the deployment of 3D processing to neuromorphic chips for developing energy-efficient 3D robotic perception systems.
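For readers unfamiliar with spiking dynamics, the following textbook leaky integrate-and-fire update shows the basic building block such networks rely on; it is generic and does not implement the paper's spatial-aware kernel point neuron.

```python
def lif_step(v, current, threshold=1.0, decay=0.5):
    """One step of a leaky integrate-and-fire neuron on tensor inputs.

    v:       membrane potential from the previous timestep
    current: input current (e.g., a position-weighted sum of incoming spikes)
    The potential decays, integrates the input, emits a binary spike when it
    crosses the threshold, and is hard-reset after firing.
    """
    v = decay * v + current
    spike = (v >= threshold).float()
    v = v * (1.0 - spike)          # hard reset where a spike was emitted
    return v, spike
```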



Paperid:675
Authors:Renjie Wu, Hu Wang, Feras Dayoub, Hsiang-Ting Chen
The University of Adelaide, The University of Adelaide, The University of Adelaide, The University of Adelaide
Abstract:
Augmented Reality (AR) devices, emerging as prominent mobile interaction platforms, face challenges in user safety, particularly concerning oncoming vehicles. While some solutions leverage onboard camera arrays, these cameras often have a limited field-of-view (FoV) with front or downward perspectives. Addressing this, we propose a new out-of-view semantic segmentation task and Segment Beyond View (SBV), a novel audio-visual semantic segmentation method. SBV supplements the visual modality, which misses the information beyond the FoV, with auditory information using a teacher-student distillation model (Omni2Ego). The model consists of a vision teacher utilising panoramic information, an auditory teacher with 8-channel audio, and an audio-visual student that takes views with limited FoV and binaural audio as input and produces semantic segmentation for objects outside the FoV. SBV outperforms existing models in comparative evaluations and shows consistent performance across varying FoV ranges and in monaural audio settings.



Paperid:676
Authors:Shangbo Wu, Yu-an Tan, Yajie Wang, Ruinan Ma, Wencong Ma, Yuanzhang Li
Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology
Abstract:
Adversarial transferability enables black-box attacks on unknown victim deep neural networks (DNNs), rendering attacks viable in real-world scenarios. Current transferable attacks create adversarial perturbation over the entire image, resulting in excessive noise that overfits the source model. Concentrating perturbation to dominant image regions that are model-agnostic is crucial to improving adversarial efficacy. However, limiting perturbation to local regions in the spatial domain proves inadequate in augmenting transferability. To this end, we propose a transferable adversarial attack with fine-grained perturbation optimization in the frequency domain, creating centralized perturbation. We devise a systematic pipeline to dynamically constrain perturbation optimization to dominant frequency coefficients. The constraint is optimized in parallel at each iteration, ensuring the directional alignment of perturbation optimization with model prediction. Our approach allows us to centralize perturbation towards sample-specific important frequency features, which are shared by DNNs, effectively mitigating source model overfitting. Experiments demonstrate that by dynamically centralizing perturbation on dominating frequency coefficients, crafted adversarial examples exhibit stronger transferability and are able to bypass various defenses.
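The frequency-domain constraint can be sketched by masking a perturbation's spectrum to its largest coefficients; the FFT below is a stand-in for the paper's transform, and the fixed keep ratio is an assumption, whereas the actual method optimizes the constraint at each iteration.

```python
import torch

def centralize_perturbation(delta, keep_ratio=0.25):
    """Sketch: restrict a perturbation to its dominant frequency coefficients.

    delta: (C, H, W) adversarial perturbation. Only the largest-magnitude
    `keep_ratio` of spectral coefficients per channel is kept before the
    spectrum is inverted back to the image domain.
    """
    spec = torch.fft.fft2(delta)                     # (C, H, W) complex spectrum
    mag = spec.abs().flatten(-2)                     # (C, H*W)
    k = max(1, int(keep_ratio * mag.shape[-1]))
    cutoff = mag.topk(k, dim=-1).values[..., -1:]    # per-channel threshold
    mask = (spec.abs() >= cutoff[..., None]).to(delta.dtype)
    return torch.fft.ifft2(spec * mask).real
```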



Paperid:677
Authors:Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Wentao Liu, Chen Change Loy
Nanyang Technological University, Nanyang Technological University, The Chinese University of Hong Kong, The University of Hong Kong SenseTime Research and Tetras.AI, SenseTime Research and Tetras.AI Shanghai AI Laboratory, Nanyang Technological University
Abstract:
Detecting objects accurately from a large or open vocabulary necessitates the vision-language alignment on region representations. However, learning such a region-text alignment by obtaining high-quality box annotations with text labels or descriptions is expensive and infeasible. In contrast, collecting image-text pairs is simpler but lacks precise object location information to associate regions with texts. In this paper, we propose a novel approach called Contrastive Language-Image Mosaic (CLIM), which leverages large-scale image-text pairs effectively for aligning region and text representations. CLIM combines multiple images into a mosaicked image and treats each image as a ‘pseudo region’. The feature of each pseudo region is extracted and trained to be similar to the corresponding text embedding while dissimilar from others by a contrastive loss, enabling the model to learn the region-text alignment without costly box annotations. As a generally applicable approach, CLIM consistently improves different open-vocabulary object detection methods that use caption supervision. Furthermore, CLIM can effectively enhance the region representation of vision-language models, thus providing stronger backbones for open-vocabulary object detectors. Our experimental results demonstrate that CLIM improves different baseline open-vocabulary object detectors by a large margin on both OV-COCO and OV-LVIS benchmarks. The code is available at https://github.com/wusize/CLIM.
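The pseudo-region objective reduces to a standard symmetric InfoNCE loss once region features have been pooled from the mosaic; the sketch below assumes that pooling has already been done and is not necessarily CLIM's exact implementation.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive(region_feats, text_feats, temperature=0.07):
    """Sketch of a pseudo-region contrastive objective used with mosaics.

    region_feats: (N, D) features pooled from each sub-image of a mosaic
    text_feats:   (N, D) embeddings of the matching captions
    A symmetric InfoNCE loss aligns region i with caption i and repels the
    other captions; mosaic assembly and region pooling are omitted.
    """
    img = F.normalize(region_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```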



Paperid:678
Authors:Tao Wu, Xuewei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li
College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, ARC Lab, Tencent PCG, Gaoling School of Artificial Intelligence, Renmin University of China, ARC Lab, Tencent PCG, ARC Lab, Tencent PCG, College of Computer Science and Technology, Zhejiang University Zhejiang – Singapore Innovation and AI Joint Research Lab, Hangzhou
Abstract:
Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce a novel framework, SphereDiffusion, to address these unique challenges and better generate high-quality, precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these specific techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation and reduces FID by around 35% on average.



Paperid:679
Authors:Tao Wu, Tie Luo, Donald C. Wunsch II
Department of Computer Science, Missouri University of Science and Technology, Department of Computer Science, Missouri University of Science and Technology, Department of Electrical and Computer Engineering, Missouri University of Science and Technology
Abstract:
The transferability of adversarial examples is of central importance to transfer-based black-box adversarial attacks. Previous works on generating transferable adversarial examples focus on attacking given pretrained surrogate models, while the connections between surrogate models and adversarial transferability have been overlooked. In this paper, we propose Lipschitz Regularized Surrogate (LRS) for transfer-based black-box attacks, a novel approach that transforms surrogate models towards favorable adversarial transferability. Using such transformed surrogate models, any existing transfer-based black-box attack can run without any change, yet achieves much better performance. Specifically, we impose Lipschitz regularization on the loss landscape of surrogate models to enable a smoother and more controlled optimization process for generating more transferable adversarial examples. In addition, this paper also sheds light on the connection between the inner properties of surrogate models and adversarial transferability, where three factors are identified: a smaller local Lipschitz constant, a smoother loss landscape, and stronger adversarial robustness. We evaluate our proposed LRS approach by attacking state-of-the-art standard deep neural networks and defense models. The results demonstrate significant improvements in attack success rates and transferability. Our code is available at https://github.com/TrustAIoT/LRS.
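A minimal sketch of smoothing a surrogate's loss landscape with a gradient-norm (local Lipschitz) penalty while fine-tuning it, which is one simple way to realize the properties the abstract identifies; the exact regularizer used by LRS may differ, and the weight `lam` is a placeholder.

```python
import torch
import torch.nn.functional as F

def lipschitz_regularized_loss(model, x, y, lam=0.1):
    """Cross-entropy plus an input-gradient-norm penalty (sketch).

    Penalizing ||grad_x L|| encourages a smaller local Lipschitz constant and a
    smoother loss landscape around the data, which the paper links to better
    adversarial transferability of attacks crafted on the surrogate.
    """
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x, create_graph=True)[0]
    penalty = grad.flatten(1).norm(dim=1).pow(2).mean()
    return loss + lam * penalty
```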



Paperid:680
Authors:Tao Wu, Tie Luo, Donald C. Wunsch II
Department of Computer Science, Missouri University of Science and Technology, Department of Computer Science, Missouri University of Science and Technology, Department of Electrical and Computer Engineering, Missouri University of Science and Technology
Abstract:
The capacity to generalize to future unseen data stands as one of the most crucial attributes of deep neural networks. Sharpness-Aware Minimization (SAM) aims to enhance generalizability by minimizing the worst-case loss using one-step gradient ascent as an approximation. However, as training progresses, the non-linearity of the loss landscape increases, rendering one-step gradient ascent less effective. On the other hand, multi-step gradient ascent incurs higher training cost. In this paper, we introduce a normalized Hessian trace to accurately measure the curvature of the loss landscape on both training and test sets. In particular, to counter excessive non-linearity of the loss landscape, we propose Curvature Regularized SAM (CR-SAM), integrating the normalized Hessian trace as a SAM regularizer. Additionally, we present an efficient way to compute the trace via finite differences with parallelism. Our theoretical analysis based on PAC-Bayes bounds establishes the regularizer's efficacy in reducing generalization error. Empirical evaluation on CIFAR and ImageNet datasets shows that CR-SAM consistently enhances classification performance for ResNet and Vision Transformer (ViT) models across various datasets. Our code is available at https://github.com/TrustAIoT/CR-SAM.
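A minimal sketch of estimating the Hessian trace of the loss via finite differences: a Hutchinson-style estimate tr(H) ≈ E[vᵀHv] with Rademacher vectors v, where Hv is approximated by a central difference of gradients. The normalization and the way the trace enters the SAM objective in CR-SAM are not reproduced here; step size, sample count, and the closure interface are assumptions.

```python
import torch

def hessian_trace_estimate(loss_fn, params, h=1e-3, n_samples=1):
    """Estimate tr(H) of the loss w.r.t. `params` via finite differences (sketch).

    loss_fn: zero-argument closure recomputing the loss from the current params.
    Uses E[v^T H v] with Rademacher v and H v ~ (g(w + h v) - g(w - h v)) / (2h).
    """
    params = list(params)
    trace = 0.0
    for _ in range(n_samples):
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]  # Rademacher +/-1

        def grads_at(sign):
            # Perturb the weights in place, take gradients, then undo the perturbation.
            with torch.no_grad():
                for p, v in zip(params, vs):
                    p.add_(sign * h * v)
            g = torch.autograd.grad(loss_fn(), params)
            with torch.no_grad():
                for p, v in zip(params, vs):
                    p.add_(-sign * h * v)
            return g

        g_plus, g_minus = grads_at(+1.0), grads_at(-1.0)
        hv = [(gp - gm) / (2 * h) for gp, gm in zip(g_plus, g_minus)]
        trace += sum((v * x).sum() for v, x in zip(vs, hv)) / n_samples
    return trace
```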



Paperid:681
Authors:Xiaopei Wu, Liang Peng, Liang Xie, Yuenan Hou, Binbin Lin, Xiaoshui Huang, Haifeng Liu, Deng Cai, Wanli Ouyang
State Key Lab of CAD&CG, Zhejiang University Shanghai AI Laboratory, State Key Lab of CAD&CG, Zhejiang University, State Key Lab of CAD&CG, Zhejiang University, Shanghai AI Laboratory, School of Software Technology, Zhejiang University, Shanghai AI Laboratory, State Key Lab of CAD&CG, Zhejiang University, State Key Lab of CAD&CG, Zhejiang University, Shanghai AI Laboratory
Abstract:
Semi-supervised learning aims to leverage numerous unlabeled data to improve model performance. Current semi-supervised 3D object detection methods typically use a teacher to generate pseudo labels for a student, and the quality of the pseudo labels is essential for the final performance. In this paper, we propose PatchTeacher, which focuses on partial-scene 3D object detection to provide high-quality pseudo labels for the student. Specifically, we divide a complete scene into a series of patches and feed them to our PatchTeacher sequentially. PatchTeacher leverages the low memory consumption of partial-scene detection to process point clouds with high-resolution voxelization, which can minimize the information loss of quantization and extract more fine-grained features. However, it is non-trivial to train a detector on fractions of the scene. Therefore, we introduce three key techniques, i.e., Patch Normalizer, Quadrant Align, and Fovea Selection, to improve the performance of PatchTeacher. Moreover, we devise PillarMix, a strong data augmentation strategy that mixes truncated pillars from different LiDAR scans to generate diverse training samples and thus helps the model learn a more general representation. Extensive experiments conducted on the Waymo and ONCE datasets verify the effectiveness and superiority of our method, and we achieve new state-of-the-art results, surpassing existing methods by a large margin. Code is available at https://github.com/LittlePey/PTPM.



Paperid:682
Authors:Xinyi Wu, Wentao Ma, Dan Guo, Tongqing Zhou, Shan Zhao, Zhiping Cai
National University of Defense Technology, Anhui Agricultural University, Hefei University of Technology, National University of Defense Technology, Hefei University of Technology, National University of Defense Technology
Abstract:
Text-based Person Re-identification (T-ReID), which aims at retrieving a specific pedestrian image from a collection of images via text-based information, has received significant attention. However, previous research has overlooked a challenging yet practical form of T-ReID: dealing with image galleries mixed with occluded and inconsistent personal visuals, instead of ideal visuals with a full-body and clear view. Its major challenges lie in the insufficiency of benchmark datasets and the enlarged semantic gap incurred by arbitrary occlusions, as well as the modality gap between text descriptions and visual representations of the target person. To alleviate these issues, we first design an Occlusion Generator (OGor) for the automatic generation of artificially occluded images from generic surveillance images. Then, a fine-granularity token selection mechanism is proposed to minimize the negative impact of occlusion for robust feature learning, and a novel multi-granularity contrastive consistency alignment framework is designed to leverage intra-/inter-granularity of visual-text representations for semantic alignment of occluded visuals and query texts. Experimental results demonstrate that our method exhibits superior performance. We believe this work could inspire the community to investigate more dedicated designs for implementing T-ReID in real-world scenarios. The source code is available at https://github.com/littlexinyi/MGCC.



Paperid:683
Authors:Yingrui Wu, Mingyang Zhao, Keqiang Li, Weize Quan, Tianqi Yu, Jianfeng Yang, Xiaohong Jia, Dong-Ming Yan
MAIS, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Beijing, China, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong, China, SenseTime Research, Shanghai, China, MAIS, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Beijing, China, School of Electronic and Information Engineering, Soochow University, Suzhou, China, School of Electronic and Information Engineering, Soochow University, Suzhou, China, AMSS, Chinese Academy of Sciences University of Chinese Academy of Sciences, Beijing, China, MAIS, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Beijing, China
Abstract:
This work presents an accurate and robust method for estimating normals from point clouds. In contrast to predecessor approaches that minimize the deviations between the annotated and the predicted normals directly, leading to direction inconsistency, we first propose a new metric termed Chamfer Normal Distance to address this issue. This not only mitigates the challenge but also facilitates network training and substantially enhances the network's robustness against noise. Subsequently, we devise an innovative architecture that encompasses Multiscale Local Feature Aggregation and Hierarchical Geometric Information Fusion. This design empowers the network to capture intricate geometric details more effectively and alleviates the ambiguity in scale selection. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets, particularly in scenarios contaminated by noise. Our implementation is available at https://github.com/YingruiWoo/CMG-Net_Pytorch.
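A hedged sketch of one plausible reading of a chamfer-style normal deviation: each noisy input point is matched to its nearest clean surface point, and the predicted normal is compared with that matched point's annotated normal, avoiding the index-wise direction mismatch mentioned above. The exact definition in the paper may differ; the function name and matching rule below are assumptions.

```python
import torch

def chamfer_normal_distance(noisy_pts, pred_normals, clean_pts, gt_normals):
    """Angular deviation against the nearest clean point's normal (sketch).

    noisy_pts:    [N, 3] noisy input points
    pred_normals: [N, 3] unit normals predicted at the noisy points
    clean_pts:    [M, 3] clean surface points
    gt_normals:   [M, 3] unit normals annotated on the clean points
    """
    # Match each noisy point to its nearest clean point (chamfer-style matching).
    d2 = torch.cdist(noisy_pts, clean_pts)        # [N, M] pairwise distances
    nn_idx = d2.argmin(dim=1)                     # [N]
    matched = gt_normals[nn_idx]                  # [N, 3]
    # Unoriented angular error: normals are treated as equivalent up to a sign flip.
    cos = (pred_normals * matched).sum(dim=1).abs().clamp(max=1.0)
    return torch.rad2deg(torch.acos(cos)).mean()
```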



Paperid:684
Authors:Zhiliang Wu, Changchang Sun, Hanyu Xuan, Gaowen Liu, Yan Yan
CCAI, Zhejiang University, China, Department of Computer Science, Illinois Institute of Technology, USA, School of Big Data and Statistics, Anhui University, China, Cisco Research, USA, Department of Computer Science, Illinois Institute of Technology, USA
Abstract:
Video inpainting aims to fill in the missing regions of video frames with plausible content. Benefiting from their outstanding long-range modeling capacity, transformer-based models have achieved unprecedented performance in terms of inpainting quality. Essentially, coherent contents from all the frames along both spatial and temporal dimensions are considered by a patch-wise attention module, and then the missing contents are generated based on the attention-weighted summation. In this way, attention retrieval accuracy has become the main bottleneck for improving video inpainting performance, where the factors affecting attention calculation should be explored to maximize the advantages of the transformer. Towards this end, in this paper, we theoretically certify that noise is the culprit that entangles the process of attention calculation. Meanwhile, we propose a novel wavelet transformer network with noise robustness for video inpainting, named WaveFormer. Unlike existing transformer-based methods that utilize the whole embeddings to calculate the attention, our WaveFormer first separates the noise existing in the embedding into high-frequency components by introducing the Discrete Wavelet Transform (DWT), and then adopts the clean low-frequency components to calculate the attention. In this way, the impact of noise on attention computation can be greatly mitigated, and the missing content regarding different frequencies can be generated by sharing the calculated attention. Extensive experiments validate the superior performance of our method over state-of-the-art baselines both qualitatively and quantitatively.
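A minimal sketch of the frequency-split idea: a one-level Haar-style approximation separates an embedding map into a clean low-frequency part and a noise-prone high-frequency residual, and attention is computed only on the low-frequency tokens. The real WaveFormer architecture and wavelet choice differ; this is only a self-contained illustration.

```python
import torch
import torch.nn.functional as F

def haar_lowpass_attention(feat):
    """Attention over low-frequency (Haar-approximation) tokens only (sketch).

    feat: [B, C, H, W] frame embeddings; H and W are assumed even.
    """
    # One-level Haar approximation: average each 2x2 block (low-frequency part).
    low = F.avg_pool2d(feat, kernel_size=2)                      # [B, C, H/2, W/2]
    # The residual carries the high-frequency (noise-prone) components.
    high = feat - F.interpolate(low, scale_factor=2, mode="nearest")

    B, C, h, w = low.shape
    tokens = low.flatten(2).transpose(1, 2)                      # [B, h*w, C]
    attn = torch.softmax(tokens @ tokens.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ tokens                                          # attention on clean tokens
    return out, high, attn
```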



Paperid:685
Authors:Zizhang Wu, Yuanzhu Gan, Yunzhe Wu, Ruihao Wang, Xiaoquan Wang, Jian Pu
Fudan University, ZongmuTech, ZongmuTech, ZongmuTech, ExploAI, Fudan University
Abstract:
Monocular 3D object detection usually adopts direct or hierarchical label supervision. Recently, distillation supervision has transferred spatial knowledge from LiDAR- or stereo-based teacher networks to monocular detectors, but the domain gap remains. To mitigate this issue and pursue adequate label manipulation, we exploit the Foreground Depth map for feature-supervised monocular 3D object detection, named FD3D, which develops high-quality instructive intermediate features to conduct desirable auxiliary feature supervision with only the original image and the annotated foreground object-wise depth map (AFOD) as input. Furthermore, we build up our instructive feature generation network to create instructive spatial features based on the sufficient correlation between image features and the pre-processed AFOD, where AFOD provides attention focus only on foreground objects to achieve clearer guidance in the detection task. Moreover, we apply auxiliary feature supervision at the pixel and distribution levels to achieve comprehensive spatial knowledge guidance. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both the KITTI and nuScenes datasets, with no external data and no extra inference computational cost. We also conduct quantitative and qualitative studies to reveal the effectiveness of our designs.



Paperid:686
Authors:Jiaer Xia, Lei Tan, Pingyang Dai, Mingbo Zhao, Yongjian Wu, Liujuan Cao
Xiamen University, Xiamen University, Xiamen University, Donghua University, Tencent Technology (Shanghai) Co.,Ltd, Xiamen University
Abstract:
Occluded person re-identification (Re-ID) aims to address the potential occlusion problem when matching occluded or holistic pedestrians from different camera views. Many methods use the background as artificial occlusion and rely on attention networks to exclude noisy interference. However, the significant discrepancy between simple background occlusion and realistic occlusion can negatively impact the generalization of the network. To address this issue, we propose a novel transformer-based Attention Disturbance and Dual-Path Constraint Network (ADP) to enhance the generalization of attention networks. Firstly, to imitate real-world obstacles, we introduce an Attention Disturbance Mask (ADM) module that generates offensive noise, which can distract attention like a realistic occluder, as a more complex form of occlusion. Secondly, to fully exploit these complex occluded images, we develop a Dual-Path Constraint Module (DPC) that can obtain preferable supervision information from holistic images through dual-path interaction. With our proposed method, the network can effectively circumvent a wide variety of occlusions using the basic ViT baseline. Comprehensive experimental evaluations conducted on person Re-ID benchmarks demonstrate the superiority of ADP over state-of-the-art methods.



Paperid:687
Authors:Yifan Xia, Yifan Lu, Yuan Gao, Jiayi Ma
Wuhan University, Wuhan University, Wuhan University, Wuhan University
Abstract:
In this paper, we address non-rigid shape matching with outliers by a novel and effective pointwise map refinement method, termed Locality Preserving Refinement. For accurate pointwise conversion from a given functional map, our method formulates a two-step procedure. Firstly, starting with noisy point-to-point correspondences, we identify inliers by leveraging neighborhood support, which yields a closed-form solution with linear time complexity. After obtaining the reliable correspondences of inliers, we refine the pointwise correspondences for outliers using local linear embedding, which operates in an adaptive spectral similarity space to further eliminate the ambiguities that are difficult to handle in the functional space. By refining pointwise correspondences with local consistency and thus embedding geometric constraints into functional spaces, our method achieves considerable improvement in accuracy with linearithmic time and space cost. Extensive experiments on public benchmarks demonstrate the superiority of our method over the state-of-the-art methods. Our code is publicly available at https://github.com/XiaYifan1999/LOPR.



Paperid:688
Authors:Wei Xiang, Haoteng YIN, He Wang, Xiaogang Jin
Zhejiang University, Purdue University, University College London, Zhejiang University
Abstract:
Pedestrian trajectory prediction is a key technology in many applications, providing insights into human behavior and anticipating future human motions. Most existing empirical models are explicitly formulated from observed human behaviors using explicable mathematical terms with a deterministic nature, while recent work has focused on developing hybrid models combined with learning-based techniques for powerful expressiveness while maintaining explainability. However, the deterministic nature of the learned steering behaviors from the empirical models limits the models' practical performance. To address this issue, this work proposes the social conditional variational autoencoder (SocialCVAE) for predicting pedestrian trajectories, which employs a CVAE to explore behavioral uncertainty in human motion decisions. SocialCVAE learns socially reasonable motion randomness by utilizing a socially explainable interaction energy map as the CVAE's condition, which illustrates the future occupancy of each pedestrian's local neighborhood area. The energy map is generated using an energy-based interaction model, which anticipates the energy cost (i.e., repulsion intensity) of pedestrians' interactions with neighbors. Experimental results on two public benchmarks including 25 scenes demonstrate that SocialCVAE significantly improves prediction accuracy compared with state-of-the-art methods, with up to 16.85% improvement in Average Displacement Error (ADE) and 69.18% improvement in Final Displacement Error (FDE). Code is available at: https://github.com/ViviXiang/SocialCVAE.



Paperid:689
Authors:Jianyang Xie, Yanda Meng, Yitian Zhao, Anh Nguyen, Xiaoyun Yang, Yalin Zheng
CDT in Distributed Algorithms, School of EEE&CS, University of Liverpool, UK Department of Eye and Vision Sciences, University of Liverpool, Liverpool, UK, Department of Eye and Vision Sciences, University of Liverpool, Liverpool, UK Liverpool Centre for Cardiovascular Science, Liverpool, UK, Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, CAS, Cixi, China, Department of Computer Sciences, University of Liverpool, Liverpool, UK, Remark AI UK Limited, London, UK, Department of Eye and Vision Sciences, University of Liverpool, Liverpool, UK Liverpool Centre for Cardiovascular Science, Liverpool, UK
Abstract:
Graph convolutional networks (GCNs) have attracted great attention and achieved remarkable performance in skeleton-based action recognition. However, most previous works are designed to refine skeleton topology without considering the types of different joints and edges, making them unable to represent the semantic information. In this paper, we propose a dynamic semantic-based graph convolution network (DS-GCN) for skeleton-based human action recognition, in which the joint and edge types are encoded in the skeleton topology in an implicit way. Specifically, two semantic modules, the joint type-aware adaptive topology and the edge type-aware adaptive topology, are proposed. Combining the proposed semantic modules with temporal convolution, a powerful framework named DS-GCN is developed for skeleton-based action recognition. Extensive experiments on two datasets, NTU-RGB+D and Kinetics-400, show that the proposed semantic modules are general enough to be utilized in various backbones for boosting recognition accuracy. Meanwhile, the proposed DS-GCN notably outperforms state-of-the-art methods. The code is released at https://github.com/davelailai/DS-GCN.



Paperid:690
Authors:Pan Xie, Qipeng Zhang, Peng Taiying, Hao Tang, Yao Du, Zexian Li
Beihang University, Beihang University, Beihang University, Carnegie Mellon University, Beihang University, Beihang University
Abstract:
The Sign Language Production (SLP) project aims to automatically translate spoken languages into sign sequences. Our approach focuses on the transformation of sign gloss sequences into their corresponding sign pose sequences (G2P). In this paper, we present a novel solution for this task by converting the continuous pose space generation problem into a discrete sequence generation problem. We introduce the Pose-VQVAE framework, which combines Variational Autoencoders (VAEs) with vector quantization to produce a discrete latent representation for continuous pose sequences. Additionally, we propose the G2P-DDM model, a discrete denoising diffusion architecture for length-varied discrete sequence data, to model the latent prior. To further enhance the quality of pose sequence generation in the discrete space, we present the CodeUnet model to leverage spatial-temporal information. Lastly, we develop a heuristic sequential clustering method to predict variable lengths of pose sequences for corresponding gloss sequences. Our results show that our model outperforms state-of-the-art G2P models on the public SLP evaluation benchmark. For more generated results, please visit our project page: https://slpdiffusier.github.io/g2p-ddm.



Paperid:691
Authors:Zhao Xie, Yadong Shi, Kewei Wu, Yaru Cheng, Dan Guo
Hefei University of Technology, Hefei University of Technology, Hefei University of Technology, Hefei University of Technology, Hefei University of Technology Hefei Comprehensive National Science Center Anhui Zhonghuitong Technology Co., Ltd
Abstract:
Action anticipation aims to infer the action in the unobserved (future) segment from the observed (past) segment. Existing methods focus on learning key past semantics to predict the future, but they do not model the temporal continuity between the past and the future. However, past actions are always highly uncertain in anticipating the unobserved future. The absence of temporal continuity smoothing over the video's past-and-future segments may result in an inconsistent anticipation of future action. In this work, we aim to smooth the global semantic changes in the past and future segments. We propose a Consistency-guided Probabilistic Model (CPM), which focuses on learning globally temporal probabilistic consistency to inhibit unexpected temporal inconsistency. The CPM is deployed on the Transformer architecture, which includes three modules of future semantics estimation, global semantics estimation, and global distribution estimation, involving the learning of past-to-future semantics, past-and-future semantics, and semantically probabilistic distributions. To achieve the smoothness of temporal continuity, we follow the principle of variational analysis and describe two probabilistic distributions, i.e., a past-aware distribution and a global-aware distribution, which help to estimate the evidence lower bound of future anticipation. In this study, we maximize the evidence lower bound of future semantics by reducing the distribution distance between the above two distributions for model optimization. Extensive experiments demonstrate the effectiveness of our method, and the CPM achieves state-of-the-art performance on Epic-Kitchens-100, Epic-Kitchens-55, and EGTEA Gaze+.



Paperid:692
Authors:Zhenyu Xie, Yang Wu, Xuehao Gao, Zhongqian Sun, Wei Yang, Xiaodan Liang
Sun Yat-sen University, Tencent AI Lab, Xi'an Jiao Tong University, Tencent AI Lab, Tencent AI Lab, Sun Yat-sen University DarkMatter AI Research
Abstract:
Text-guided motion synthesis aims to generate 3D human motion that not only precisely reflects the textual description but also reveals motion details as much as possible. Pioneering methods explore the diffusion model for text-to-motion synthesis and obtain significant superiority. However, these methods conduct diffusion processes either on the raw data distribution or in a low-dimensional latent space, which typically suffer from the problems of modality inconsistency or scarce detail. To tackle this problem, we propose a novel Basic-to-Advanced Hierarchical Diffusion Model, named B2A-HDM, to collaboratively exploit low-dimensional and high-dimensional diffusion models for high-quality, detailed motion synthesis. Specifically, the basic diffusion model in the low-dimensional latent space provides an intermediate denoising result that is consistent with the textual description, while the advanced diffusion model in the high-dimensional latent space focuses on the subsequent detail-enhancing denoising process. Besides, we introduce a multi-denoiser framework for the advanced diffusion model to ease the learning of the high-dimensional model and fully explore the generative potential of the diffusion model. Quantitative and qualitative experimental results on two text-to-motion benchmarks (HumanML3D and KIT-ML) demonstrate that B2A-HDM outperforms existing state-of-the-art methods in terms of fidelity, modality consistency, and diversity.



Paperid:693
Authors:Meng Xing, Zhiyong Feng, Yong Su, Changjae Oh
Tianjin University; Queen Mary University of London, Tianjin University, Tianjin Normal University, Queen Mary University of London
Abstract:
Detecting out-of-distribution (OOD) inputs is crucial for safely deploying machine learning models in the real world. However, existing OOD detection methods require an in-distribution (ID) dataset to retrain the models. In this paper, we propose a Deep Generative Model (DGM)-based transferable OOD detection method that does not require retraining on a new ID dataset. We first establish and substantiate two hypotheses on DGMs: DGMs exhibit a predisposition towards acquiring low-level features in preference to semantic information; and the lower bound of a DGM's log-likelihood is tied to the conditional entropy between the model input and target output. Drawing on the aforementioned hypotheses, we present an innovative image-erasing strategy, which is designed to create distinct conditional entropy distributions for each individual ID dataset. By training a DGM on a complex dataset with the proposed image-erasing strategy, the DGM can capture the discrepancy of conditional entropy distributions for varying ID datasets without re-training. We validate the proposed method on five datasets and show that, without retraining, our method achieves performance comparable to state-of-the-art group-based OOD detection methods. The project code will be open-sourced on our project website.



Paperid:694
Authors:Zheng Xing, Weibing Zhao
The Chinese University of Hong Kong, Shenzhen, The Chinese University of Hong Kong, Shenzhen
Abstract:
Action segmentation serves as a pivotal component in comprehending videos, encompassing the learning of a sequence of semantically consistent action units known as actoms. Conventional methodologies tend to require significant time for both training and learning phases. This paper introduces an innovative unsupervised framework for action segmentation in video, characterized by its fast learning capability and the absence of mandatory training. The core idea involves splitting the video into distinct actoms, which are then merged together based on shared actions. The key challenge here is to prevent the inadvertent creation of singular actoms that attempt to represent multiple actions during the splitting phase. Additionally, it is crucial to avoid situations where actoms associated with the same action are incorrectly grouped into multiple clusters during the merging phase. In this paper, we present a method for calculating the similarity between adjacent frames under a subspace assumption. Then, we employ a local minimum searching procedure, which effectively splits the video into coherent actoms aligned with their semantic meaning and provides an action segmentation proposal. Subsequently, we calculate a spatiotemporal similarity between actoms, followed by a merging process that merges actoms representing identical actions within the action segmentation proposals. Our approach is evaluated on four benchmark datasets, and the results demonstrate that our method achieves state-of-the-art performance. Besides, our method also achieves the optimal balance between accuracy and learning time compared to existing unsupervised techniques. Code is available at https://github.com/y66y/SaM.
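A minimal sketch of the splitting phase described above: adjacent-frame cosine similarity is computed and actom boundaries are proposed at its local minima. The subspace-based similarity and the subsequent merging phase are omitted; the smoothing window is an assumption.

```python
import numpy as np
from scipy.signal import argrelmin

def split_into_actoms(frame_feats, smooth=5):
    """Propose actom boundaries from adjacent-frame similarity (sketch).

    frame_feats: [T, D] per-frame features of one video.
    Returns indices of frames where a new actom is assumed to start.
    """
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    sim = (f[:-1] * f[1:]).sum(axis=1)                 # [T-1] adjacent cosine similarity
    # Light smoothing so spurious single-frame dips do not create actoms.
    kernel = np.ones(smooth) / smooth
    sim = np.convolve(sim, kernel, mode="same")
    # Local minima of the similarity curve mark candidate actom boundaries.
    boundaries = argrelmin(sim, order=smooth)[0] + 1
    return boundaries
```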



Paperid:695
Authors:Kezheng Xiong, Maoji Zheng, Qingshan Xu, Chenglu Wen, Siqi Shen, Cheng Wang
Xiamen University, Xiamen University, Nanyang Technological University, Xiamen University, Xiamen University, Xiamen University
Abstract:
Point cloud registration, a fundamental task in 3D computer vision, has remained largely unexplored for cross-source point clouds and unstructured scenes. The primary challenges arise from noise, outliers, and variations in scale and density. However, the neglected geometric nature of point clouds restricts the performance of current methods. In this paper, we propose a novel method termed SPEAL to leverage skeletal representations for effective learning of the intrinsic topologies of point clouds, facilitating robust capture of geometric intricacy. Specifically, we design the Skeleton Extraction Module to extract skeleton points and skeletal features in an unsupervised manner, which is inherently robust to noise and density variations. Then, we propose the Skeleton-Aware GeoTransformer to encode high-level skeleton-aware features. It explicitly captures the topological nature and inter-point-cloud skeletal correlations with the noise-robust and density-invariant skeletal representations. Next, we introduce the Correspondence Dual-Sampler to facilitate correspondences by augmenting the correspondence set with skeletal correspondences. Furthermore, we construct a challenging novel cross-source point cloud dataset named KITTI Cross-Source for benchmarking cross-source point cloud registration methods. Extensive quantitative and qualitative experiments are conducted to demonstrate our approach's superiority and robustness on both cross-source and same-source datasets. To the best of our knowledge, our approach is the first to facilitate point cloud registration with skeletal geometric priors.



Paperid:696
Authors:Jiakun Xu, Bowen Xu, Gui-Song Xia, Liang Dong, Nan Xue
Wuhan University, Wuhan University, Wuhan University, Google Inc., Wuhan University Ant Group
Abstract:
This paper presents a novel approach to computing vector road maps from satellite remotely sensed images, building upon a well-defined Patched Line Segment (PaLiS) representation for road graphs that holds geometric significance. Unlike prevailing methods that derive road vector representations from satellite images using binary masks or keypoints, our method employs line segments. These segments not only convey road locations but also capture their orientations, making them a robust choice for representation. More precisely, given an input image, we divide it into non-overlapping patches and predict a suitable line segment within each patch. This strategy enables us to capture spatial and structural cues from these patch-based line segments, simplifying the process of constructing the road network graph without the necessity of additional neural networks for connectivity. In our experiments, we demonstrate how an effective representation of a road graph significantly enhances the performance of vector road mapping on established benchmarks, without requiring extensive modifications to the neural network architecture. Furthermore, our method achieves state-of-the-art performance with just 6 GPU hours of training, leading to a substantial 32-fold reduction in training costs in terms of GPU hours.



Paperid:697
Authors:Jiaqi Xu, Bo Liu, Yunkuo Chen, Mengli Cheng, Xing Shi
Alibaba Group, Alibaba Group, Alibaba Group, Alibaba Group, Alibaba Group
Abstract:
Video-and-language understanding has a variety of applications in industry, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which consume high computational costs. In particular, they have difficulty dealing with the dense video frames or long text prevalent in industrial applications. This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves efficient and effective feature fusion and rapid adaptation to downstream tasks. Specifically, we design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules to sample long sequences and fuse multi-modal features, which reduces the computational costs and addresses the performance degradation caused by previous samplers. Therefore, MuLTI can handle longer sequences with limited computational costs. Then, to further enhance the model's performance and fill the lack of pretraining tasks in video question answering, we propose a new pretraining task named Multiple Choice Modeling. This task bridges the gap between pretraining and downstream tasks and improves the model's ability to align video and text features. Benefiting from the efficient feature fusion module and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. Implementation and pretrained models will be released.



Paperid:698
Authors:Junkai Xu, Liang Peng, Haoran Cheng, Linxuan Xia, Qi Zhou, Dan Deng, Wei Qian, Wenxiao Wang, Deng Cai
State Key Lab of CAD&CG, Zhejiang University Fabu Inc., State Key Lab of CAD&CG, Zhejiang University Fabu Inc., State Key Lab of CAD&CG, Zhejiang University Fabu Inc., State Key Lab of CAD&CG, Zhejiang University Fabu Inc., State Key Lab of CAD&CG, Zhejiang University Fabu Inc., Fabu Inc., Fabu Inc., School of Software Technology, Zhejiang University, State Key Lab of CAD&CG, Zhejiang University Fabu Inc.
Abstract:
Multi-camera perception tasks have gained significant attention in the field of autonomous driving. However, existing frameworks based on Lift-Splat-Shoot (LSS) in the multi-camera setting cannot produce suitable dense 3D features due to the projection nature and the uncontrollable densification process. To resolve this problem, we propose to regulate intermediate dense 3D features with the help of volume rendering. Specifically, we employ volume rendering to process the dense 3D features to obtain corresponding 2D features (e.g., depth maps, semantic maps), which are supervised by associated labels during training. This manner regulates the generation of dense 3D features at the feature level, providing appropriate dense and unified features for multiple perception tasks. Therefore, our approach is termed Vampire, which stands for "Volume rendering As Multi-camera Perception Intermediate feature REgulator". Experimental results on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates fine-grained and appropriate extraction of dense 3D features, and is competitive with existing SOTA methods across diverse downstream perception tasks such as 3D occupancy prediction, LiDAR segmentation, and 3D object detection, while utilizing moderate GPU resources. We provide a video demonstration in the supplementary materials, and code is available at github.com/cskkxjk/Vampire.
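A minimal sketch of the volume-rendering step used to turn dense 3D features into supervisable 2D maps: density sampled along each camera ray is alpha-composited into an expected depth per pixel. The ray sampling, the density head, and the far-plane padding are placeholders; the paper's full regulator also renders semantic maps.

```python
import torch

def render_depth(density, depths):
    """Alpha-composite per-sample density into an expected depth (sketch).

    density: [B, R, S] non-negative densities at S samples along each of R rays
    depths:  [B, R, S] metric depth of each sample along its ray
    Returns: [B, R] expected (rendered) depth per ray/pixel.
    """
    delta = depths[..., 1:] - depths[..., :-1]
    delta = torch.cat([delta, torch.full_like(delta[..., :1], 1e3)], dim=-1)
    alpha = 1.0 - torch.exp(-density * delta)                   # opacity per sample
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = alpha * trans                                     # rendering weights
    return (weights * depths).sum(dim=-1)
```

The rendered depth (and, analogously, rendered semantics) can then be compared against 2D labels, so gradients flow back into the dense 3D feature volume.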



Paperid:699
Authors:Ke Xu, Tsun Wai Siu, Rynson W.H. Lau
City University of Hong Kong, City University of Hong Kong, City University of Hong Kong
Abstract:
Mirror detection is an active research topic in computer vision. However, all existing mirror detectors learn mirror representations from large-scale pixel-wise datasets, which are tedious and expensive to obtain. Although weakly-supervised learning has been widely explored in related topics, we note that popular weak supervision signals (e.g., bounding boxes, scribbles, points) still require some effort from the user to locate the target objects, with a strong assumption that the images to annotate always contain the target objects. Such an assumption may result in the over-segmentation of mirrors. Our key idea in this work is that the existence of mirrors over a time period may serve as weak supervision to train a mirror detector, for two reasons. First, if a network can predict the existence of mirrors, it can essentially locate the mirrors. Second, we observe that the reflected contents of a mirror tend to be similar to those in adjacent frames, but exhibit considerable contrast to regions in far-away frames (e.g., non-mirror frames). To this end, in this paper, we propose ZOOM, the first method to learn robust mirror representations from extremely weak annotations of per-frame ZerO-One Mirror indicators in videos. The key insight of ZOOM is to model the similarity and contrast (between mirror and non-mirror regions) in temporal variations to locate and segment the mirrors. To this end, we propose a novel fusion strategy to leverage temporal consistency information for mirror localization, and a novel temporal similarity-contrast modeling module for mirror segmentation. We construct a new video mirror dataset for training and evaluation. Experimental results under new and standard metrics show that ZOOM performs favorably against existing fully-supervised mirror detection methods.



Paperid:700
Authors:Lingjing Xu, Yang Gao, Wenfeng Song, Aimin Hao
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China, Computer School, Beijing Information Science and Technology University, China, State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China
Abstract:
To enhance the interaction between intelligent systems and the environment, locating the affordance regions of objects is crucial. These regions correspond to specific areas that provide distinct functionalities. Humans often acquire the ability to identify these regions through action demonstrations and verbal instructions. In this paper, we present a novel multimodal framework that extracts affordance knowledge from exocentric images, which depict human-object interactions, as well as from accompanying textual descriptions that describe the performed actions. The extracted knowledge is then transferred to egocentric images. To achieve this goal, we propose the HOI-Transfer Module, which utilizes local perception to disentangle individual actions within exocentric images. This module effectively captures localized features and correlations between actions, leading to valuable affordance knowledge. Additionally, we introduce the Pixel-Text Fusion Module, which fuses affordance knowledge by identifying regions in egocentric images that bear resemblances to the textual features defining affordances. We employ a Weakly Supervised Multimodal Affordance (WSMA) learning approach, utilizing image-level labels for training. Through extensive experiments, we demonstrate the superiority of our proposed method in terms of evaluation metrics and visual results when compared to existing affordance grounding models. Furthermore, ablation experiments confirm the effectiveness of our approach. Code: https://github.com/xulingjing88/WSMA.



Paperid:701
Authors:Mingjie Xu, Feng Lu
State Key Laboratory of VR Technology and Systems, School of CSE, Beihang University, Beijing, China, State Key Laboratory of VR Technology and Systems, School of CSE, Beihang University, Beijing, China Peng Cheng Laboratory, Shenzhen, China
Abstract:
Gaze estimation aims to accurately estimate the direction or position at which a person is looking. With the development of deep learning techniques, a number of gaze estimation methods have been proposed and have achieved state-of-the-art performance. However, these methods are limited to within-dataset settings, and their performance drops when tested on unseen datasets. We argue that this is caused by the infinite and continuous nature of gaze labels. To alleviate this problem, we propose using gaze frontalization as an auxiliary task to constrain gaze estimation. Based on this, we propose a novel gaze domain generalization framework named Gaze Frontalization-based Auxiliary Learning (GFAL), which embeds the gaze frontalization process, i.e., guiding the feature so that the eyeball can rotate and look at the front (camera), without any target domain information during training. Experimental results show that our proposed framework achieves state-of-the-art performance on the gaze domain generalization task, which is competitive with or even superior to the SOTA gaze unsupervised domain adaptation methods.



Paperid:702
Authors:QiHao Xu, Xiaoling Luo, Chao Huang, Chengliang Liu, Jie Wen, Jialei Wang, Yong Xu
Shenzhen University, Shenzhen Harbin Institute of Technology, Shenzhen, Shenzhen University, Shenzhen Harbin Institute of Technology, Shenzhen, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen
Abstract:
Diabetic Retinopathy (DR), the leading cause of blindness in diabetic patients, is diagnosed from the condition of multiple retinal lesions. As a difficult task in medical image segmentation, DR multi-lesion segmentation faces the following main concerns. On the one hand, retinal lesions vary in location, shape, and size. On the other hand, because some lesions occupy only a very small part of the entire fundus image, the high proportion of background leads to difficulties in lesion segmentation. To solve the above problems, we propose a heterogeneous-aware convolutional network (HACDR-Net) that composes heterogeneous cross-convolution, heterogeneous modulated deformable convolution, and optional near-far-aware convolution. Our network introduces an adaptive aggregation module to summarize the heterogeneous feature maps and obtain diverse lesion areas in the heterogeneous receptive field along the channel and spatial dimensions. In addition, to solve the problem of the highly imbalanced proportion of focal areas, we design a new medical image segmentation loss function, Noise Adjusted Loss (NALoss). NALoss balances the predictive feature distribution of background and lesion by jointly using Gaussian noise and hard example mining, thus enhancing awareness of lesions. We conduct experiments on the public IDRiD and DDR datasets, and the experimental results show that the proposed method achieves better performance than other state-of-the-art methods. The code is open-sourced at github.com/xqh180110910537/HACDR-Net.



Paperid:703
Authors:Sen Xu, Shikui Wei, Tao Ruan, Lixin Liao
Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network Technology, Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network Technology, Frontiers Science Center for Smart High-speed Railway System, Beijing Jiaotong University, DaoAI Robotics Inc.
Abstract:
Deep superpixel algorithms have made remarkable strides by substituting handcrafted features with learnable ones. Nevertheless, we observe that existing deep superpixel methods, serving as mid-level representation operations, remain sensitive to the statistical properties (e.g., color distribution, high-level semantics) embedded within the training dataset. Consequently, learnable features exhibit constrained discriminative capability, resulting in unsatisfactory pixel grouping performance, particularly in untrainable application scenarios. To address this issue, we propose the Content Disentangle Superpixel (CDS) algorithm to selectively separate the invariant inter-pixel correlations from statistical properties, i.e., style noise. Specifically, we first construct auxiliary modalities that are homologous to the original RGB image but have substantial stylistic variations. Then, driven by mutual information, we propose local-grid correlation alignment across modalities to reduce the distribution discrepancy of adaptively selected features and learn invariant inter-pixel correlations. Afterwards, we perform global-style mutual information minimization to enforce the separation of invariant content and training-data styles. Experimental results on four benchmark datasets demonstrate the superiority of our approach over existing state-of-the-art methods regarding boundary adherence, generalization, and efficiency. Code and pre-trained models are available at https://github.com/rookiie/CDSpixel.



Paperid:704
Authors:Shuning Xu, Binbin Song, Xiangyu Chen, Jiantao Zhou
State Key Laboratory of Internet of Things for Smart City, University of Macau, State Key Laboratory of Internet of Things for Smart City, University of Macau, State Key Laboratory of Internet of Things for Smart City, University of Macau Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, State Key Laboratory of Internet of Things for Smart City, University of Macau
Abstract:
Moiré patterns occur when capturing images or videos on screens, severely degrading the quality of the captured images or videos. Despite recent progress, existing video demoiréing methods neglect the physical characteristics and formation process of moiré patterns, significantly limiting the effectiveness of video recovery. This paper presents a unified framework, DTNet, a direction-aware and temporal-guided bilateral learning network for video demoiréing. DTNet effectively incorporates the processes of moiré pattern removal, alignment, color correction, and detail refinement. Our proposed DTNet comprises two primary stages: Frame-level Direction-aware Demoiréing and Alignment (FDDA) and Tone and Detail Refinement (TDR). In FDDA, we employ multiple directional DCT modes to perform moiré pattern removal in the frequency domain, effectively detecting the prominent moiré edges. Then, coarse and fine-grained alignment is applied to the demoiréd features to facilitate the utilization of neighboring information. In TDR, we propose a temporal-guided bilateral learning pipeline to mitigate the degradation of color and details caused by the moiré patterns while preserving the frequency information restored in FDDA. Guided by the aligned temporal features from FDDA, the affine transformations for the recovery of the final clean frames are learned in TDR. Extensive experiments demonstrate that our video demoiréing method outperforms state-of-the-art approaches by 2.3 dB in PSNR, and also delivers a superior visual experience.



Paperid:705
Authors:Wenhao Xu, Rongtao Xu, Changwei Wang, Shibiao Xu, Li Guo, Man Zhang, Xiaopeng Zhang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence,University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence,University of Chinese Academy of Sciences, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Abstract:
Recently, CLIP has found practical utility in pixel-level zero-shot segmentation tasks. The present landscape features two-stage methodologies beset by issues such as intricate pipelines and elevated computational costs. While current one-stage approaches alleviate these concerns and incorporate Visual Prompt Training (VPT) to uphold CLIP's generalization capacity, they still fall short of fully harnessing CLIP's potential for pixel-level unseen-class demarcation and precise pixel predictions. To further stimulate CLIP's zero-shot dense prediction capability, we propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from the image level to the pixel level. Specifically, we first introduce Spectral Prompt Tuning (SPT), incorporating spectral prompts into the shallow layers of the CLIP visual encoder to capture the structural intricacies of images, thereby enhancing comprehension of unseen classes. Subsequently, we introduce the Spectral Guided Decoder (SGD), which utilizes both high- and low-frequency information to steer the network's spatial focus towards more prominent classification features, enabling precise pixel-level prediction outcomes. Through extensive experiments on two public datasets, we demonstrate the superiority of our method over state-of-the-art approaches, performing well across all classes and particularly excelling in handling unseen classes.



Paperid:706
Authors:Zhengze Xu, Dongyue Wu, Changqian Yu, Xiangxiang Chu, Nong Sang, Changxin Gao
Huazhong University of Science and Technology, Huazhong University of Science and Technology, Meituan, Meituan, Huazhong University of Science and Technology, Huazhong University of Science and Technology
Abstract:
Recent real-time semantic segmentation methods usually adopt an additional semantic branch to pursue rich long-range context. However, the additional branch incurs undesirable computational overhead and slows the inference speed. To eliminate this dilemma, we propose SCTNet, a single-branch CNN with transformer semantic information for real-time segmentation. SCTNet enjoys the rich semantic representations of an inference-free semantic branch while retaining the high efficiency of a lightweight single-branch CNN. SCTNet utilizes a transformer as the training-only semantic branch, considering its superb ability to extract long-range context. With the help of the proposed transformer-like CNN block, CFBlock, and the semantic information alignment module, SCTNet can capture the rich semantic information from the transformer branch during training. During inference, only the single-branch CNN needs to be deployed. We conduct extensive experiments on Cityscapes, ADE20K, and COCO-Stuff-10K, and the results show that our method achieves new state-of-the-art performance. The code and models are available at https://github.com/xzz777/SCTNet.
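A minimal sketch of the training-only semantic branch pattern: a transformer's features supervise the CNN's features through an alignment loss during training, and only the CNN runs at inference. The projection layer, loss choice, and whether the transformer receives gradients are simplified assumptions, not SCTNet's actual alignment module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedSingleBranch(nn.Module):
    """Single-branch CNN aligned with a training-only transformer branch (sketch)."""

    def __init__(self, cnn, transformer, cnn_dim, tr_dim):
        super().__init__()
        self.cnn = cnn                      # deployed at inference
        self.transformer = transformer      # training-only semantic branch
        self.proj = nn.Conv2d(cnn_dim, tr_dim, kernel_size=1)

    def forward(self, x, align=False):
        cnn_feat = self.cnn(x)              # [B, cnn_dim, H, W]
        if not align:                       # inference path: CNN only
            return cnn_feat
        with torch.no_grad():               # the transformer acts as a fixed target here
            tr_feat = self.transformer(x)   # [B, tr_dim, H, W]
        align_loss = F.mse_loss(self.proj(cnn_feat), tr_feat)
        return cnn_feat, align_loss
```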



Paperid:707
Authors:Zunnan Xu, Yachao Zhang, Sicheng Yang, Ronghui Li, Xiu Li
Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University
Abstract:
This study aims to improve the generation of 3D gestures by utilizing multimodal information from human speech. Previous studies have focused on incorporating additional modalities to enhance the quality of generated gestures. However, these methods perform poorly when certain modalities are missing during inference. To address this problem, we suggest using speech-derived multimodal priors to improve gesture generation. We introduce a novel method that separates priors from speech and employs multimodal priors as constraints for generating gestures. Our approach utilizes a chain-like modeling method to generate facial blendshapes, body movements, and hand gestures sequentially. Specifically, we incorporate rhythm cues derived from facial deformation and a stylization prior based on speech emotions into the process of generating gestures. By incorporating multimodal priors, our method improves the quality of generated gestures and eliminates the need for expensive setup preparation during inference. Extensive experiments and user studies confirm that our proposed approach achieves state-of-the-art performance.



Paperid:708
Authors:Shiyu Xuan, Shiliang Zhang
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China
Abstract:
Supervised Contrastive Loss (SCL) is popular in visual representation learning. Given an anchor image, SCL pulls two types of positive samples, i.e., its augmentation and other images from the same class, together, while pushing negative images apart to optimize the learned embedding. In the scenario of long-tailed recognition, where the number of samples in each class is imbalanced, treating the two types of positive samples equally leads to biased optimization of the intra-category distance. In addition, the similarity relationships among negative samples, which are ignored by SCL, also present meaningful semantic cues. To improve the performance on long-tailed recognition, this paper addresses these two issues of SCL by decoupling the training objective. Specifically, it decouples the two types of positives in SCL and optimizes their relations toward different objectives to alleviate the influence of the imbalanced dataset. We further propose a patch-based self-distillation to transfer knowledge from head to tail classes to relieve the under-representation of tail classes. It uses patch-based features to mine shared visual patterns among different instances and leverages a self-distillation procedure to transfer such knowledge. Experiments on different long-tailed classification benchmarks demonstrate the superiority of our method. For instance, it achieves 57.7% top-1 accuracy on the ImageNet-LT dataset. Combined with the ensemble-based method, the performance can be further boosted to 59.7%, which substantially outperforms many recent works. Our code will be released.
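A minimal sketch of decoupling the two types of positives in a supervised contrastive loss: the augmentation positive and the same-class positives contribute separate terms with their own weights instead of being averaged together as in vanilla SCL. The weights, temperature, and single-anchor formulation are assumptions for illustration; the paper's decoupled objectives may differ.

```python
import torch
import torch.nn.functional as F

def decoupled_scl(z_anchor, z_aug, z_batch, labels, anchor_label,
                  w_aug=1.0, w_class=0.5, tau=0.1):
    """Supervised contrastive loss with decoupled positive terms (sketch).

    z_anchor: [D]     embedding of the anchor image
    z_aug:    [D]     embedding of the anchor's augmentation
    z_batch:  [N, D]  embeddings of the other images in the batch
    labels:   [N]     class labels of z_batch; anchor_label is the anchor's class
    """
    z_anchor = F.normalize(z_anchor, dim=0)
    z_aug = F.normalize(z_aug, dim=0)
    z_batch = F.normalize(z_batch, dim=1)

    all_z = torch.cat([z_aug.unsqueeze(0), z_batch], dim=0)      # [N+1, D]
    logits = (all_z @ z_anchor) / tau                            # [N+1]
    log_prob = logits - torch.logsumexp(logits, dim=0)

    # Augmentation positive and class positives get separate weights,
    # instead of being pooled into one average as in vanilla SCL.
    loss_aug = -log_prob[0]
    class_mask = labels == anchor_label
    loss_class = (-log_prob[1:][class_mask].mean()
                  if class_mask.any() else log_prob.new_zeros(()))
    return w_aug * loss_aug + w_class * loss_class
```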



Paperid:709
Authors:Lulu Xue, Shengshan Hu, Ruizhi Zhao, Leo Yu Zhang, Shengqing Hu, Lichao Sun, Dezhong Yao
Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Griffith University, Huazhong University of Science and Technology, Lehigh University, Huazhong University of Science and Technology
Abstract:
Collaborative learning (CL) is a distributed learning framework that aims to protect user privacy by allowing users to jointly train a model by sharing only their gradient updates. However, gradient inversion attacks (GIAs), which recover users' training data from shared gradients, impose severe privacy threats on CL. Existing defense methods adopt different techniques, e.g., differential privacy, cryptography, and perturbation defenses, to defend against the GIAs. Nevertheless, all current defense methods suffer from a poor trade-off between privacy, utility, and efficiency. To mitigate the weaknesses of existing solutions, we propose a novel defense method, Dual Gradient Pruning (DGP), based on gradient pruning, which can improve communication efficiency while preserving the utility and privacy of CL. Specifically, DGP slightly changes gradient pruning with a stronger privacy guarantee. DGP can also significantly improve communication efficiency, with a theoretical analysis of its convergence and generalization. Our extensive experiments show that DGP can effectively defend against the most powerful GIAs and reduce the communication cost without sacrificing the model's utility.
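A minimal sketch of plain magnitude-based gradient pruning, the building block DGP modifies: only the largest-magnitude gradient entries are kept before an update is shared, which reduces communication and removes much of the fine-grained signal that inversion attacks exploit. The "dual" pruning rule and privacy analysis of DGP are not reproduced here; the keep ratio is a placeholder.

```python
import torch

def prune_gradient(grad, keep_ratio=0.01):
    """Keep only the largest-magnitude entries of a gradient tensor (sketch)."""
    flat = grad.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    topk = flat.abs().topk(k)
    mask = torch.zeros_like(flat)
    mask[topk.indices] = 1.0
    return (flat * mask).view_as(grad)

# Usage: applied to every parameter gradient before it is shared, e.g.
# shared = {name: prune_gradient(p.grad) for name, p in model.named_parameters()}
```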



Paperid:710
Authors:Mufan Xue, Xinyu Wu, Jinlong Li, Xuesong Li, Guoyuan Yang
Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China, Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China, Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China, Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China School of Medical Technology, Beijing Institute of Technology, Beijing 100081, China
Abstract:
Recently, convolutional neural networks (CNNs) have become the best quantitative encoding models for capturing neural activity and hierarchical structure in the ventral visual pathway. However, the weak interpretability of these black-box models hinders their ability to reveal visual representational encoding mechanisms. Here, we propose a convolutional neural network interpretable framework (CNN-IF) aimed at providing a transparent interpretable encoding model for the ventral visual pathway. First, we adapt the feature-weighted receptive field framework to train two high-performing ventral visual pathway encoding models using large-scale functional Magnetic Resonance Imaging (fMRI) data in both goal-driven and data-driven approaches. We find that network layer-wise predictions align with the functional hierarchy of the ventral visual pathway. Then, we correspond feature units to voxel units in the brain and successfully quantify the alignment between voxel responses and visual concepts. Finally, we conduct Network Dissection along the ventral visual pathway, including the fusiform face area (FFA), and discover variations related to the visual concept of 'person'. Our results demonstrate that the CNN-IF provides a new perspective for understanding encoding mechanisms in the human ventral visual pathway, and that the combination of ante-hoc interpretable structure and post-hoc interpretable approaches can achieve fine-grained voxel-wise correspondence between model and brain. The source code is available at: https://github.com/BIT-YangLab/CNN-IF.



Paperid:711
Authors:Guoli Yan, Zichun Zhong, Jing Hua
Department of Computer Science, Wayne State University, Department of Computer Science, Wayne State University, Department of Computer Science, Wayne State University
Abstract:
Despite achieving impressive improvement in accuracy, most existing monocular 3D human mesh reconstruction methods require large-scale 2D/3D ground-truths for supervision, which limits their application to the ubiquitous unlabeled in-the-wild data. To alleviate the reliance on 2D/3D ground-truths, we present a self-supervised 3D human pose and shape reconstruction framework that relies only on self-consistency between intermediate representations of images and projected 2D predictions. Specifically, we extract 2D joints and depth maps from monocular images as proxy inputs, which provide complementary clues to infer accurate 3D human meshes. Furthermore, to reduce the impact of noisy and ambiguous inputs while concentrating on high-quality information, we design an uncertainty-aware module to automatically learn the reliability of the inputs at the body-joint level based on the consistency between the 2D joints and the depth map. Experiments on benchmark datasets show that our approach outperforms other state-of-the-art methods at similar supervision levels.



Paperid:712
Authors:Kun Yan, Lei Ji, Chenfei Wu, Jian Liang, Ming Zhou, Nan Duan, Shuai Ma
SKLSDE Lab, Beihang University, Microsoft Research Asia, Microsoft Research Asia, Peking University, Langboat Technology, Microsoft Research Asia, SKLSDE Lab, Beihang University
Abstract:
Panorama synthesis endeavors to craft captivating 360-degree visual landscapes, immersing users in the heart of virtual worlds. Nevertheless, contemporary panoramic synthesis techniques grapple with the challenge of semantically guiding the content generation process. Although recent breakthroughs in visual synthesis have unlocked the potential for semantic control in 2D flat images, a direct application of these methods to panorama synthesis yields distorted content. In this study, we unveil an innovative framework for generating high-resolution panoramas, adeptly addressing the issues of spherical distortion and edge discontinuity through sophisticated spherical modeling. Our pioneering approach empowers users with semantic control, harnessing both image and text inputs, while concurrently streamlining the generation of high-resolution panoramas using parallel decoding. We rigorously evaluate our methodology on a diverse array of indoor and outdoor datasets, establishing its superiority over recent related work in terms of both quantitative and qualitative performance metrics. Our research elevates the controllability, efficiency, and fidelity of panorama synthesis to new levels.



Paperid:713
Authors:Qingsong Yan, Qiang Wang, Kaiyong Zhao, Jie Chen, Bo Li, Xiaowen Chu, Fei Deng
Wuhan University, Harbin Institute of Technology (Shenzhen), XGRIDS, Hong Kong Baptist University, The Hong Kong University of Science and Technology, The Hong Kong University of Science and Technology (Guangzhou) The Hong Kong University of Science and Technology, Wuhan University Hubei Luojia Laboratory
Abstract:
Neural Radiance Fields (NeRF) have demonstrated impressive performance in novel view synthesis. However, NeRF and most of its variants still rely on traditional complex pipelines, such as COLMAP, to provide extrinsic and intrinsic camera parameters. Recent works, like NeRFmm, BARF, and L2GNeRF, directly treat camera parameters as learnable and estimate them through differentiable volume rendering. However, these methods work for forward-looking scenes with slight motions and fail to tackle the rotation scenario in practice. To overcome this limitation, we propose a novel camera-parameter-free neural radiance field (CF-NeRF), which incrementally reconstructs 3D representations and recovers the camera parameters, inspired by incremental structure from motion. Given a sequence of images, CF-NeRF estimates the camera parameters of images one by one and reconstructs the scene through initialization, implicit localization, and implicit optimization. To evaluate our method, we use a challenging real-world dataset, NeRFBuster, which provides 12 scenes under complex trajectories. Results demonstrate that CF-NeRF is robust to rotation and achieves state-of-the-art results without requiring prior information or constraints.



Paperid:714
Authors:Shilin Yan, Renrui Zhang, Ziyu Guo, Wenchao Chen, Wei Zhang, Hongyang Li, Yu Qiao, Hao Dong, Zhongjiang He, Peng Gao
Fudan University, Shanghai Artificial Intelligence Laboratory The Chinese University of Hong Kong, The Chinese University of Hong Kong, Fudan University, Fudan University, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Peking University, China Telecom, Shanghai Artificial Intelligence Laboratory
Abstract:
Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has evoked increasing attention in both industry and academia. It is challenging to explore the semantic alignment between modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio references. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. First, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Second, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets with respective text and audio references, MUTR achieves +4.2% and +8.7% J&F improvements over state-of-the-art methods, demonstrating the significance of our unified multi-modal VOS. Code is released at https://github.com/OpenGVLab/MUTR.



Paperid:715
Authors:Bang Yang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, Yuexian Zou
Peking University Pengcheng Laboratory, Pengcheng Laboratory, Peking University, Peking University Pengcheng Laboratory, Peking University, Peking University
Abstract:
While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery of a few languages like English restricts their applicability in broader communities. To this end, there is an increasing interest in developing multilingual VL models via a joint-learning setup, which, however, could be unrealistic due to expensive costs and data availability. In this work, we propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It solely trains token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF caused by covariate shift and lexical overlap, we further propose a novel approach that ensures an identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on the MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and improve various state-of-the-art methods consistently. Our code and data are available at https://github.com/yangbang18/CLFM.



Paperid:716
Authors:Fan Yang, Hui Chen, Yuwei He, Sicheng Zhao, Chenghao Zhang, Kai Ni, Guiguang Ding
Tsinghua University BNRist Hangzhou Zhuoxi Institute of Brain and Intelligence, Tsinghua University BNRist, Tsinghua University BNRist, Tsinghua University BNRist, Tsinghua University BNRist, HoloMatic Technology, Tsinghua University BNRist
Abstract:
Monocular 3D object detection (M3OD) is important for autonomous driving. However, existing deep learning-based methods easily suffer from performance degradation in real-world scenarios due to the substantial domain gap between training and testing. M3OD's domain gaps are complex, including camera intrinsic parameters, extrinsic parameters, image appearance, etc. Existing works primarily focus on the domain gaps of camera intrinsic parameters, ignoring other key factors. Moreover, at the feature level, conventional domain-invariant learning methods generally cause the negative transfer issue, because they ignore the dependency between geometry tasks and domains. To tackle these issues, in this paper, we propose MonoGDG, a geometry-guided domain generalization framework for M3OD, which effectively addresses the domain gap at both the camera and feature levels. Specifically, MonoGDG consists of two major components. One is geometry-based image reprojection, which mitigates the impact of camera discrepancy by unifying intrinsic parameters, randomizing camera orientations, and unifying the field-of-view range. The other is geometry-dependent feature disentanglement, which overcomes the negative transfer problem by incorporating domain-shared and domain-specific features. Additionally, we leverage a depth-disentangled domain discriminator and a domain-aware geometry regression attention mechanism to account for the geometry-domain dependency. Extensive experiments on multiple autonomous driving benchmarks demonstrate that our method achieves state-of-the-art performance in domain generalization for M3OD.



Paperid:717
Authors:Fengxiang Yang, Zhun Zhong, Zhiming Luo, Yifan He, Shaozi Li, Nicu Sebe
Department of Artificial Intelligence, Xiamen University, China Department of Information Engineering and Computer Science, University of Trento, Italy, School of Computer Science, University of Nottingham, UK, Department of Artificial Intelligence, Xiamen University, China, Reconova Technologies Co., Ltd., China, Department of Artificial Intelligence, Xiamen University, China, Department of Information Engineering and Computer Science, University of Trento, Italy
Abstract:
This paper tackles the problem of federated domain generalization in person re-identification (FedDG re-ID), aiming to learn a model generalizable to unseen domains with decentralized source domains. Previous methods mainly focus on preventing local overfitting. However, the direction of diversifying local data through stylization for model training is largely overlooked. This direction is popular in domain generalization but encounters two issues under the federated scenario: (1) Most stylization methods require the centralization of multiple domains to generate novel styles, which is not applicable under the decentralized constraint. (2) The authenticity of generated data cannot be ensured, especially given limited local data, which may impair the model optimization. To solve these two problems, we propose the Diversity-Authenticity Co-constrained Stylization (DACS), which can generate diverse and authentic data for learning a robust local model. Specifically, we deploy a style transformation model on each domain to generate novel data with two constraints: (1) A diversity constraint is designed to increase data diversity, which enlarges the Wasserstein distance between the original and transformed data; (2) An authenticity constraint is proposed to ensure data authenticity, which enforces the transformed data to be easily/hardly recognized by the local-side global/local model. Extensive experiments demonstrate the effectiveness of the proposed DACS and show that DACS achieves state-of-the-art performance for FedDG re-ID.



Paperid:718
Authors:Guo-Ye Yang, George Kiyohiro Nakayama, Zi-Kai Xiao, Tai-Jiang Mu, Xiaolei Huang, Shi-Min Hu
BNRist Department of Computer Science and Technology, Tsinghua University, Stanford University, BNRist Department of Computer Science and Technology, Tsinghua University, BNRist Department of Computer Science and Technology, Tsinghua University, College of Information Sciences and Technology, Pennsylvania State University, BNRist Department of Computer Science and Technology, Tsinghua University
Abstract:
Great progress has been made in learning-based object detection methods in the last decade. Two-stage detectors often have higher detection accuracy than one-stage detectors, due to the use of region of interest (RoI) feature extractors which extract transformation-invariant RoI features for different RoI proposals, making refinement of bounding boxes and prediction of object categories more robust and accurate. However, previous RoI feature extractors can only extract invariant features under limited transformations. In this paper, we propose a novel RoI feature extractor, termed Semantic RoI Align (SRA), which is capable of extracting invariant RoI features under a variety of transformations for two-stage detectors. Specifically, we propose a semantic attention module to adaptively determine different sampling areas by leveraging the global and local semantic relationship within the RoI. We also propose a Dynamic Feature Sampler which dynamically samples features based on the RoI aspect ratio to enhance the efficiency of SRA, and a new position embedding, i.e., Area Embedding, to provide more accurate position information for SRA through an improved sampling area representation. Experiments show that our model significantly outperforms baseline models with slight computational overhead. In addition, it shows excellent generalization ability and can be used to improve performance with various state-of-the-art backbones and detection methods. The code is available at https://github.com/cxjyxxme/SemanticRoIAlign.



Paperid:719
Authors:Hunmin Yang, Jongoh Jeong, Kuk-Jin Yoon
KAIST ADD, KAIST, KAIST
Abstract:
Deep neural networks are known to be vulnerable to security risks due to the inherent transferable nature of adversarial examples. Despite the success of recent generative model-based attacks demonstrating strong transferability, it still remains a challenge to design an efficient attack strategy in a real-world strict black-box setting, where both the target domain and model architectures are unknown. In this paper, we seek to explore a feature contrastive approach in the frequency domain to generate adversarial examples that are robust in both cross-domain and cross-model settings. With that goal in mind, we propose two modules that are only employed during the training phase: a Frequency-Aware Domain Randomization (FADR) module to randomize domain-variant low- and high-range frequency components and a Frequency-Augmented Contrastive Learning (FACL) module to effectively separate the domain-invariant mid-frequency features of clean and perturbed images. We demonstrate the strong transferability of our generated adversarial perturbations through extensive cross-domain and cross-model experiments, while keeping the inference-time complexity unchanged.
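
A rough sketch of the frequency-band randomization idea behind the FADR module is shown below: the low- and high-frequency bands of an image (treated as domain-variant) are rescaled by a random factor while the mid band is left untouched. The band edges, the jitter law, and the function name are assumptions made for illustration, not the authors' settings.

```python
import torch

def randomize_low_high_freq(img: torch.Tensor, low: float = 0.1, high: float = 0.4,
                            jitter: float = 0.5) -> torch.Tensor:
    """Illustrative frequency-band randomization in the spirit of FADR.

    img: (B, C, H, W). Frequencies with normalized radius < `low` or > `high` are
    rescaled by a random per-channel factor; the mid band is kept as-is.
    """
    B, C, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy = torch.linspace(-0.5, 0.5, H).view(H, 1).expand(H, W)
    xx = torch.linspace(-0.5, 0.5, W).view(1, W).expand(H, W)
    radius = torch.sqrt(xx ** 2 + yy ** 2)                     # normalized frequency radius
    band = (radius < low) | (radius > high)                     # domain-variant bands
    scale = 1.0 + jitter * (2 * torch.rand(B, C, 1, 1) - 1)     # random rescaling factor
    spec = torch.where(band, spec * scale, spec)
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
```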



Paperid:720
Authors:Mingzhan Yang, Guangxin Han, Bin Yan, Wenhua Zhang, Jinqing Qi, Huchuan Lu, Dong Wang
Dalian University of Technology Shenzhen Tvt Digital Technology Co., Ltd, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology
Abstract:
Multi-Object Tracking (MOT) aims to detect and associate all desired objects across frames. Most methods accomplish the task by explicitly or implicitly leveraging strong cues (i.e., spatial and appearance information), which exhibit powerful instance-level discrimination. However, when object occlusion and clustering occur, spatial and appearance information become ambiguous simultaneously due to the high overlap among objects. In this paper, we demonstrate that this long-standing challenge in MOT can be efficiently and effectively resolved by incorporating weak cues to compensate for strong cues. Along with velocity direction, we introduce the confidence and height state as potential weak cues. With superior performance, our method still maintains Simple, Online and Real-Time (SORT) characteristics. Also, our method shows strong generalization for diverse trackers and scenarios in a plug-and-play and training-free manner. Significant and consistent improvements are observed when applying our method to 5 different representative trackers. Further, with both strong and weak cues, our method Hybrid-SORT achieves superior performance on diverse benchmarks, including MOT17, MOT20, and especially DanceTrack, where interaction and severe occlusion frequently happen with complex motions. The code and models are available at https://github.com/ymzis69/HybridSORT.
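
The sketch below illustrates how weak cues such as detection confidence and box height could be blended with a strong IoU cue into a single association cost before Hungarian matching; the exact terms and weights used by Hybrid-SORT may differ, so treat this as an interpretation of the abstract.

```python
import numpy as np

def hybrid_cost(iou, det_conf, trk_conf, det_h, trk_h, w_conf=0.2, w_h=0.2):
    """Toy association cost mixing a strong cue (IoU) with weak cues.

    iou: (T, D) IoU between tracks and detections; det_conf/trk_conf and det_h/trk_h
    are detection/track confidences and box heights. Weights are illustrative only.
    """
    conf_sim = 1.0 - np.abs(trk_conf[:, None] - det_conf[None, :])          # confidence-state cue
    h_sim = np.minimum(trk_h[:, None], det_h[None, :]) / np.maximum(
        trk_h[:, None], det_h[None, :])                                      # height-state cue
    # lower cost = better match; feed to e.g. scipy.optimize.linear_sum_assignment
    return -(iou + w_conf * conf_sim + w_h * h_sim)
```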



Paperid:721
Authors:Shuo Yang, Yongqi Wang, Xiaofeng Ji, Xinxiao Wu
Shenzhen MSU-BIT University Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology Shenzhen MSU-BIT University
Abstract:
Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progress in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models like CLIP, has shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between images and video object relationships. To address this challenge, we propose a multi-modal prompting method that adapts CLIP well to open-vocabulary video visual relationship detection by prompt-tuning on both the visual representation and the language input. Specifically, we enhance the image encoder of CLIP by using spatio-temporal visual prompting to capture spatio-temporal contexts, thereby making it suitable for object-level relationship representation in videos. Furthermore, we propose visual-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating the recognition of novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, especially achieving a significant gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset.



Paperid:722
Authors:Songlin Yang, Wei Wang, Yushi Lan, Xiangyu Fan, Bo Peng, Lei Yang, Jing Dong
School of Artificial Intelligence, University of Chinese Academy of Sciences, China CRIPAC&MAIS, Institute of Automation, Chinese Academy of Sciences, China, CRIPAC&MAIS, Institute of Automation, Chinese Academy of Sciences, China, Nanyang Technological University, Singapore, SenseTime, China, CRIPAC&MAIS, Institute of Automation, Chinese Academy of Sciences, China, SenseTime, China, CRIPAC&MAIS, Institute of Automation, Chinese Academy of Sciences, China
Abstract:
Face reenactment is challenging due to the need to establish dense correspondence between various face representations for motion transfer. Recent studies have utilized the Neural Radiance Field (NeRF) as the fundamental representation, which further enhanced the performance of multi-view face reenactment in photo-realism and 3D consistency. However, establishing dense correspondence between different face NeRFs is non-trivial, because implicit representations lack ground-truth correspondence annotations like mesh-based 3D parametric models (e.g., 3DMM) with index-aligned vertices. Although aligning the 3DMM space with NeRF-based face representations can realize motion control, it is sub-optimal due to their limited face-only modeling and low identity fidelity. Therefore, we are inspired to ask: Can we learn the dense correspondence between different NeRF-based face representations without a 3D parametric model prior? To address this challenge, we propose a novel framework, which adopts tri-planes as the fundamental NeRF representation and decomposes face tri-planes into three components: canonical tri-planes, identity deformations, and motion. In terms of motion control, our key contribution is a Plane Dictionary (PlaneDict) module, which efficiently maps motion conditions to a linear weighted addition of learnable orthogonal plane bases. To the best of our knowledge, our framework is the first method that achieves one-shot multi-view face reenactment without a 3D parametric model prior. Extensive experiments demonstrate that we produce better results in fine-grained motion control and identity preservation than previous methods.



Paperid:723
Authors:Wen Yang, Jinjian Wu, Jupo Ma, Leida Li, Guangming Shi
Xidian University Pazhou Lab, Huangpu, Xidian University Pazhou Lab, Huangpu, Xidian University Pazhou Lab, Huangpu, Xidian University, Xidian University Pazhou Lab, Huangpu
Abstract:
Motion deblurring can be advanced by exploiting informative features from supplementary sensors such as event cameras, which can capture rich motion information asynchronously with high temporal resolution. Existing event-based motion deblurring methods consider neither the modality redundancy in spatial fusion nor the temporal cooperation between events and frames. To tackle these limitations, a novel spatial-temporal collaboration network (STCNet) is proposed for event-based motion deblurring. First, we propose a differential-modality-based cross-modal calibration strategy to suppress redundancy for complementarity enhancement, and then bimodal spatial fusion is achieved with an elaborate cross-modal co-attention mechanism that weights their contributions for importance balance. Besides, we present a frame-event mutual spatio-temporal attention scheme to alleviate the errors of relying only on frames to compute cross-temporal similarities when the motion blur is significant, and then the spatio-temporal features from both frames and events are aggregated with the custom cross-temporal coordinate attention. Extensive experiments on both synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance. Project website: https://github.com/wyang-vis/STCNet.



Paperid:724
Authors:Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang
University of Technology, Sydney, Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) The visual encoder can only encode frame-level features and fails to extract global-level general video information. (2) Equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuning methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL.



Paperid:725
Authors:Xiaofeng Yang, Fayao Liu, Yi Xu, Hanjing Su, Qingyao Wu, Guosheng Lin
Nanyang Technological University, Singapore, Institute for Infocomm Research, A*STAR, Singapore, OPPO US Research Center, USA, Tencent, China, South China University of Technology, China, Nanyang Technological University, Singapore
Abstract:
In recent years, following the success of text-guided image generation, text-guided 3D generation has gained increasing attention among researchers. Dreamfusion is a notable approach that enhances generation quality by utilizing 2D text-guided diffusion models and introducing SDS loss, a technique for distilling 2D diffusion model information to train 3D models. However, the SDS loss has two major limitations that hinder its effectiveness. Firstly, when given a text prompt, the SDS loss struggles to produce diverse content. Secondly, during training, SDS loss may cause the generated content to overfit and collapse, limiting the model's ability to learn intricate texture details. To overcome these challenges, we propose a novel approach called the Noise Recalibration algorithm. By incorporating this technique, we can generate 3D content with significantly greater diversity and stunning details. Our approach offers a promising solution to the limitations of SDS loss.



Paperid:726
Authors:Xin Yang, Wending Yan, Yuan Yuan, Michael Bi Mi, Robby T. Tan
National University of Singapore, Huawei International Pte Ltd, Huawei International Pte Ltd, Huawei International Pte Ltd, National University of Singapore
Abstract:
Semantic segmentation's performance is often compromised when applied to unlabeled adverse weather conditions. Unsupervised domain adaptation is a potential approach to enhancing the model's adaptability and robustness to adverse weather. However, existing methods encounter difficulties when sequentially adapting the model to multiple unlabeled adverse weather conditions. They struggle to acquire new knowledge while also retaining previously learned knowledge. To address these problems, we propose a semantic segmentation method for multiple adverse weather conditions that incorporates adaptive knowledge acquisition, pseudo-label blending, and weather composition replay. Our adaptive knowledge acquisition enables the model to avoid learning from extreme images that could potentially cause the model to forget. In our approach of blending pseudo-labels, we not only utilize the current model but also integrate the previously learned model into the ongoing learning process. This collaboration between the current teacher and the previous model enhances the robustness of the pseudo-labels for the current target. Our weather composition replay mechanism allows the model to continuously refine its previously learned weather information while simultaneously learning from the new target domain. Our method consistently outperforms the state-of-the-art methods, obtaining the best performance with an averaged mIoU of 65.7% and the lowest forgetting of 3.6%, against 60.1% and 11.3%, respectively, on the ACDC dataset for four-target continual multi-target domain adaptation.
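
A minimal sketch of pseudo-label blending between the current teacher and the previously learned model is given below; the blending weight, confidence threshold, and ignore index are illustrative assumptions rather than the paper's reported settings.

```python
import torch

def blend_pseudo_labels(logits_teacher, logits_prev, lam=0.5, conf_thr=0.9, ignore_index=255):
    """Blend per-pixel class probabilities from the current teacher and the previous model.

    logits_teacher, logits_prev: (B, C, H, W) segmentation logits for the target image.
    Pixels whose blended confidence falls below `conf_thr` are marked as ignore.
    """
    probs = lam * logits_teacher.softmax(dim=1) + (1.0 - lam) * logits_prev.softmax(dim=1)
    conf, label = probs.max(dim=1)              # (B, H, W) confidence and hard pseudo-label
    label[conf < conf_thr] = ignore_index       # drop unreliable pixels from the loss
    return label
```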



Paperid:727
Authors:Xingxing Yang, Jie Chen, Zaifeng Yang
Hong Kong Baptist University, Hong Kong Baptist University, A*STAR Singapore
Abstract:
Existing learning-based hyperspectral reconstruction methods show limitations in fully exploiting the information among the hyperspectral bands. As such, we propose to investigate the chromatic inter-dependencies in their respective hyperspectral embedding space. These embedded features can be fully exploited by querying the inter-channel correlations in a combinatorial manner, with the unique and complementary information efficiently fused into the final prediction. We found such independent modeling and combinatorial excavation mechanisms are extremely beneficial to uncover marginal spectral features, especially in the long wavelength bands. In addition, we have proposed a spatio-spectral attention block and a spectrum-fusion attention module, which greatly facilitates the excavation and fusion of information at both semantically long-range levels and fine-grained pixel levels across all dimensions. Extensive quantitative and qualitative experiments show that our method (dubbed CESST) achieves SOTA performance. Code for this project is at: https://github.com/AlexYangxx/CESST.



Paperid:728
Authors:Xingyu Yang, Daqing Liu, Heng Zhang, Yong Luo, Chaoyue Wang, Jing Zhang
Wuhan University Hubei Luojia Laboratory, JD.com, Renmin University of China, Wuhan University Hubei Luojia Laboratory, JD.com, The University of Sydney
Abstract:
Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image. However, most existing methods focus on the composition learning of text and reference images and oversimplify the text as a description, neglecting the inherent structure and the user's shifting intention of the texts. As a result, these methods typically take shortcuts that disregard the visual cue of the reference images. To address this issue, we reconsider the text as instructions and propose a Semantic Shift Network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image. Specifically, SSN explicitly decomposes the instructions into two components: degradation and upgradation, where the degradation is used to picture the visual prototype from the reference image, while the upgradation is used to enrich the visual prototype into the final representations to retrieve the desired target image. The experimental results show that the proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes a new state-of-the-art performance. The code is available at https://github.com/starxing-yuu/SSN.



Paperid:729
Authors:Yaokun Yang, Yihan Yin, Feng Lu
Beihang University, Beihang University, Beihang University
Abstract:
Despite achieving impressive performance, current methods for detecting gaze targets, which depend on visual saliency and spatial scene geometry, continue to face challenges when it comes to detecting gaze targets within intricate image backgrounds. One of the primary reasons for this lies in the oversight of the intricate connection between human attention and activity cues. In this study, we introduce an innovative approach that amalgamates visual saliency detection with body-part and object interaction, both guided by soft gaze attention. This fusion enables precise and dependable detection of gaze targets amidst intricate image backgrounds. Our approach attains state-of-the-art performance on both the GazeFollow benchmark and the GazeVideoAttn benchmark. In comparison to recent methods that rely on intricate 3D reconstruction of a single input image, our approach, which solely leverages 2D image information, still exhibits a substantial lead across all evaluation metrics, positioning it closer to human-level performance. These outcomes underscore the potent effectiveness of our proposed method in the gaze target detection task.



Paperid:730
Authors:Yiying Yang, Fukun Yin, Wen Liu, Jiayuan Fan, Xin Chen, Gang Yu, Tao Chen
Fudan University, Fudan University, Tencent, Fudan University, Tencent, Tencent, Fudan University
Abstract:
Recent advancements in implicit neural representations have contributed to high-fidelity surface reconstruction and photorealistic novel view synthesis. However, with the expansion of the scene scale, such as block or city level, existing methods will encounter challenges because traditional sampling cannot cope with the cubically growing sampling space. To alleviate the dependence on filling the sampling space, we explore using multi-modal priors to assist individual points in obtaining more global semantic information and propose a prior-rich multi-modal implicit neural representation network, Pm-INR, for outdoor unbounded large-scale scenes. The core of our method is the multi-modal prior extraction and cross-modal prior fusion modules. The former encodes codebooks from different modality inputs and extracts valuable priors, while the latter fuses priors to maintain view consistency and preserve unique features among multi-modal priors. Finally, feature-rich cross-modal priors are injected into the sampling regions to allow each region to perceive global information without filling the sampling space. Extensive experiments have demonstrated the effectiveness and robustness of our method for outdoor unbounded large-scale scene novel view synthesis, which outperforms state-of-the-art methods in terms of PSNR, SSIM, and LPIPS.



Paperid:731
Authors:Zhenhua Yang, Dezhi Peng, Yuxin Kong, Yuyi Zhang, Cong Yao, Lianwen Jin
South China University of Technology, South China University of Technology, South China University of Technology, South China University of Technology, Alibaba Group, South China University of Technology SCUT-Zhuhai Institute of Modern Industrial Innovation
Abstract:
Automatic font generation is an imitation task, which aims to create a font library that mimics the style of reference images while preserving the content from source images. Although existing font generation methods have achieved satisfactory performance, they still struggle with complex characters and large style variations. To address these issues, we propose FontDiffuser, a diffusion-based image-to-image one-shot font generation method, which innovatively models the font imitation task as a noise-to-denoise paradigm. In our method, we introduce a Multi-scale Content Aggregation (MCA) block, which effectively combines global and local content cues across different scales, leading to enhanced preservation of intricate strokes of complex characters. Moreover, to better manage the large variations in style transfer, we propose a Style Contrastive Refinement (SCR) module, which is a novel structure for style representation learning. It utilizes a style extractor to disentangle styles from images, subsequently supervising the diffusion model via a meticulously designed style contrastive loss. Extensive experiments demonstrate FontDiffuser's state-of-the-art performance in generating diverse characters and styles. It consistently excels on complex characters and large style changes compared to previous methods. The code is available at https://github.com/yeungchenwa/FontDiffuser.



Paperid:732
Authors:Feiyu Yao, Zongkai Wu, Li Yi
2012 Lab, Huawei Technologies Co., Ltd, Fancy Technology, Tsinghua University Shanghai Artificial Intelligence Laboratory Shanghai Qi Zhi Institute
Abstract:
Estimating 3D full-body pose from sparse sensor data is a pivotal technique employed for the reconstruction of realistic human motions in Augmented Reality and Virtual Reality. However, translating sparse sensor signals into comprehensive human motion remains a challenge, since the sparsely distributed sensors in common VR systems fail to capture the motion of the full human body. In this paper, we use a well-designed Body Pose Graph (BPG) to represent the human body and translate the challenge into a prediction problem of graph missing nodes. Then, we propose a novel full-body motion reconstruction framework based on the BPG. To establish the BPG, nodes are initially endowed with features extracted from sparse sensor signals. Features from identifiable joint nodes across diverse sensors are amalgamated and processed from both temporal and spatial perspectives. Temporal dynamics are captured using the Temporal Pyramid Structure, while spatial relations in joint movements inform the spatial attributes. The resultant features serve as the foundational elements of the BPG nodes. To further refine the BPG, node features are updated through a graph neural network that incorporates edges reflecting varying joint relations. Our method's effectiveness is evidenced by the attained state-of-the-art performance, particularly in lower-body motion, outperforming other baseline methods. Additionally, an ablation study validates the efficacy of each module in our proposed framework.



Paperid:733
Authors:Lujian Yao, Haitao Zhao, Jingchao Peng, Zhongze Wang, Kaijie Zhao
East China University of Science and Technology, East China University of Science and Technology, East China University of Science and Technology, East China University of Science and Technology, East China University of Science and Technology
Abstract:
Early smoke segmentation (ESS) enables the accurate identification of smoke sources, facilitating the prompt extinguishing of fires and preventing large-scale gas leaks. However, ESS poses greater challenges than conventional object and regular smoke segmentation due to its small scale and transparent appearance, which can result in a high miss-detection rate and low precision. To address these issues, a Focus and Separation Network (FoSp) is proposed. We first introduce a Focus module employing a bidirectional cascade, which guides low-resolution and high-resolution features towards mid-resolution to locate and determine the scope of smoke, reducing the miss-detection rate. Next, we propose a Separation module that separates smoke images into a pure smoke foreground and a smoke-free background, fundamentally enhancing the contrast between smoke and background and improving segmentation precision. Finally, a Domain Fusion module is developed to integrate the distinctive features of the two modules, which can balance recall and precision to achieve a high F_beta. Furthermore, to promote the development of ESS, we introduce a high-quality real-world dataset called SmokeSeg, which contains more small and transparent smoke than existing datasets. Experimental results show that our model achieves the best performance on three available smoke segmentation datasets: SYN70K (mIoU: 83.00%), SMOKE5K (F_beta: 81.6%) and SmokeSeg (F_beta: 72.05%). The code can be found at https://github.com/LujianYao/FoSp.



Paperid:734
Authors:Yiyang Yao, Peng Liu, Tiancheng Zhao, Qianqian Zhang, Jiajia Liao, Chunxin Fang, Kyusong Lee, Qing Wang
Northwestern Polytechnical University, Linker Technology Research Co. Ltd, Binjiang Institute of Zhejiang University, Linker Technology Research Co. Ltd, Binjiang Institute of Zhejiang University, Binjiang Institute of Zhejiang University, Binjiang Institute of Zhejiang University, Northwestern Polytechnical University
Abstract:
Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets are limited to testing generalization over object types and referral expressions, which do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric called Non-Maximum Suppression Average Precision (NMS-AP) to address this issue. Extensive experimental results show that existing top OVD models all fail on the new tasks except for simple object types, demonstrating the value of the proposed dataset in pinpointing the weakness of current OVD models and guiding future research. Furthermore, the proposed NMS-AP metric is verified by experiments to provide a much more truthful evaluation of OVD models, whereas traditional AP metrics yield deceptive results. Data is available at https://github.com/om-ai-lab/OVDEval.
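
The abstract does not spell out NMS-AP, but a plausible reading is that near-duplicate boxes carrying different fine-grained labels are suppressed class-agnostically before standard AP is computed, so a detector cannot score well by emitting the same box under every candidate label. The helper below sketches that preprocessing step under this assumption; it is not the benchmark's reference implementation.

```python
import torch
from torchvision.ops import nms

def class_agnostic_nms(boxes, scores, labels, iou_thr=0.5):
    """Keep only the highest-scoring box among near-duplicates, ignoring labels.

    boxes: (N, 4), scores: (N,), labels: (N,). After this filtering, AP can be
    computed as usual on the surviving predictions.
    """
    keep = nms(boxes, scores, iou_thr)      # intentionally label-agnostic
    return boxes[keep], scores[keep], labels[keep]
```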



Paperid:735
Authors:Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi
The Hebrew University of Jerusalem NetApp, Technion, The Hebrew University of Jerusalem, Tel Aviv University, Tel Aviv University NetApp, The Hebrew University of Jerusalem
Abstract:
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and the temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/.
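
Based only on the description that AV-Align compares energy peaks across the two modalities, a rough reimplementation could look like the following, where a peak counts as aligned if the other modality has a peak within a small temporal window; the peak-detection settings and the symmetric averaging are assumptions, not the paper's definition.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_alignment_score(audio_energy, motion_energy, tol=2):
    """Rough AV-Align-style score: do energy peaks in audio and video co-occur?

    audio_energy, motion_energy: 1-D envelopes sampled at the same frame rate
    (e.g., per-frame RMS audio energy and mean absolute frame difference).
    """
    a_peaks, _ = find_peaks(audio_energy, prominence=np.std(audio_energy))
    v_peaks, _ = find_peaks(motion_energy, prominence=np.std(motion_energy))
    if len(a_peaks) == 0 or len(v_peaks) == 0:
        return 0.0
    hits_a = sum(np.min(np.abs(v_peaks - p)) <= tol for p in a_peaks)   # audio peaks matched by video
    hits_v = sum(np.min(np.abs(a_peaks - p)) <= tol for p in v_peaks)   # video peaks matched by audio
    return 0.5 * (hits_a / len(a_peaks) + hits_v / len(v_peaks))
```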



Paperid:736
Authors:Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu
Beijing University of Posts and Telecommunications Beijing Academy of Artificial Intelligence, Beijing Academy of Artificial Intelligence, Beijing Academy of Artificial Intelligence, Beijing Academy of Artificial Intelligence
Abstract:
Large Text-to-Image (T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works only support limited language input, e.g., English, Chinese, and Japanese, leaving users beyond these languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on knowledge distillation. Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability, including a concept alignment stage and a quality improvement stage on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes the Multilingual-General-18 (MG-18) and Multilingual-Cultural-18 (MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. All source code and checkpoints can be found at https://github.com/superhero-7/AltDiffuson.



Paperid:737
Authors:Jingwen Ye, Ruonan Yu, Songhua Liu, Xinchao Wang
National University of Singapore, National University of Singapore, National University of Singapore, National University of Singapore
Abstract:
Adversarial attacks constitute a notable threat to machine learning systems, given their potential to induce erroneous predictions and classifications. However, within real-world contexts, the essential specifics of the deployed model are frequently treated as a black box, consequently mitigating the vulnerability to such attacks. Thus, enhancing the transferability of the adversarial samples has become a crucial area of research, which heavily relies on selecting appropriate surrogate models. To address this challenge, we propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme. Our approach is accomplished by leveraging the pre-trained CLIP model. Firstly, we conduct a visual attack on the clean image that causes semantic perturbations in the embedding space aligned with the other, textual modality. Then, we apply the corresponding defense on the textual modality by updating the prompts, which forces the re-matching on the perturbed embedding space. Finally, to enhance the attack transferability, we utilize an iterative training strategy on the visual attack and the textual defense, where the two processes optimize from each other. We evaluate our approach on several benchmark datasets and demonstrate that our mutual-modal attack strategy can effectively produce highly transferable attacks, which are stable regardless of the target networks. Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.



Paperid:738
Authors:Xi Ye, Guillaume-Alexandre Bilodeau
Polytechnique Montréal, Polytechnique Montréal
Abstract:
Predicting future frames of a video is challenging because it is difficult to learn the uncertainty of the underlying factors influencing their contents. In this paper, we propose a novel video prediction model, which has infinite-dimensional latent variables over the spatio-temporal domain. Specifically, we first decompose the video motion and content information, then take a neural stochastic differential equation to predict the temporal motion information, and finally, an image diffusion model autoregressively generates the video frame by conditioning on the predicted motion feature and the previous frame. The better expressiveness and stronger stochasticity learning capability of our model lead to state-of-the-art video prediction performances. In addition, our model achieves temporally continuous prediction, i.e., predicting, in an unsupervised way, the future video frames at an arbitrarily high frame rate. Our code is available at https://github.com/XiYe20/STDiffProject.



Paperid:739
Authors:Yunfan Ye, Kai Xu, Yuhang Huang, Renjiao Yi, Zhiping Cai
Hunan University National University of Defense Technology, National University of Defense Technology, National University of Defense Technology, National University of Defense Technology, National University of Defense Technology
Abstract:
Limited by the encoder-decoder architecture, learning-based edge detectors usually have difficulty predicting edge maps that satisfy both correctness and crispness. With the recent success of the diffusion probabilistic model (DPM), we found it is especially suitable for accurate and crisp edge detection, since the denoising process is directly applied at the original image size. Therefore, we propose the first diffusion model for the task of general edge detection, which we call DiffusionEdge. To avoid expensive computational resources while retaining the final performance, we apply DPM in the latent space and enable the classic cross-entropy loss, which is uncertainty-aware at the pixel level, to directly optimize the parameters in latent space in a distillation manner. We also adopt a decoupled architecture to speed up the denoising process and propose a corresponding adaptive Fourier filter to adjust the latent features of specific frequencies. With all the technical designs, DiffusionEdge can be stably trained with limited resources, predicting crisp and accurate edge maps with much fewer augmentation strategies. Extensive experiments on four edge detection benchmarks demonstrate the superiority of DiffusionEdge in both correctness and crispness. On the NYUDv2 dataset, compared to the second best, we increase the ODS, OIS (without post-processing) and AC by 30.2%, 28.1% and 65.1%, respectively. Code: https://github.com/GuHuangAI/DiffusionEdge.



Paperid:740
Authors:YuTeng Ye, Hang Zhou, Jiale Cai, Chenxing Gao, Youjia Zhang, Junle Wang, Qiang Hu, Junqing Yu, Wei Yang
Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Tencent, Shanghai Jiao Tong University, Huazhong University of Science and Technology, Huazhong University of Science and Technology
Abstract:
Occluded person re-identification (Re-ID) is a challenging problem due to contamination from occluders. Existing approaches address the issue with prior knowledge cues, such as human body key points and semantic segmentations, which easily fail in the presence of heavy occlusion and other humans as occluders. In this paper, we propose a feature pruning and consolidation (FPC) framework to circumvent explicit human structure parsing. The framework mainly consists of a sparse encoder, a multi-view feature matching module, and a feature consolidation decoder. Specifically, the sparse encoder drops less important image tokens, mostly related to background noise and occluders, solely based on correlation within the class token attention. Subsequently, the matching stage relies on the preserved tokens produced by the sparse encoder to identify k-nearest neighbors in the gallery by measuring the image- and patch-level combined similarity. Finally, we use the feature consolidation module to compensate for pruned features using the identified neighbors, recovering essential information while disregarding disturbance from noise and occlusion. Experimental results demonstrate the effectiveness of our proposed framework on occluded, partial, and holistic Re-ID datasets. In particular, our method outperforms state-of-the-art results by at least 8.6% mAP and 6.0% Rank-1 accuracy on the challenging Occluded-Duke dataset.
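
The sparse encoder's token-dropping step, as described, can be approximated by keeping only the patch tokens that receive the most class-token attention; the sketch below shows this selection, with the keep ratio as an assumed hyperparameter rather than the paper's value.

```python
import torch

def prune_tokens_by_cls_attention(tokens, cls_attn, keep_ratio=0.7):
    """Attention-guided token pruning in the spirit of the sparse encoder.

    tokens: (B, N, D) patch tokens (class token excluded).
    cls_attn: (B, N) attention weights of the class token over the patch tokens,
    e.g. averaged over heads. Low-attention tokens (typically background or
    occluders) are dropped.
    """
    B, N, D = tokens.shape
    k = max(1, int(keep_ratio * N))
    idx = cls_attn.topk(k, dim=1).indices                    # (B, k) most-attended tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, D)                # (B, k, D) gather indices
    return torch.gather(tokens, 1, idx)
```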



Paperid:741
Authors:YuTeng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang
Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology
Abstract:
In spite of the rapidly evolving landscape of text-to-image generation, the synthesis and manipulation of multiple entities while adhering to specific relational constraints pose enduring challenges. This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image, ensuring their adherence to spatial and relational constraints at each sequential step. Our key insight stems from the observation that while a pre-trained text-to-image diffusion model adeptly handles one or two entities, it often falters when dealing with a greater number. To address this limitation, we propose harnessing the capabilities of a Large Language Model (LLM) to decompose intricate and protracted text descriptions into coherent directives adhering to stringent formats. To facilitate the execution of directives involving distinct semantic operations, namely insertion, editing, and erasing, we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, followed by the fusion of the responsive latent components to achieve cohesive entity manipulation. Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs. Consequently, it establishes a new benchmark for text-to-image generation tasks, further elevating the field's performance standards.



Paperid:742
Authors:Kefu Yi, Kai Luo, Xiaolei Luo, Jiangui Huang, Hao Wu, Rongdong Hu, Wei Hao
School of Traffic and Transportation, Changsha University of Science and Technology, College of Automotive and Mechanical Engineering, Changsha University of Science and Technology, College of Automotive and Mechanical Engineering, Changsha University of Science and Technology, College of Automotive and Mechanical Engineering, Changsha University of Science and Technology, College of Automotive and Mechanical Engineering, Changsha University of Science and Technology, Changsha Intelligent Driving Institute, School of Traffic and Transportation, Changsha University of Science and Technology
Abstract:
Multi-object tracking (MOT) in video sequences remains a challenging task, especially in scenarios with significant camera movements. This is because targets can drift considerably on the image plane, leading to erroneous tracking outcomes. Addressing such challenges typically requires supplementary appearance cues or Camera Motion Compensation (CMC). While these strategies are effective, they also introduce a considerable computational burden, posing challenges for real-time MOT. In response to this, we introduce UCMCTrack, a novel motion model-based tracker robust to camera movements. Unlike conventional CMC that computes compensation parameters frame-by-frame, UCMCTrack consistently applies the same compensation parameters throughout a video sequence. It employs a Kalman filter on the ground plane and introduces the Mapped Mahalanobis Distance (MMD) as an alternative to the traditional Intersection over Union (IoU) distance measure. By leveraging projected probability distributions on the ground plane, our approach efficiently captures motion patterns and adeptly manages uncertainties introduced by homography projections. Remarkably, UCMCTrack, relying solely on motion cues, achieves state-of-the-art performance across a variety of challenging datasets, including MOT17, MOT20, DanceTrack and KITTI. More details and code are available at https://github.com/corfyi/UCMCTrack.
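
A toy version of a ground-plane Mahalanobis association cost is shown below: the track's Kalman-predicted position and a detection's homography-projected foot point are compared under their combined covariance. The normalization terms and the exact covariance propagation used by UCMCTrack may differ; names here are illustrative. In this setting, det_xy would come from mapping the detection's image foot point (u, v) through the ground-plane homography, and det_cov from propagating the measurement noise through that same mapping.

```python
import numpy as np

def mapped_mahalanobis(track_mean, track_cov, det_xy, det_cov):
    """Squared Mahalanobis distance on the ground plane, usable as a matching cost.

    track_mean, track_cov: predicted ground-plane position (2,) and covariance (2, 2)
    from a Kalman filter. det_xy, det_cov: detection foot point projected onto the
    ground plane and the covariance of that projection.
    """
    diff = det_xy - track_mean
    cov = track_cov + det_cov          # combine prediction and projection uncertainty
    return float(diff @ np.linalg.inv(cov) @ diff)
```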



Paperid:743
Authors:Mingxin Yi, Kai Zhang, Pei Liu, Tanli Zuo, Jingduo Tian
Tsinghua Shenzhen International Graduate School, Tsinghua University, China, Tsinghua Shenzhen International Graduate School, Tsinghua University, China Research Institute of Tsinghua, Pearl River Delta, Media Technology Lab, Huawei, China, Media Technology Lab, Huawei, China, Media Technology Lab, Huawei, China
Abstract:
Deriving DSLR-quality sRGB images from smartphone RAW images has become a compelling challenge due to discernible detail disparity, color mapping instability, and spatial misalignment in RAW-sRGB data pairs. We present DiffRAW, a novel method that incorporates the diffusion model for the first time in learning RAW-to-sRGB mappings. By leveraging the diffusion model, our approach effectively learns the high-quality detail distribution of DSLR images, thereby enhancing the details of output images. Simultaneously, we use the RAW image as a diffusion condition to maintain image structure information such as contours and textures. To mitigate the interference caused by the color and spatial misalignment in training data pairs, we embed a color-position preserving condition within DiffRAW, ensuring that the output images do not exhibit color biases and pixel shift issues. To accelerate the inference process of DiffRAW, we designed the Domain Transform Diffusion Method, an efficient diffusion process with its corresponding reverse process. The Domain Transform Diffusion Method can reduce the required inference steps for diffusion model-based image restoration/enhancement algorithms while enhancing the quality of the generated images. Through evaluations on the ZRR dataset, DiffRAW consistently demonstrates state-of-the-art performance across all perceptual quality metrics (e.g., LPIPS, FID, MUSIQ), while achieving comparable results in PSNR and SSIM.



Paperid:744
Authors:Kai Yin, Jie Shen
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
The lookup table (LUT) has recently shown its practicability and effectiveness in super-resolution (SR) tasks due to its low computational cost and hardware independence. However, most existing methods focus on improving the performance of SR, neglecting the demand for high-speed SR on low-computational edge devices. In this paper, we propose an efficient expanded convolution (EC) layer, which expands the output size of regular convolution to enlarge the receptive field (RF) indirectly. It allows the size of the LUT corresponding to the network to grow only linearly as the RF increases. Additionally, after introducing the EC, multiple LUTs are merged into one LUT, achieving faster running speed while maintaining SR performance. More specifically, we expand the coverage of the convolutional output so that the output at the current position covers the target position and its surroundings, forming an overlapping sliding window at the output end. We sum up the overlapping parts of the sliding window as the output, thereby achieving the effect of enlarging the RF size. Moreover, by expanding the numerical range of the accumulated results and rescaling them to [0,255], the method can mitigate the error caused by quantizing the output. Experiments indicate that the proposed method performs better than the baseline method and is faster than other LUT-based SR methods.
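The overlapping-window accumulation can be illustrated with a small, hedged sketch (the (H, W, k, k) layout of the expanded outputs and the min-max rescaling to [0, 255] are assumptions made for illustration, not the paper's exact formulation):

```python
import numpy as np

def overlap_accumulate(expanded, k):
    """expanded: (H, W, k, k) array where expanded[i, j] is the k x k patch predicted
    around position (i, j). Overlapping patches are summed, which is what indirectly
    enlarges the effective receptive field of the merged LUT."""
    H, W = expanded.shape[:2]
    pad = k // 2
    acc = np.zeros((H + 2 * pad, W + 2 * pad), dtype=np.float64)
    for i in range(H):
        for j in range(W):
            acc[i:i + k, j:j + k] += expanded[i, j]
    acc = acc[pad:pad + H, pad:pad + W]
    # rescale the accumulated range back to [0, 255] to limit quantization error
    acc = (acc - acc.min()) / max(acc.max() - acc.min(), 1e-8) * 255.0
    return np.round(acc).astype(np.uint8)

out = overlap_accumulate(np.random.rand(8, 8, 3, 3), k=3)
print(out.shape)  # (8, 8)
```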



Paperid:745
Authors:Pengwei Yin, Guanzhong Zeng, Jingjing Wang, Di Xie
Hikvision Research Institute, Hikvision Research Institute, Hikvision Research Institute, Hikvision Research Institute
Abstract:
Gaze estimation methods often experience significant performance degradation when evaluated across different domains, due to the domain gap between the testing and training data. Existing methods try to address this issue using various domain generalization approaches, but with little success because of the limited diversity of gaze datasets, such as appearance, wearable, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge. Our framework is the first to leverage the vision-and-language cross-modality approach for the gaze estimation task. Specifically, we extract gaze-relevant features by pushing them away from gaze-irrelevant features which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context optimization method for text prompt tuning. Furthermore, we utilize the relationship among gaze samples to refine the distribution of gaze-relevant features, thereby improving the generalization capability of the gaze estimation model. Extensive experiments demonstrate the excellent performance of CLIP-Gaze over existing methods on four cross-domain evaluations.



Paperid:746
Authors:Xingyilang Yin, Xi Yang, Liangchen Liu, Nannan Wang, Xinbo Gao
Xidian University, Xidian University, Xidian University, Xidian University, Chongqing University of Posts and Telecommunications
Abstract:
Recently, MLP-based methods have shown strong performance in point cloud analysis. Simple MLP architectures are able to learn geometric features in local point groups yet fail to model long-range dependencies directly. In this paper, we propose Point Deformable Network (PDNet), a concise MLP-based network that can capture long-range relations with strong representation ability. Specifically, we put forward Point Deformable Aggregation Module (PDAM) to improve representation capability in both long-range dependency and adaptive aggregation among points. For each query point, PDAM aggregates information from deformable reference points rather than points in limited local areas. The deformable reference points are generated in a data-dependent manner, and we initialize them according to the input point positions. Additional offsets and modulation scalars are learned on the whole point features, which shift the deformable reference points to the regions of interest. We also suggest estimating the normal vector for point clouds and applying Enhanced Normal Embedding (ENE) to the geometric extractors to improve the representation ability of single points. Extensive experiments and ablation studies on various benchmarks demonstrate the effectiveness and superiority of our PDNet.



Paperid:747
Authors:Yufei Yin, Hao Chen, Wengang Zhou, Jiajun Deng, Haiming Xu, Houqiang Li
CAS Key Laboratory of Technology in GIPAS, EEIS Department, University of Science and Technology of China, Zhejiang University, CAS Key Laboratory of Technology in GIPAS, EEIS Department, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Australian Institute for Machine Learning, University of Adelaide, Australian Institute for Machine Learning, University of Adelaide, CAS Key Laboratory of Technology in GIPAS, EEIS Department, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract:
In this paper, we focus on the open-set panoptic segmentation (OPS) task to circumvent the data explosion problem. Different from the close-set setting, OPS aims to detect both known and unknown categories, where the latter is not annotated during training. Different from existing work that only selects a few common categories as unknown ones, we move forward to the real-world scenario by considering the various tail categories (~1k). To this end, we first build a new dataset with long-tail distribution for the OPS task. Based on this dataset, we additionally add a new class type for unknown classes and re-define the training annotations to make the OPS definition more complete and reasonable. Moreover, we analyze the influence of several significant factors in the OPS task and explore the upper bound of performance on unknown classes with different settings. Furthermore, based on the analyses, we design an effective two-phase framework for the OPS task, including thing-agnostic map generation and unknown segment mining. We further adopt semi-supervised learning to improve the OPS performance. Experimental results on different datasets validate the effectiveness of our method.



Paperid:748
Authors:Ziyi Yin, Muchao Ye, Tianrong Zhang, Jiaqi Wang, Han Liu, Jinghui Chen, Ting Wang, Fenglong Ma
The Pennsylvania State University, The Pennsylvania State University, The Pennsylvania State University, The Pennsylvania State University, Dalian University of Technology, The Pennsylvania State University, Stony Brook University, The Pennsylvania State University
Abstract:
Visual Question Answering (VQA) is a fundamental task in the computer vision and natural language processing fields. Although the “pre-training & fine-tuning” learning paradigm significantly improves the VQA performance, the adversarial robustness of such a learning paradigm has not been explored. In this paper, we delve into a new problem: using a pre-trained multimodal source model to create adversarial image-text pairs and then transferring them to attack the target VQA models. Correspondingly, we propose a novel VQATTACK model, which can iteratively generate both image and text perturbations with the designed modules: the large language model (LLM)-enhanced image attack and the cross-modal joint attack module. At each iteration, the LLM-enhanced image attack module first optimizes the latent representation-based loss to generate feature-level image perturbations. Then it incorporates an LLM to further enhance the image perturbations by optimizing the designed masked answer anti-recovery loss. The cross-modal joint attack module will be triggered at a specific iteration, which updates the image and text perturbations sequentially. Notably, the text perturbation updates are based on both the learned gradients in the word embedding space and word synonym-based substitution. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQATTACK in the transferable attack setting, compared with state-of-the-art baselines. This work reveals a significant blind spot in the “pre-training & fine-tuning” paradigm on VQA tasks. The source code can be found in the link https://github.com/ericyinyzy/VQAttack.



Paperid:749
Authors:Chenyang Yu, Xuehu Liu, Yingquan Wang, Pingping Zhang, Huchuan Lu
School of Information and Communication Engineering, Dalian University of Technology, Dalian, China, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China, School of Information and Communication Engineering, Dalian University of Technology, Dalian, China, School of Future Technology, School of Artificial Intelligence, Dalian University of Technology, Dalian, China, School of Information and Communication Engineering, Dalian University of Technology, Dalian, China School of Future Technology, School of Artificial Intelligence, Dalian University of Technology, Dalian, China Ningbo Institute, Dalian University of Technology, Ningbo, China
Abstract:
Large-scale language-image pre-trained models (e.g., CLIP) have shown superior performances on many cross-modal retrieval tasks. However, the problem of transferring the knowledge learned from such models to video-based person re-identification (ReID) has barely been explored. In addition, there is a lack of decent text descriptions in current ReID benchmarks. To address these issues, in this work, we propose a novel one-stage text-free CLIP-based learning framework named TF-CLIP for video-based person ReID. More specifically, we extract the identity-specific sequence feature as the CLIP-Memory to replace the text feature. Meanwhile, we design a Sequence-Specific Prompt (SSP) module to update the CLIP-Memory online. To capture temporal information, we further propose a Temporal Memory Diffusion (TMD) module, which consists of two key components: Temporal Memory Construction (TMC) and Memory Diffusion (MD). Technically, TMC allows the frame-level memories in a sequence to communicate with each other, and to extract temporal information based on the relations within the sequence. MD further diffuses the temporal memories to each token in the original features to obtain more robust sequence features. Extensive experiments demonstrate that our proposed method shows much better results than other state-of-the-art methods on MARS, LS-VID and iLIDS-VID.



Paperid:750
Authors:Hai-Tao Yu, Mofei Song
Southeast University, Southeast University
Abstract:
In perception, multiple sensory information is integrated to map visual information from 2D views onto 3D objects, which is beneficial for understanding in 3D environments. However, any single 2D view, rendered from one particular angle, provides only limited partial information. The richness of multi-view 2D information can provide superior self-supervised signals for 3D objects. In this paper, we propose a novel self-supervised point cloud representation learning method, MM-Point, which is driven by intra-modal and inter-modal similarity objectives. The core of MM-Point lies in the multi-modal interaction and transmission between 3D objects and multiple 2D views at the same time. To more effectively enforce the consistent cross-modal objective across 2D multi-view information with contrastive learning, we further propose Multi-MLP and Multi-level Augmentation strategies. Through carefully designed transformation strategies, we further learn multi-level invariance in 2D multi-views. MM-Point demonstrates state-of-the-art (SOTA) performance in various downstream tasks. For instance, it achieves a peak accuracy of 92.4% on the synthetic dataset ModelNet40, and a top accuracy of 87.8% on the real-world dataset ScanObjectNN, comparable to fully supervised methods. Additionally, we demonstrate its effectiveness in tasks such as few-shot classification, 3D part segmentation and 3D semantic segmentation.
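A minimal, hedged sketch of one way the cross-modal objective can be written (a generic multi-view InfoNCE between a 3D embedding and its rendered 2D views; the batch layout and temperature are assumptions, and the paper's Multi-MLP projection heads are omitted):

```python
import torch
import torch.nn.functional as F

def multiview_infonce(pc_emb, view_embs, temperature=0.07):
    """pc_emb: (B, D) 3D point-cloud embeddings; view_embs: (B, V, D) embeddings of V
    rendered 2D views per object. Each 3D embedding is pulled towards all of its own
    views and pushed away from views of other objects in the batch."""
    B, V, D = view_embs.shape
    pc = F.normalize(pc_emb, dim=-1)
    views = F.normalize(view_embs, dim=-1).reshape(B * V, D)
    logits = pc @ views.t() / temperature            # (B, B*V) similarity matrix
    loss = 0.0
    for v in range(V):                               # one positive per object and view
        loss = loss + F.cross_entropy(logits, torch.arange(B) * V + v)
    return loss / V

print(float(multiview_infonce(torch.randn(4, 64), torch.randn(4, 3, 64))))
```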



Paperid:751
Authors:Hongtian Yu, Yunjie Tian, Qixiang Ye, Yunfan Liu
University of Chinese Academy of Sciences, University of Chinese Academy of Sciences, University of Chinese Academy of Sciences, University of Chinese Academy of Sciences
Abstract:
Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios has not been fully explored, and this limitation may be inherently attributed to the lack of spatial invariance in the data-forwarding process. In this study, we present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs. Built upon stacked ViT blocks, STD utilizes separate network branches to predict the position, size, and angle of bounding boxes, effectively harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. Moreover, by aggregating cascaded activation masks (CAMs) computed upon the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on the benchmark datasets including DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), which demonstrates the effectiveness of the proposed method. Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling.



Paperid:752
Authors:Hongwei Yu, Jiansheng Chen, Xinlong Ding, Yudong Zhang, Ting Tang, Huimin Ma
University of Science and Technology Beijing, University of Science and Technology Beijing, University of Science and Technology Beijing, Tsinghua University, University of Science and Technology Beijing, University of Science and Technology Beijing
Abstract:
The high-quality generation results of conditional diffusion models have brought about concerns regarding privacy and copyright issues. As a possible technique for preventing the abuse of diffusion models, the adversarial attack against diffusion models has attracted academic attention recently. In this work, utilizing the phenomenon that diffusion models are highly sensitive to the mean value of the input noise, we propose the Mean Fluctuation Attack (MFA) to introduce mean fluctuations by shifting the mean values of the estimated noises during the reverse process. In addition, we reveal that the vulnerability of different reverse steps against adversarial attacks actually varies significantly. By modeling the step vulnerability and using it as guidance to sample the target steps for generating adversarial examples, the effectiveness of adversarial attacks can be substantially enhanced. Extensive experiments show that our algorithm can steadily cause the mean shift of the predicted noises so as to disrupt the entire reverse generation process and degrade the generation results significantly. We also demonstrate that the step vulnerability is intrinsic to the reverse process by verifying its effectiveness in an attack method other than MFA. Code and supplementary material are available at https://github.com/yuhongwei22/MFA
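To illustrate how a mean shift in the estimated noise propagates through the reverse process, here is a hedged sketch of a single DDPM ancestral step with an added constant offset (the schedule values and the additive form of the shift are illustrative assumptions; the paper's step-vulnerability-guided sampling is not shown):

```python
import torch

@torch.no_grad()
def ddpm_reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, mean_shift=0.0):
    """One DDPM reverse step. `mean_shift` adds a constant offset to the predicted
    noise, mimicking the mean fluctuation the attack injects; with shift = 0 this is
    the standard ancestral sampling update."""
    eps = eps_pred + mean_shift                           # shift the mean of the estimated noise
    mean = (x_t - (1 - alpha_t) / (1 - alpha_bar_t).sqrt() * eps) / alpha_t.sqrt()
    return mean + sigma_t * torch.randn_like(x_t)

# toy usage with made-up schedule values
x_t = torch.randn(1, 3, 8, 8)
x_prev = ddpm_reverse_step(x_t, torch.randn_like(x_t),
                           alpha_t=torch.tensor(0.99),
                           alpha_bar_t=torch.tensor(0.5),
                           sigma_t=torch.tensor(0.05),
                           mean_shift=0.3)
print(x_prev.shape)
```

Because the same biased mean is re-used at every step, small per-step offsets compound across the whole reverse trajectory, which is the intuition behind the degraded generation results.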



Paperid:753
Authors:Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, Wayne Wu
University of Sydney, Shanghai AI Laboratory, Nanyang Technological University, Nanyang Technological University, University of Sydney, Shanghai AI Laboratory
Abstract:
Recent advances in zero-shot text-to-3D human generation, which employ the human model prior (e.g., SMPL) or Score Distillation Sampling (SDS) with pre-trained text-to-image diffusion models, have been groundbreaking. However, SDS may provide inaccurate gradient directions under the weak diffusion guidance, as it tends to produce over-smoothed results and generate body textures that are inconsistent with the detailed mesh geometry. Therefore, directly leveraging existing strategies for high-fidelity text-to-3D human texturing is challenging. In this work, we propose a model called PaintHuman to address the challenges from two perspectives. We first propose a novel score function, Denoised Score Distillation (DSD), which directly modifies the SDS by introducing negative gradient components to iteratively correct the gradient direction and generate high-quality textures. In addition, we use the depth map as a geometric guide to ensure that the texture is semantically aligned to human mesh surfaces. To guarantee the quality of rendered results, we employ geometry-aware networks to predict surface materials and render realistic human textures. Extensive experiments, benchmarked against state-of-the-art (SoTA) methods, validate the efficacy of our approach. Project page: https://painthuman.github.io/.



Paperid:754
Authors:Sheng Yu, Di-Hua Zhai, Yuanqing Xia
School of Automation, Beijing Institute of Technology, Beijing, China, School of Automation, Beijing Institute of Technology, Beijing, China Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China, School of Automation, Beijing Institute of Technology, Beijing, China
Abstract:
Although there has been significant progress in category-level object pose estimation in recent years, there is still considerable room for improvement. In this paper, we propose a novel transformer-based category-level 6D pose estimation method called CatFormer to enhance the accuracy of pose estimation. CatFormer comprises three main parts: a coarse deformation part, a fine deformation part, and a recurrent refinement part. In the coarse and fine deformation sections, we introduce a transformer-based deformation module that performs point cloud deformation and completion in the feature space. Additionally, after each deformation, we incorporate a transformer-based graph module to adjust fused features and establish geometric and topological relationships between points based on these features. Furthermore, we present an end-to-end recurrent refinement module that enables the prior point cloud to deform multiple times according to real scene features. We evaluate CatFormer's performance by training and testing it on CAMERA25 and REAL275 datasets. Experimental results demonstrate that CatFormer surpasses state-of-the-art methods. Moreover, we extend the usage of CatFormer to instance-level object pose estimation on the LINEMOD dataset, as well as object pose estimation in real-world scenarios. The experimental results validate the effectiveness and generalization capabilities of CatFormer. Our code and the supplemental materials are available at https://github.com/BIT-robot-group/CatFormer.



Paperid:755
Authors:Songsong Yu, Yifan Wang, Yunzhi Zhuge, Lijun Wang, Huchuan Lu
Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology
Abstract:
This paper aims to design monocular depth estimation models with better generalization abilities. To this end, we have conducted quantitative analysis and discovered two important insights. First, the Simulation Correlation phenomenon, commonly seen in long-tailed classification problems, also exists in monocular depth estimation, indicating that the imbalanced depth distribution in training data may be the cause of limited generalization ability. Second, the imbalanced and long-tail distribution of depth values extends beyond the dataset scale, and also manifests within each individual image, further exacerbating the challenge of monocular depth estimation. Motivated by the above findings, we propose the Distance-aware Multi-Expert (DME) depth estimation model. Unlike prior methods that handle different depth ranges indiscriminately, DME adopts a divide-and-conquer philosophy where each expert is responsible for depth estimation of regions within a specific depth range. As such, the depth distribution seen by each expert is more uniform and can be more easily predicted. A pixel-level routing module is further designed and learned to stitch the prediction of all experts into the final depth map. Experiments show that DME achieves state-of-the-art performance on both NYU-Depth v2 and KITTI, and also delivers favorable zero-shot generalization capability on unseen datasets.
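A minimal, hedged sketch of the divide-and-conquer idea (the number of experts, the depth ranges, the 1x1 convolution heads, and the softmax router are illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class DistanceAwareExperts(nn.Module):
    """Each expert regresses depth inside its own range and a pixel-level router
    softly stitches the per-expert predictions into the final depth map."""
    def __init__(self, in_ch, ranges=((0.0, 5.0), (5.0, 20.0), (20.0, 80.0))):
        super().__init__()
        self.ranges = ranges
        self.experts = nn.ModuleList([nn.Conv2d(in_ch, 1, 1) for _ in ranges])
        self.router = nn.Conv2d(in_ch, len(ranges), 1)

    def forward(self, feat):
        preds = []
        for (lo, hi), expert in zip(self.ranges, self.experts):
            preds.append(lo + (hi - lo) * torch.sigmoid(expert(feat)))  # depth within the expert's range
        preds = torch.cat(preds, dim=1)                                  # (B, E, H, W)
        weights = torch.softmax(self.router(feat), dim=1)                # pixel-level routing weights
        return (weights * preds).sum(dim=1, keepdim=True)                # stitched depth map

depth = DistanceAwareExperts(in_ch=64)(torch.randn(2, 64, 32, 32))
print(depth.shape)  # torch.Size([2, 1, 32, 32])
```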



Paperid:756
Authors:Xiaoxuan Yu, Hao Wang, Weiming Li, Qiang Wang, Soonyong Cho, Younghun Sung
Samsung Research China - Beijing, Samsung Research China - Beijing, Samsung Research China - Beijing, Samsung Research China - Beijing, Samsung Advanced Institute of Technology, Samsung Advanced Institute of Technology
Abstract:
Point scene understanding is a challenging task to process real-world scene point clouds, which aims at segmenting each object, estimating its pose, and reconstructing its mesh simultaneously. The recent state-of-the-art method first segments each object and then processes each independently with multiple stages for the different sub-tasks. This leads to a complex pipeline to optimize and makes it hard to leverage the relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation to facilitate learning with multiple objects for the multiple sub-tasks in a unified manner. Each object is represented as a query, and a Transformer decoder is adapted to iteratively optimize all the queries involving their relationship. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to semantic information and geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to make full use of the supervision from all the sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR.



Paperid:757
Authors:Xuanlong Yu, Gianni Franchi, Jindong Gu, Emanuel Aldea
SATIE, Paris-Saclay University U2IS, ENSTA Paris, Institut Polytechnique de Paris, U2IS, ENSTA Paris, Institut Polytechnique de Paris, University of Oxford, SATIE, Paris-Saclay University
Abstract:
Uncertainty quantification is critical for deploying deep neural networks (DNNs) in real-world applications. An Auxiliary Uncertainty Estimator (AuxUE) is one of the most effective means to estimate the uncertainty of the main task prediction without modifying the main task model. To be considered robust, an AuxUE must be capable of maintaining its performance and triggering higher uncertainties while encountering Out-of-Distribution (OOD) inputs, i.e., to provide robust aleatoric and epistemic uncertainty. However, for vision regression tasks, current AuxUE designs are mainly adopted for aleatoric uncertainty estimates, and AuxUE robustness has not been explored. In this work, we propose a generalized AuxUE scheme for more robust uncertainty quantification on regression tasks. Concretely, to achieve a more robust aleatoric uncertainty estimation, different distribution assumptions are considered for heteroscedastic noise, and the Laplace distribution is finally chosen to approximate the prediction error. For epistemic uncertainty, we propose a novel solution named Discretization-Induced Dirichlet pOsterior (DIDO), which models the Dirichlet posterior on the discretized prediction error. Extensive experiments on age estimation, monocular depth estimation, and super-resolution tasks show that our proposed method can provide robust uncertainty estimates in the face of noisy inputs and that it scales to both image-level and pixel-wise tasks.
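For the aleatoric part, the Laplace assumption corresponds (up to a constant) to the negative log-likelihood sketched below, where the per-pixel log-scale output `log_b` of the auxiliary estimator is an assumed parameterization used for illustration:

```python
import torch

def laplace_nll(pred, target, log_b):
    """Negative log-likelihood of the prediction error under a Laplace distribution
    with scale b = exp(log_b); minimizing it trains the auxiliary estimator to output
    larger scales (higher aleatoric uncertainty) where the main-task error is larger."""
    b = log_b.exp()
    return (torch.abs(pred - target) / b + log_b).mean()

# toy usage on a depth-like regression map
pred, target = torch.randn(2, 1, 16, 16), torch.randn(2, 1, 16, 16)
log_b = torch.zeros_like(pred, requires_grad=True)
loss = laplace_nll(pred, target, log_b)
loss.backward()
print(float(loss))
```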



Paperid:758
Authors:Zhidong Yu, Wei Yang, Xike Xie, Zhenbo Shi
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China Hefei National Laboratory, Hefei 230088, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China
Abstract:
As an essential computer vision task, Continual Semantic Segmentation (CSS) has received a lot of attention. However, security issues regarding this task have not been fully studied. To bridge this gap, we study the problem of attacks in CSS in this paper. We first propose a new task, namely, attacks on incremental samples in CSS, and reveal that the attacks on incremental samples corrupt the performance of CSS in both old and new classes. Moreover, we present an adversarial sample generation method based on class shift, namely Class Shift Attack (CS-Attack), which is an offline and easy-to-implement approach for CSS. CS-Attack is able to significantly degrade the performance of models on both old and new classes without knowledge of the incremental learning approach, which undermines the original purpose of the incremental learning, i.e., learning new classes while retaining old knowledge. Experiments show that on the popular datasets Pascal VOC, ADE20k, and Cityscapes, our approach easily degrades the performance of currently popular CSS methods, which reveals the importance of security in CSS.
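The class-shift idea can be approximated with a generic PGD-style loop that pushes each pixel's prediction towards a cyclically shifted label; this is a hedged sketch of the concept only (the step sizes, the cyclic shift c -> (c+1) mod K, and the optimization loop are assumptions, not the authors' offline CS-Attack procedure):

```python
import torch
import torch.nn.functional as F

def class_shift_attack(model, images, labels, num_classes, eps=8/255, alpha=2/255, steps=10):
    """Perturb `images` so that a segmentation model's per-pixel predictions drift
    towards the cyclically shifted labels instead of the ground truth."""
    shifted = (labels + 1) % num_classes
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        logits = model(images + delta)                 # (B, K, H, W)
        loss = F.cross_entropy(logits, shifted)        # low loss = predictions match shifted classes
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()         # descend: make shifted classes more likely
            delta.clamp_(-eps, eps)                    # keep the perturbation within the budget
        delta.grad.zero_()
    return (images + delta).detach()

# toy usage with a hypothetical 5-class per-pixel classifier
model = torch.nn.Conv2d(3, 5, 1)
adv = class_shift_attack(model, torch.rand(1, 3, 16, 16),
                         torch.randint(0, 5, (1, 16, 16)), num_classes=5)
print(adv.shape)
```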



Paperid:759
Authors:Xiaojian Yuan, Kejiang Chen, Wen Huang, Jie Zhang, Weiming Zhang, Nenghai Yu
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, Nanyang Technological University, University of Science and Technology of China, University of Science and Technology of China
Abstract:
The popularity of Machine Learning as a Service (MLaaS) has led to increased concerns about Model Stealing Attacks (MSA), which aim to craft a clone model by querying MLaaS. Currently, most research on MSA assumes that MLaaS can provide soft labels and that the attacker has a proxy dataset with a similar distribution. However, this fails to encapsulate the more practical scenario where only hard labels are returned by MLaaS and the data distribution remains elusive. Furthermore, most existing work focuses solely on stealing the model accuracy, neglecting the model robustness, while robustness is essential in security-sensitive scenarios, e.g., face-scan payment. Notably, improving model robustness often necessitates the use of expensive techniques such as adversarial training, thereby further making stealing robustness a more lucrative prospect. In response to these identified gaps, we introduce a novel Data-Free Hard-Label Robustness Stealing (DFHL-RS) attack in this paper, which enables the stealing of both model accuracy and robustness by simply querying hard labels of the target model without the help of any natural data. Comprehensive experiments demonstrate the effectiveness of our method. The clone model achieves a clean accuracy of 77.86% and a robust accuracy of 39.51% against AutoAttack, which are only 4.71% and 8.40% lower than the target model on the CIFAR-10 dataset, significantly exceeding the baselines. Our code is available at: https://github.com/LetheSec/DFHL-RS-Attack.



Paperid:760
Authors:Yutao Yuan, Chun Yuan
Tsinghua University, Tsinghua University
Abstract:
Image super-resolution is a fundamentally ill-posed problem because multiple valid high-resolution images exist for one low-resolution image. Super-resolution methods based on diffusion probabilistic models can deal with the ill-posed nature by learning the distribution of high-resolution images conditioned on low-resolution images, avoiding the problem of blurry images in PSNR-oriented methods. However, existing diffusion-based super-resolution methods have high time consumption with the use of iterative sampling, while the quality and consistency of generated images are less than ideal due to problems like color shifting. In this paper, we propose Efficient Conditional Diffusion Model with Probability Flow Sampling (ECDP) for image super-resolution. To reduce the time consumption, we design a continuous-time conditional diffusion model for image super-resolution, which enables the use of probability flow sampling for efficient generation. Additionally, to improve the consistency of generated images, we propose a hybrid parametrization for the denoiser network, which interpolates between the data-predicting parametrization and the noise-predicting parametrization for different noise scales. Moreover, we design an image quality loss as a complement to the score matching loss of diffusion models, further improving the consistency and quality of super-resolution. Extensive experiments on DIV2K, ImageNet, and CelebA demonstrate that our method achieves higher super-resolution quality than existing diffusion-based image super-resolution methods while having lower time consumption. Our code is available at https://github.com/Yuan-Yutao/ECDP.
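The hybrid parametrization can be sketched, under assumptions, as a noise-scale-dependent blend of an x0-prediction and an eps-prediction (the variance-exploding relation x_t = x_0 + sigma * eps and the logistic weight are illustrative choices, not the paper's exact form):

```python
import torch

def hybrid_denoise(x_t, sigma, data_branch, noise_branch):
    """Blend an x0-prediction and an eps-prediction into one x0 estimate, with a
    weight that depends on the noise scale sigma."""
    eps_hat = noise_branch(x_t, sigma)
    x0_from_eps = x_t - sigma * eps_hat           # convert the eps-prediction to an x0 estimate
    x0_direct = data_branch(x_t, sigma)
    w = torch.sigmoid(torch.log(sigma))           # small sigma: trust the eps branch; large sigma: the x0 branch
    return w * x0_direct + (1.0 - w) * x0_from_eps

# toy usage with a single dummy network standing in for both branches
backbone = lambda x, s: torch.tanh(x)
x0 = hybrid_denoise(torch.randn(1, 3, 8, 8), torch.tensor(0.5), backbone, backbone)
print(x0.shape)
```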



Paperid:761
Authors:Zhenlong Yuan, Jiakai Cao, Zhaoxin Li, Hao Jiang, Zhaoqi Wang
Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Agricultural Information Institute, Chinese Academy of Agricultural Sciences Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences
Abstract:
In this paper, we introduce Segmentation-Driven Deformation Multi-View Stereo (SD-MVS), a method that can effectively tackle challenges in 3D reconstruction of textureless areas. We are the first to adopt the Segment Anything Model (SAM) to distinguish semantic instances in scenes and further leverage these constraints for pixelwise patch deformation on both matching cost and propagation. Concurrently, we propose a unique refinement strategy that combines spherical coordinates and gradient descent on normals and pixelwise search interval on depths, significantly improving the completeness of the reconstructed 3D model. Furthermore, we adopt the Expectation-Maximization (EM) algorithm to alternately optimize the aggregate matching cost and hyperparameters, effectively mitigating the problem of parameters being excessively dependent on empirical tuning. Evaluations on the ETH3D high-resolution multi-view stereo benchmark and the Tanks and Temples dataset demonstrate that our method can achieve state-of-the-art results with less time consumption.



Paperid:762
Authors:Huanjing Yue, Zifan Cui, Kun Li, Jingyu Yang
Tianjin University, Tianjin University, Tianjin University, Tianjin University
Abstract:
Dual-lens super-resolution (SR) is a practical scenario for reference (Ref) based SR by utilizing the telephoto image (Ref) to assist the super-resolution of the low-resolution wide-angle image (LR input). Different from general RefSR, the Ref in dual-lens SR only covers the overlapped field of view (FoV) area. However, current dual-lens SR methods rarely utilize these specific characteristics and directly perform dense matching between the LR input and Ref. Due to the resolution gap between LR and Ref, the matching may miss the best-matched candidate and destroy the consistent structures in the overlapped FoV area. Different from them, we propose to first align the Ref with the center region (namely the overlapped FoV area) of the LR input by combining global warping and local warping to make the aligned Ref be sharp and consistent. Then, we formulate the aligned Ref and LR center as value-key pairs, and the corner region of the LR is formulated as queries. In this way, we propose a kernel-free matching strategy by matching between the LR-corner (query) and LR-center (key) regions, and the corresponding aligned Ref (value) can be warped to the corner region of the target. Our kernel-free matching strategy avoids the resolution gap between LR and Ref, which makes our network have better generalization ability. In addition, we construct a DuSR-Real dataset with (LR, Ref, HR) triples, where the LR and HR are well aligned. Experiments on three datasets demonstrate that our method outperforms the second-best method by a large margin. Our code and dataset are available at https://github.com/ZifanCui/KeDuSR.



Paperid:763
Authors:Wenxi Yue, Jing Zhang, Kun Hu, Yong Xia, Jiebo Luo, Zhiyong Wang
The University of Sydney, The University of Sydney, The University of Sydney, Northwestern Polytechnical University, University of Rochester, The University of Sydney
Abstract:
The Segment Anything Model (SAM) is a powerful foundation model that has revolutionised image segmentation. To apply SAM to surgical instrument segmentation, a common approach is to locate precise points or boxes of instruments and then use them as prompts for SAM in a zero-shot manner. However, we observe two problems with this naive pipeline: (1) the domain gap between natural objects and surgical instruments leads to inferior generalisation of SAM; and (2) SAM relies on precise point or box locations for accurate segmentation, requiring either extensive manual guidance or a well-performing specialist detector for prompt preparation, which leads to a complex multi-stage pipeline. To address these problems, we introduce SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM to effectively integrate surgical-specific information with SAM’s pre-trained knowledge for improved generalisation. Specifically, we propose a lightweight prototype-based class prompt encoder for tuning, which directly generates prompt embeddings from class prototypes and eliminates the use of explicit prompts for improved robustness and a simpler pipeline. In addition, to address the low inter-class variance among surgical instrument categories, we propose contrastive prototype learning, further enhancing the discrimination of the class prototypes for more accurate class prompting. The results of extensive experiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that SurgicalSAM achieves state-of-the-art performance while only requiring a small number of tunable parameters. The source code is available at https://github.com/wenxi-yue/SurgicalSAM.



Paperid:764
Authors:Ziyu Yue, Jiaxin Gao, Zhixun Su
Dalian University of Technology, Key Laboratory for Computational Mathematics and Data Intelligence of Liaoning Province, Dalian University of Technology, Dalian University of Technology, Key Laboratory for Computational Mathematics and Data Intelligence of Liaoning Province
Abstract:
Existing super-resolution methods exhibit limitations when applied to nighttime scenes, primarily due to their lack of adaptation to low-pair dynamic range and noise-heavy dark-light images. In response, this research introduces an innovative customized framework to simultaneously Brighten and Zoom in low-resolution images captured in low-light conditions, dubbed BrZoNet. The core method begins by feeding low-light, low-resolution images, and their corresponding ground truths into the Retinex-induced siamese decoupling network. This process yields distinct reflectance maps and illuminance maps, guided by supervision from the ground truth’s decomposition maps. Subsequently, these reflectance and illuminance maps transition into an intricate super-resolution sub-network. This sub-network employs a meticulously designed cross-layer content-aware interactor - Illumination-aware Interaction Unit (IaIU), elegantly endowed with a gating mechanism. The IaIU facilitates meaningful feature interaction between illuminance and reflectance features while effectively reducing unwanted noise. An intricate super-resolution cage is also constructed to comprehensively integrate information, ultimately resulting in the generation of high-resolution images featuring intricate details. Thorough and diverse experiments validate the superiority of the proposed BrZoNet, surpassing contemporary cutting-edge technologies by proficiently augmenting brightness and intricately recovering complex details, showcasing advancements of 7.1% in PSNR, 2.4% in SSIM, and an impressive 36.8% in LPIPS metrics.



Paperid:765
Authors:Wulian Yun, Mengshi Qi, Chuanming Wang, Huadong Ma
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia Beijing University of Posts and Telecommunications, Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia Beijing University of Posts and Telecommunications, Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia Beijing University of Posts and Telecommunications, Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia Beijing University of Posts and Telecommunications
Abstract:
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously by taking only video-level labels as the supervision. Pseudo label generation is a promising strategy to solve the challenging problem, but the current methods ignore the natural temporal structure of the video that can provide rich information to assist such a generation process. In this paper, we propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature. First, we design a saliency inference module that exploits the variation relationship between temporal neighbor snippets to discover salient snippet-features, which can reflect the significant dynamic change in the video. Secondly, we introduce a boundary refinement module that enhances salient snippet-features through the information interaction unit. Then, a discrimination enhancement module is introduced to enhance the discriminative nature of snippet-features. Finally, we adopt the refined snippet-features to produce high-fidelity pseudo labels, which could be used to supervise the training of the action localization network. Extensive experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet v1.3, demonstrate our proposed method achieves significant improvements compared to the state-of-the-art methods. Our source code is available at https://github.com/wuli55555/ISSF.



Paperid:766
Authors:Xiao Yun, Chenglong Xu, Kevin Riou, Kaiwen Dong, Yanjing Sun, Song Li, Kevin Subrin, Patrick Le Callet
China University of Mining and Technology, China University of Mining and Technology, Nantes Université, China University of Mining and Technology, China University of Mining and Technology, China University of Mining and Technology, Nantes Université, Nantes Université
Abstract:
The deployment of a multi-stream fusion strategy on behavioral recognition from skeletal data can extract complementary features from different information streams and improve the recognition accuracy, but suffers from high model complexity and a large number of parameters. Besides, existing multi-stream methods using a fixed adjacency matrix homogenize the model's discrimination process across diverse actions, reducing the actual gain of the multi-stream model. Finally, attention mechanisms are commonly applied to multi-dimensional features, including the spatial, temporal and channel dimensions. But their attention scores are typically fused in a concatenated manner, leading them to ignore the interrelation between joints in complex actions. To alleviate these issues, the Front-Rear dual Fusion Graph Convolutional Network (FRF-GCN) is proposed to provide a lightweight model based on skeletal data. Targeted adjacency matrices are also designed for different front fusion streams, allowing the model to focus on actions of varying magnitudes. Simultaneously, the mechanism of Spatial-Temporal-Channel Parallel Attention (STC-P), which processes attention in parallel and places greater emphasis on useful information, is proposed to further improve the model's performance. FRF-GCN demonstrates significant competitiveness compared to the current state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120 and Kinetics-Skeleton 400 datasets. Our code is available at: https://github.com/sunbeam-kkt/FRF-GCN-master.
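One way to read the parallel attention design is sketched below (a hedged illustration: the pooling choices, the temporal 1D convolution, and the multiplicative fusion are assumptions standing in for STC-P's exact structure):

```python
import torch
import torch.nn as nn

class STCParallelAttention(nn.Module):
    """Spatial (joint), temporal, and channel attention computed in parallel on a
    skeleton feature of shape (B, C, T, V) and applied jointly, rather than
    concatenating the three attention scores."""
    def __init__(self, channels, joints):
        super().__init__()
        self.channel_fc = nn.Linear(channels, channels)
        self.joint_fc = nn.Linear(joints, joints)
        self.time_conv = nn.Conv1d(channels, 1, kernel_size=9, padding=4)

    def forward(self, x):                      # x: (B, C, T, V)
        B, C, T, V = x.shape
        a_c = torch.sigmoid(self.channel_fc(x.mean(dim=(2, 3)))).view(B, C, 1, 1)
        a_v = torch.sigmoid(self.joint_fc(x.mean(dim=(1, 2)))).view(B, 1, 1, V)
        a_t = torch.sigmoid(self.time_conv(x.mean(dim=3))).view(B, 1, T, 1)
        return x * a_c * a_t * a_v             # parallel attentions applied jointly

out = STCParallelAttention(channels=64, joints=25)(torch.randn(2, 64, 32, 25))
print(out.shape)
```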



Paperid:767
Authors:Zhengqing Zang, Chenyu Lin, Chenwei Tang, Tao Wang, Jiancheng Lv
College of Computer Science, Sichuan University, Chengdu, 610065, P. R. China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Chengdu, 610065, P. R. China, College of Computer Science, Sichuan University, Chengdu, 610065, P. R. China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Chengdu, 610065, P. R. China, College of Computer Science, Sichuan University, Chengdu, 610065, P. R. China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Chengdu, 610065, P. R. China, College of Computer Science, Sichuan University, Chengdu, 610065, P. R. China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Chengdu, 610065, P. R. China, College of Computer Science, Sichuan University, Chengdu, 610065, P. R. China Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Chengdu, 610065, P. R. China
Abstract:
Existing object detection models are mainly trained on large-scale labeled datasets. However, annotating data for novel aerial object classes is expensive since it is time-consuming and may require expert knowledge. Thus, it is desirable to study label-efficient object detection methods on aerial images. In this work, we propose a zero-shot method for aerial object detection named visual Description Regularization, or DescReg. Concretely, we identify the weak semantic-visual correlation of the aerial objects and aim to address the challenge with prior descriptions of their visual appearance. Instead of directly encoding the descriptions into class embedding space which suffers from the representation gap problem, we propose to infuse the prior inter-class visual similarity conveyed in the descriptions into the embedding learning. The infusion process is accomplished with a newly designed similarity-aware triplet loss which incorporates structured regularization on the representation space. We conduct extensive experiments with three challenging aerial object detection datasets, including DIOR, xView, and DOTA. The results demonstrate that DescReg significantly outperforms the state-of-the-art ZSD methods with complex projection designs and generative frameworks, e.g., DescReg outperforms the best reported ZSD method on DIOR by 4.5 mAP on unseen classes and 8.1 in HM. We further show the generalizability of DescReg by integrating it into generative ZSD methods as well as varying the detection architecture. Codes will be released at https://github.com/zq-zang/DescReg.
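A hedged sketch of how description-derived inter-class similarity can regularize a triplet objective (the margin modulation shown is one plausible instantiation; the paper only specifies that the prior visual similarity is infused as structured regularization on the representation space):

```python
import torch
import torch.nn.functional as F

def similarity_aware_triplet(anchor, positive, negative, desc_sim, base_margin=0.5):
    """Triplet loss whose margin shrinks when the text descriptions say two classes
    look alike (desc_sim in [0, 1]), so visually similar classes are not pushed
    unreasonably far apart in the embedding space."""
    margin = base_margin * (1.0 - desc_sim)
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# toy usage: 128-d class embeddings and a described visual similarity of 0.7
a, p, n = (torch.randn(4, 128) for _ in range(3))
print(float(similarity_aware_triplet(a, p, n, desc_sim=torch.tensor(0.7))))
```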



Paperid:768
Authors:Bohan Zeng, Shanglin Li, Xuhui Liu, Sicheng Gao, Xiaolong Jiang, Xu Tang, Yao Hu, Jianzhuang Liu, Baochang Zhang
Beihang University, Beihang University, Beihang University, Beihang University, Xiaohongshu Inc., Xiaohongshu Inc., Xiaohongshu Inc., Shenzhen Institute of Advanced Technology, Shenzhen, China, Beihang University Zhongguancun Laboratory, Beijing, China Nanchang Institute of Technology, Nanchang, China
Abstract:
Brain signal visualization has emerged as an active research area, serving as a critical interface between the human visual system and computer vision models. Diffusion-based methods have recently shown promise in analyzing functional magnetic resonance imaging (fMRI) data, including the reconstruction of high-quality images consistent with original visual stimuli. Nonetheless, it remains a critical challenge to effectively harness the semantic and silhouette information extracted from brain signals. In this paper, we propose a novel approach, termed as Controllable Mind Visual Diffusion Model (CMVDM). Specifically, CMVDM first extracts semantic and silhouette information from fMRI data using attribute alignment and assistant networks. Then, a control model is introduced in conjunction with a residual block to fully exploit the extracted information for image synthesis, generating high-quality images that closely resemble the original visual stimuli in both semantic content and silhouette characteristics. Through extensive experimentation, we demonstrate that CMVDM outperforms existing state-of-the-art methods both qualitatively and quantitatively. Our code is available at https://github.com/zengbohan0217/CMVDM.



Paperid:769
Authors:Kunlun Zeng, Ri Cheng, Weimin Tan, Bo Yan
School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
Abstract:
Deep learning-based models have made great progress in image tampering localization, which aims to distinguish between manipulated and authentic regions. However, these models suffer from inefficient training. This is because they use ground-truth mask labels mainly through the cross-entropy loss, which prioritizes per-pixel precision but disregards the spatial location and shape details of manipulated regions. To address this problem, we propose a Mask-Guided Query-based Transformer Framework (MGQFormer), which uses ground-truth masks to guide the learnable query token (LQT) in identifying the forged regions. Specifically, we extract feature embeddings of ground-truth masks as the guiding query token (GQT) and feed GQT and LQT into MGQFormer to estimate fake regions, respectively. Then we make MGQFormer learn the position and shape information in ground-truth mask labels by proposing a mask-guided loss to reduce the feature distance between GQT and LQT. We also observe that such mask-guided training strategy has a significant impact on the convergence speed of MGQFormer training. Extensive experiments on multiple benchmarks show that our method significantly improves over state-of-the-art methods.
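The mask-guided objective can be sketched, under assumptions, as a simple feature-distance term between the learnable query token and the mask-derived guiding query token (the L2 form and the stop-gradient are illustrative choices; the paper only states that the feature distance between GQT and LQT is reduced):

```python
import torch
import torch.nn.functional as F

def mask_guided_loss(lqt, gqt):
    """Pull the learnable query token (LQT) towards the guiding query token (GQT)
    embedded from the ground-truth mask; gradients into the mask-embedding branch
    are stopped so only the LQT pathway is shaped by this term."""
    return F.mse_loss(lqt, gqt.detach())

lqt = torch.randn(2, 256, requires_grad=True)
gqt = torch.randn(2, 256)
mask_guided_loss(lqt, gqt).backward()
print(lqt.grad.shape)
```

In training, this term would typically be added to the usual per-pixel segmentation loss so that the LQT also inherits the position and shape cues carried by the mask embedding.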



Paperid:770
Authors:Mingfeng Zha, Yunqiang Pei, Guoqing Wang, Tianyu Li, Yang Yang, Wenbin Qian, Heng Tao Shen
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, Jiangxi Agricultural University, University of Electronic Science and Technology of China
Abstract:
Mirror detection is of great significance for avoiding false recognition of reflected objects in computer vision tasks. Existing mirror detection frameworks usually follow a supervised setting, which relies heavily on high quality labels and suffers from poor generalization. To resolve this, we instead propose the first weakly-supervised mirror detection framework and also provide the first scribble-based mirror dataset. Specifically, we relabel 10,158 images, most of which have a labeled pixel ratio of less than 0.01 and take only about 8 seconds to label. Considering that mirror regions usually show great scale variation and are often irregular and occluded, leading to incomplete detection or over-detection, we propose a local-global feature enhancement (LGFE) module to fully capture the context and details. Moreover, it is difficult to obtain the basic mirror structure using scribble annotation, and the distinction between foreground (mirror) and background (non-mirror) features is weakened by mirror reflections. Therefore, we propose a foreground-aware mask attention (FAMA), integrating mirror edges and semantic features to complete mirror regions and suppressing the influence of backgrounds. Finally, to improve the robustness of the network, we propose a prototype contrast loss (PCL) to learn more general foreground features across images. Extensive experiments show that our network outperforms relevant state-of-the-art weakly supervised methods, and even some fully supervised methods. The dataset and codes are available at https://github.com/winter-flow/WSMD.



Paperid:771
Authors:Yaohua Zha, Huizhen Ji, Jinmin Li, Rongsheng Li, Tao Dai, Bin Chen, Zhi Wang, Shu-Tao Xia
Tsinghua Shenzhen International Graduate School, Tsinghua University Research Center of Artificial Intelligence, Peng Cheng Laboratory, Tsinghua Shenzhen International Graduate School, Tsinghua University, Tsinghua Shenzhen International Graduate School, Tsinghua University, Tsinghua Shenzhen International Graduate School, Tsinghua University, College of Computer Science and Software Engineering, Shenzhen University, Harbin Institute of Technology, Shenzhen, Tsinghua Shenzhen International Graduate School, Tsinghua University, Tsinghua Shenzhen International Graduate School, Tsinghua University Research Center of Artificial Intelligence, Peng Cheng Laboratory
Abstract:
Learning 3D representations plays a critical role in masked autoencoder (MAE) based pre-training methods for point clouds, including single-modal and cross-modal MAE. Specifically, although cross-modal MAE methods learn strong 3D representations with the aid of knowledge from other modalities, they often suffer from heavy computational burdens and rely heavily on massive cross-modal data pairs that are often unavailable, which hinders their applications in practice. Instead, single-modal methods with solely point clouds as input are preferred in real applications due to their simplicity and efficiency. However, such methods easily suffer from limited 3D representations with global random mask input. To learn compact 3D representations, we propose a simple yet effective Point Feature Enhancement Masked Autoencoders (Point-FEMAE), which mainly consists of a global branch and a local branch to capture latent semantic features. Specifically, to learn more compact features, a shared-parameter Transformer encoder is introduced to extract point features from the global and local unmasked patches obtained by global random and local block mask strategies, followed by a specific decoder to reconstruct. Meanwhile, to further enhance features in the local branch, we propose a Local Enhancement Module with local patch convolution to perceive fine-grained local context at larger scales. Our method significantly improves the pre-training efficiency compared to cross-modal alternatives, and extensive downstream experiments underscore the state-of-the-art effectiveness, particularly outperforming our baseline (Point-MAE) by 5.16%, 5.00%, and 5.04% in three variants of ScanObjectNN, respectively. Code is available at https://github.com/zyh16143998882/AAAI24-PointFEMAE.
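A hedged sketch of the two masking strategies driving the global and local branches (patch counts, the mask ratio, and the nearest-neighbour block construction are assumptions for illustration, not the paper's exact procedure):

```python
import torch

def global_random_mask(num_patches, ratio=0.6):
    """Mask a random subset of point patches, irrespective of their spatial location."""
    n_mask = int(num_patches * ratio)
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:n_mask]] = True
    return mask

def local_block_mask(centers, ratio=0.6):
    """Mask a contiguous spatial block: pick a seed patch and mask its nearest
    neighbours until the ratio is reached. `centers`: (N, 3) patch centroids."""
    N = centers.shape[0]
    n_mask = int(N * ratio)
    seed = torch.randint(N, (1,))
    dist = torch.cdist(centers[seed], centers).squeeze(0)   # distances to all other patches
    idx = dist.argsort()[:n_mask]
    mask = torch.zeros(N, dtype=torch.bool)
    mask[idx] = True
    return mask

centers = torch.rand(64, 3)
print(global_random_mask(64).sum().item(), local_block_mask(centers).sum().item())
```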



Paperid:772
Authors:Jiang-Tian Zhai, Xialei Liu, Lu Yu, Ming-Ming Cheng
Nankai University, Nankai University, Tianjin University of Technology, Nankai University
Abstract:
Non-exemplar class incremental learning aims to learn both the new and old tasks without accessing any training data from the past. This strict restriction enlarges the difficulty of alleviating catastrophic forgetting since all techniques can only be applied to current task data. Considering this challenge, we propose a novel framework of fine-grained knowledge selection and restoration. The conventional knowledge distillation-based methods place overly strict constraints on the network parameters and features to prevent forgetting, which limits the training of new tasks. To loosen this constraint, we propose a novel fine-grained selective patch-level distillation to adaptively balance plasticity and stability. Some task-agnostic patches can be used to preserve the decision boundary of the old task, while patches containing the important foreground are favorable for learning the new task. Moreover, we employ a task-agnostic mechanism to generate more realistic prototypes of old tasks with the current task sample for reducing classifier bias for fine-grained knowledge restoration. Extensive experiments on CIFAR100, TinyImageNet and ImageNet-Subset demonstrate the effectiveness of our method. Code is available at https://github.com/scok30/vit-cil.



Paperid:773
Authors:Yajing Zhai, Yawen Zeng, Zhiyong Huang, Zheng Qin, Xin Jin, Da Cao
College of Computer Science and Electronic Engineering, Hunan University, Changsha, China Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China, College of Computer Science and Electronic Engineering, Hunan University, Changsha, China, National University of Singapore, NUS Research Institute in Chongqing, College of Computer Science and Electronic Engineering, Hunan University, Changsha, China, Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China, College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
Abstract:
Fine-grained attribute descriptions can significantly supplement valuable semantic information for person images, which is vital to the success of the person re-identification (ReID) task. However, current ReID algorithms typically fail to effectively leverage the rich contextual information available, primarily due to their reliance on simplistic and coarse utilization of image attributes. Recent advances in artificial intelligence generated content have made it possible to automatically generate plentiful fine-grained attribute descriptions and make full use of them. Thereby, this paper explores the potential of using the generated multiple person attributes as prompts in ReID tasks with off-the-shelf (large) models for more accurate retrieval results. To this end, we present a new framework called Multi-Prompts ReID (MP-ReID), based on prompt learning and language models, to fully exploit fine-grained attributes to assist the ReID task. Specifically, MP-ReID first learns to hallucinate diverse, informative, and promptable sentences for describing the query images. This procedure includes (i) explicit prompts stating which attributes a person has, and (ii) implicit learnable prompts for adjusting/conditioning the criteria used for matching this person's identity. Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models. Moreover, an alignment module is designed to fuse multi-prompts (i.e., explicit and implicit ones) progressively and mitigate the cross-modal gap. Extensive experiments on the existing attribute-involved ReID datasets, namely, Market1501 and DukeMTMC-reID, demonstrate the effectiveness and rationality of the proposed MP-ReID solution.



Paperid:774
Authors:Yang Zhan, Yuan Yuan, Zhitong Xiong
Northwestern Polytechnical University, Northwestern Polytechnical University, Techinical University of Munich
Abstract:
We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. A depth predictor is designed to explicitly learn geometry features. The dual text-guided adapter is proposed to refine multi-scale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and some insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be released.



Paperid:775
Authors:Bowen Zhang, Qing Liu, Jianming Zhang, Yilin Wang, Liyang Liu, Zhe Lin, Yifan Liu
The University of Adelaide, Adobe Research, Adobe Research, Adobe Research, The University of Adelaide, Adobe Research, The University of Adelaide
Abstract:
Amodal scene analysis entails interpreting the occlusion relationship among scene elements and inferring the possible shapes of the invisible parts. Existing methods typically frame this task as an extended instance segmentation or a pairwise object de-occlusion problem. In this work, we propose a new framework, which comprises a Holistic Occlusion Relation Inference (HORI) module followed by an instance-level Generative Mask Completion (GMC) module. Unlike previous approaches, which rely on mask completion results for occlusion reasoning, our HORI module directly predicts an occlusion relation matrix in a single pass. This approach is much more efficient than the pair-wise de-occlusion process and it naturally handles mutual occlusion, a common but often neglected situation. Moreover, we formulate the mask completion task as a generative process and use a diffusion-based GMC module for instance-level mask completion. This improves mask completion quality and provides multiple plausible solutions. We further introduce a large-scale amodal segmentation dataset with high-quality human annotations, including mutual occlusions. Experiments on our dataset and two public benchmarks demonstrate the advantages of our method. Code is publicly available at https://github.com/zbwxp/Amodal-AAAI.



Paperid:776
Authors:Boyu Zhang, Hongliang Yuan
University of California, Los Angeles Tencent AI Lab, Xiaomi Cooperation Tencent AI Lab
Abstract:
Generating high-quality, realistic rendering images for real-time applications generally requires tracing a few samples-per-pixel (spp) and using deep learning-based approaches to denoise the resulting low-spp images. Existing denoising methods necessitate a substantial time expenditure when rendering at high resolutions due to the physically-based sampling and network inference time burdens. In this paper, we propose a novel Monte Carlo sampling strategy to accelerate the sampling process and a corresponding denoiser, subpixel sampling reconstruction (SSR), to obtain high-quality images. Extensive experiments demonstrate that our method significantly outperforms previous approaches in denoising quality and reduces overall time costs, enabling real-time rendering capabilities at 2K resolution.



Paperid:777
Authors:Chiyu Zhang, Xiaogang Xu, Lei Wang, Zaiyan Dai, Jun Yang
Sichuan Normal University Nanjing University of Aeronautics and Astronautics, Zhejiang Lab Zhejiang University, Sichuan Normal University, Sichuan Normal University, Sichuan Normal University Visual Computing and Virtual Reality Key Laboratory of Sichuan Province
Abstract:
Transformer's recent integration into style transfer leverages its proficiency in establishing long-range dependencies, albeit at the expense of attenuated local modeling. This paper introduces Strips Window Attention Transformer (S2WAT), a novel hierarchical vision transformer designed for style transfer. S2WAT employs attention computation in diverse window shapes to capture both short- and long-range dependencies. The merged dependencies utilize the "Attn Merge" strategy, which adaptively determines spatial weights based on their relevance to the target. Extensive experiments on representative datasets show the proposed method's effectiveness compared to state-of-the-art (SOTA) transformer-based and other approaches. The code and pre-trained models are available at https://github.com/AlienZhang1996/S2WAT.



Paperid:778
Authors:Dehuan Zhang, Jingchun Zhou, Chunle Guo, Weishi Zhang, Chongyi Li
Dalian Maritime University, Dalian Maritime University, Nankai University, Dalian Maritime University, Nankai University
Abstract:
Visually restoring underwater scenes primarily involves mitigating interference from underwater media. Existing methods ignore the inherent scale-related characteristics in underwater scenes. Therefore, we present the synergistic multi-scale detail refinement via intrinsic supervision (SMDR-IS) for enhancing underwater scene details, which comprises multiple stages. The low-degradation stage from the original images furnishes the original stage with multi-scale details, achieved through feature propagation using the Adaptive Selective Intrinsic Supervised Feature (ASISF) module. By using intrinsic supervision, the ASISF module can precisely control and guide feature transmission across multi-degradation stages, enhancing multi-scale detail refinement and minimizing the interference from irrelevant information in the low-degradation stage. In the multi-degradation encoder-decoder framework of SMDR-IS, we introduce the Bifocal Intrinsic-Context Attention Module (BICA). Based on the intrinsic supervision principles, BICA efficiently exploits multi-scale scene information in images. BICA directs higher-resolution spaces by tapping into the insights of lower-resolution ones, underscoring the pivotal role of spatial contextual relationships in underwater image restoration. Throughout training, the inclusion of a multi-degradation loss function can enhance the network, allowing it to adeptly extract information across diverse scales. When benchmarked against state-of-the-art methods, SMDR-IS consistently showcases superior performance. Our code is available at https://github.com/zhoujingchun03/SMDR-IS



Paperid:779
Authors:Fangyuan Zhang, Tianxiang Pan, Jun-Hai Yong, Bin Wang
Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University
Abstract:
Current weakly-supervised semantic segmentation (WSSS) techniques concentrate on enhancing class activation maps (CAMs) with image-level annotations. Yet, the emphasis on producing these pseudo-labels often overshadows the pivotal role of training the segmentation model itself. This paper underscores the significant influence of noisy pseudo-labels on segmentation network performance, particularly in boundary regions. To address the above issues, we introduce a novel paradigm: Weak to Partial Supervision (W2P). At its core, W2P categorizes the pseudo-labels from WSSS into two unique supervisions: trustworthy clean labels and uncertain noisy labels. Next, our proposed partially-supervised framework adeptly employs these clean labels to rectify the noisy ones, thereby promoting the continuous enhancement of the segmentation model. To further optimize boundary segmentation, we incorporate a noise detection mechanism that specifically preserves boundary regions while eliminating noise. During the noise refinement phase, we adopt a boundary-conscious noise correction technique to extract comprehensive boundaries from noisy areas. Furthermore, we devise a boundary generation approach that assists in predicting intricate boundary zones. Evaluations on the PASCAL VOC 2012 and MS COCO 2014 datasets confirm our method's impressive segmentation capabilities across various pseudo-labels.



Paperid:780
Authors:Hai Zhang, Chunwei Wu, Guitao Cao, Hailing Wang, Wenming Cao
Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, China MoE Engineering Research Center of SW/HW Co-design Technology and Application, East China Normal University, China, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, China MoE Engineering Research Center of SW/HW Co-design Technology and Application, East China Normal University, China, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, China MoE Engineering Research Center of SW/HW Co-design Technology and Application, East China Normal University, China, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, China MoE Engineering Research Center of SW/HW Co-design Technology and Application, East China Normal University, China, College of Information Engineering, Shenzhen University, China
Abstract:
Editing real images authentically while also achieving cross-domain editing remains a challenge. Recent studies have focused on converting real images into latent codes and accomplishing image editing by manipulating these codes. However, merely manipulating the latent codes would constrain the edited images to the generator's image domain, hindering the attainment of diverse editing goals. In response, we propose an innovative image editing method called HyperEditor, which utilizes weight factors generated by hypernetworks to reassign the weights of the pre-trained StyleGAN2's generator. Guided by CLIP's cross-modal image-text semantic alignment, this innovative approach enables us to simultaneously accomplish authentic attribute editing and cross-domain style transfer, a capability not realized in previous methods. Additionally, we ascertain that modifying only the weights of specific layers in the generator can yield an equivalent editing result. Therefore, we introduce an adaptive layer selector, enabling our hypernetworks to autonomously identify the layers requiring output weight factors, which can further improve our hypernetworks' efficiency. Extensive experiments on abundant challenging datasets demonstrate the effectiveness of our method.



Paperid:781
Authors:Haiming Zhang, Xu Yan, Dongfeng Bai, Jiantao Gao, Pan Wang, Bingbing Liu, Shuguang Cui, Zhen Li
FNii, CUHK-Shenzhen, Shenzhen, China SSE, CUHK-Shenzhen, Shenzhen, China, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, SSE, CUHK-Shenzhen, Shenzhen, China FNii, CUHK-Shenzhen, Shenzhen, China, SSE, CUHK-Shenzhen, Shenzhen, China FNii, CUHK-Shenzhen, Shenzhen, China
Abstract:
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images. However, image-based scene perception encounters significant challenges in achieving accurate prediction due to the absence of geometric priors. In this paper, we address this issue by exploring cross-modal knowledge distillation in this task, i.e., we leverage a stronger multi-modal model to guide the visual model during training. In practice, we observe that directly applying features or logits alignment, proposed and widely used in bird's-eye-view (BEV) perception, does not yield satisfactory results. To overcome this problem, we introduce RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction. By employing differentiable volume rendering, we generate depth and semantic maps in perspective views and propose two novel consistency criteria between the rendered outputs of teacher and student models. Specifically, the depth consistency loss aligns the termination distributions of the rendered rays, while the semantic consistency loss mimics the intra-segment similarity guided by vision foundation models (VLMs). Experimental results on the nuScenes dataset demonstrate the effectiveness of our proposed method in improving various 3D occupancy prediction approaches, e.g., our proposed methodology enhances our baseline by 2.2% in the metric of mIoU and achieves 50% on the Occ3D benchmark.
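The depth consistency idea can be read as matching, for each camera ray, the termination (weight) distribution produced by differentiable volume rendering for the teacher and the student. Below is a minimal sketch under that reading; the function names and the choice of KL divergence are illustrative assumptions, not the released RadOcc code.

```python
import torch
import torch.nn.functional as F

def ray_termination_weights(density, deltas):
    """Standard volume-rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    alpha = 1.0 - torch.exp(-density * deltas)                              # (R, S)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    return trans * alpha                                                    # (R, S)

def depth_consistency_loss(student_density, teacher_density, deltas):
    """KL divergence between teacher and student termination distributions per ray."""
    w_s = ray_termination_weights(student_density, deltas) + 1e-8
    w_t = ray_termination_weights(teacher_density, deltas) + 1e-8
    w_s = w_s / w_s.sum(dim=1, keepdim=True)   # normalize to distributions along the ray
    w_t = w_t / w_t.sum(dim=1, keepdim=True)
    return F.kl_div(w_s.log(), w_t, reduction="batchmean")
```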



Paperid:782
Authors:Haiyu Zhang, Shaolin Su, Yu Zhu, Jinqiu Sun, Yanning Zhang
Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University
Abstract:
Single image super-resolution (SISR), especially in the real world, usually builds a large amount of LR-HR image pairs to learn representations that contain rich textural and structural information. However, relying on massive data for model training not only reduces training efficiency, but also causes heavy data storage burdens. In this paper, we attempt a pioneering study on dataset distillation (DD) for SISR problems to explore how data could be slimmed and compressed for the task. Unlike previous coreset selection methods which select a few typical examples directly from the original data, we remove the limitation that the selected data cannot be further edited, and propose to synthesize and optimize samples to preserve more task-useful representations. Concretely, by utilizing pre-trained GANs as a suitable approximation of realistic data distribution, we propose GSDD, which distills data in a latent generative space based on GAN-inversion techniques. By optimizing them to match with the practical data distribution in an informative feature space, the distilled data could then be synthesized. Experimental results demonstrate that when trained with our distilled data, GSDD can achieve comparable performance to the state-of-the-art (SOTA) SISR algorithms, while a nearly ×8 increase in training efficiency and a saving of almost 93.2% data storage space can be realized. Further experiments on challenging real-world data also demonstrate the promising generalization ability of GSDD.



Paperid:783
Authors:Hao Zhang, Fang Li, Lu Qi, Ming-Hsuan Yang, Narendra Ahuja
University of Illinois at Urbana-Champaign, University of Illinois at Urbana-Champaign, The University of California, Merced, University of California at Merced, University of Illinois at Urbana-Champaign, USA
Abstract:
Addressing Out-Of-Distribution (OOD) Segmentation and Zero-Shot Semantic Segmentation (ZS3) is challenging, necessitating segmenting unseen classes. Existing strategies adapt the class-agnostic Mask2Former (CA-M2F) tailored to specific tasks. However, these methods cater to singular tasks, demand training from scratch, and we demonstrate certain deficiencies in CA-M2F, which affect performance. We propose the Class-Agnostic Structure-Constrained Learning (CSL), a plug-in framework that can integrate with existing methods, thereby embedding structural constraints and achieving performance gains on tasks involving the unseen, specifically OOD, ZS3, and domain adaptation (DA) tasks. There are two schemes for CSL to integrate with existing methods: (1) distilling knowledge from a base teacher network, enforcing constraints across the training and inference phases, or (2) leveraging established models to obtain per-pixel distributions without retraining, appending constraints during the inference phase. Our soft assignment and mask split methodologies enhance OOD object segmentation. Empirical evaluations demonstrate CSL's prowess in boosting the performance of existing algorithms spanning OOD segmentation, ZS3, and DA segmentation, consistently transcending the state-of-the-art across all three tasks.



Paperid:784
Authors:Hao Zhang, Xuhui Zuo, Huabing Zhou, Tao Lu, Jiayi Ma
Wuhan University, Wuhan University, Wuhan Institute of Technology, Wuhan Institute of Technology, Wuhan University
Abstract:
This work proposes a robust 3D medical image fusion framework to establish a mutual-reinforcing mechanism between visual fusion and lesion segmentation, achieving their double improvement. Specifically, we explore the consistency between vision and semantics by sharing feature fusion modules. Through the coupled optimization of the visual fusion loss and the lesion segmentation loss, visual-related and semantic-related features will be pulled into the same domain, effectively promoting accuracy improvement in a mutual-reinforcing manner. Further, we establish robustness guarantees by constructing a two-level refinement constraint in the process of feature extraction and reconstruction. Benefiting from full consideration of common degradations in medical images, our framework can not only provide clear visual fusion results for doctors' observation, but also enhance the defense ability of lesion segmentation against these degradations. Extensive evaluations of visual fusion and lesion segmentation scenarios demonstrate the advantages of our method in terms of accuracy and robustness. Moreover, our proposed framework is generic, which can be well-compatible with existing lesion segmentation algorithms and improve their performance. The code is publicly available at https://github.com/HaoZhang1018/RMR-Fusion.



Paperid:785
Authors:Hongquan Zhang, Bin-Bin Gao, Yi Zeng, Xudong Tian, Xin Tan, Zhizhong Zhang, Yanyun Qu, Jun Liu, Yuan Xie
East China Normal University Chongqing Institute of East China Normal University, Tencent YouTu Lab, Tencent YouTu Lab, East China Normal University Chongqing Institute of East China Normal University, East China Normal University Chongqing Institute of East China Normal University, East China Normal University Chongqing Institute of East China Normal University, Xiamen University, Tencent YouTu Lab, East China Normal University Chongqing Institute of East China Normal University
Abstract:
Class-incremental object detection (CIOD) is a real-world desired capability, requiring an object detector to continuously adapt to new tasks without forgetting learned ones, with the main challenge being catastrophic forgetting. Many methods based on distillation and replay have been proposed to alleviate this problem. However, they typically learn on a pure visual backbone, neglecting the powerful representation capabilities of textual cues, which to some extent limits their performance. In this paper, we propose task-aware language-image representation to mitigate catastrophic forgetting, introducing a new paradigm for language-image-based CIOD. First of all, we demonstrate the significant advantage of language-image detectors in mitigating catastrophic forgetting. Secondly, we propose a method for learning task-aware language-image representations that overcomes the existing drawback of directly utilizing the language-image detector for CIOD. More specifically, we learn the language-image representation of different tasks through an insulating approach in the training stage, while using the alignment scores produced by task-specific language-image representation in the inference stage. Through our proposed method, language-image detectors can be more practical for CIOD. We conduct extensive experiments on COCO 2017 and Pascal VOC 2007 and demonstrate that the proposed method achieves state-of-the-art results under various CIOD settings.



Paperid:786
Authors:Huatian Zhang, Lei Zhang, Kun Zhang, Zhendong Mao
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China
Abstract:
Image-text matching bridges vision and language, which is a fundamental task in multimodal intelligence. Its key challenge lies in how to capture visual-semantic relevance. Fine-grained semantic interactions come from fragment alignments between image regions and text words. However, not all fragments contribute to image-text relevance, and many existing methods are devoted to mining the vital ones to measure the relevance accurately. How well image and text relate depends on the degree of semantic sharing between them. Treating the degree as an effect and fragments as its possible causes, we define those indispensable causes for the generation of the degree as necessary undertakers, i.e., if any of them did not occur, the relevance would be no longer valid. In this paper, we revisit image-text matching in the causal view and uncover inherent causal properties of relevance generation. Then we propose a novel theoretical prototype for estimating the probability-of-necessity of fragments, PN_f, for the degree of semantic sharing by means of causal inference, and further design a Necessary Undertaker Identification Framework (NUIF) for image-text matching, which explicitly formalizes the fragment's contribution to image-text relevance by modeling PN_f in two ways. Extensive experiments show our method achieves state-of-the-art on benchmarks Flickr30K and MSCOCO.



Paperid:787
Authors:Huaxin Zhang, Xiang Wang, Xiaohao Xu, Zhiwu Qing, Changxin Gao, Nong Sang
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, University of Michigan, Ann Arbor, Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Abstract:
Point-supervised Temporal Action Localization (PSTAL) is an emerging research direction for label-efficient learning. However, current methods mainly focus on optimizing the network either at the snippet-level or the instance-level, neglecting the inherent reliability of point annotations at both levels. In this paper, we propose a Hierarchical Reliability Propagation (HR-Pro) framework, which consists of two reliability-aware stages: Snippet-level Discrimination Learning and Instance-level Completeness Learning; both stages explore the efficient propagation of high-confidence cues in point annotations. For snippet-level learning, we introduce an online-updated memory to store reliable snippet prototypes for each class. We then employ a Reliability-aware Attention Block to capture both intra-video and inter-video dependencies of snippets, resulting in more discriminative and robust snippet representation. For instance-level learning, we propose a point-based proposal generation approach as a means of connecting snippets and instances, which produces high-confidence proposals for further optimization at the instance level. Through multi-level reliability-aware learning, we obtain more reliable confidence scores and more accurate temporal boundaries of predicted proposals. Our HR-Pro achieves state-of-the-art performance on multiple challenging benchmarks, including an impressive average mAP of 60.3% on THUMOS14. Notably, our HR-Pro largely surpasses all previous point-supervised methods, and even outperforms several competitive fully-supervised methods. Code will be available at https://github.com/pipixin321/HR-Pro.



Paperid:788
Authors:Huichao Zhang, Bowen Chen, Hao Yang, Liao Qu, Xu Wang, Li Chen, Chao Long, Feida Zhu, Daniel Du, Min Zheng
ByteDance, ByteDance, ByteDance, ByteDance Carnegie Mellon University, ByteDance, ByteDance, ByteDance, ByteDance, ByteDance, ByteDance
Abstract:
Creating expressive, diverse and high-quality 3D avatars from highly customized text descriptions and pose guidance is a challenging task, due to the intricacy of modeling and texturing in 3D that ensure details and various styles (realistic, fictional, etc.). We present AvatarVerse, a stable pipeline for generating expressive high-quality 3D avatars from nothing but text descriptions and pose guidance. Specifically, we introduce a 2D diffusion model conditioned on DensePose signal to establish 3D pose control of avatars through 2D images, which enhances view consistency from partially observed scenarios. It addresses the infamous Janus Problem and significantly stabilizes the generation process. Moreover, we propose a progressive high-resolution 3D synthesis strategy, which obtains substantial improvement over the quality of the created 3D avatars. To this end, the proposed AvatarVerse pipeline achieves zero-shot 3D modeling of 3D avatars that are not only more expressive, but also in higher quality and fidelity than previous works. Rigorous qualitative evaluations and user studies showcase AvatarVerse's superiority in synthesizing high-fidelity 3D avatars, leading to a new standard in high-quality and stable 3D avatar creation. Our project page is: https://avatarverse3d.github.io/ .



Paperid:789
Authors:Jianping Zhang, Yizhan Huang, Zhuoer Xu, Weibin Wu, Michael R. Lyu
The Chinese University of Hong Kong, The Chinese University of Hong Kong, Antgroup, Sun Yat-sen University, The Chinese University of Hong Kong
Abstract:
With the great achievement of vision transformers (ViTs), transformer-based approaches have become the new paradigm for solving various computer vision tasks. However, recent research shows that similar to convolutional neural networks (CNNs), ViTs are still vulnerable to adversarial attacks. To explore the shared deficiency of models with different structures, researchers begin to analyze the cross-structure adversarial transferability, which is still under-explored. Therefore, in this work, we focus on the ViT attacks to improve the cross-structure transferability between the transformer-based and convolution-based models. Previous studies fail to thoroughly investigate the influence of the components inside the ViT models on adversarial transferability, leading to inferior performance. To overcome the drawback, we launch a motivating study by linearly down-scaling the gradients of components inside the ViT models to analyze their influence on adversarial transferability. Based on the motivating study, we find that the gradient of the skip connection most influences transferability and believe that back-propagating gradients from deeper blocks can enhance transferability. Therefore, we propose the Virtual Dense Connection method (VDC). Specifically, without changing the forward pass, we first recompose the original network to add virtual dense connections. Then we back-propagate gradients of deeper Attention maps and Multi-layer Perceptron (MLP) blocks via virtual dense connections when generating adversarial samples. Extensive experiments confirm the superiority of our proposed method over the state-of-the-art baselines, with an 8.2% improvement in transferability between ViT models and a 7.2% improvement in cross-structure transferability from ViTs to CNNs.
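The motivating study of linearly down-scaling component gradients can be emulated with a custom autograd function that is the identity in the forward pass but attenuates the gradient flowing back through a chosen branch. A minimal, hypothetical PyTorch sketch (not the authors' implementation) for probing a skip connection:

```python
import torch

class ScaleGrad(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by `scale` in the backward pass."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * ctx.scale, None

def block_with_scaled_skip(block, x, skip_scale=0.5):
    # The forward output is unchanged; only the gradient through the skip branch is
    # attenuated, which lets one probe how much that branch contributes to transferability.
    return block(x) + ScaleGrad.apply(x, skip_scale)
```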



Paperid:790
Authors:Jianping Zhang, Wenwei Gu, Yizhan Huang, Zhihan Jiang, Weibin Wu, Michael R. Lyu
The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, Sun Yat-sen University, The Chinese University of Hong Kong
Abstract:
Imperceptibility is one of the crucial requirements for adversarial examples. Previous adversarial attacks on 3D point cloud recognition suffer from noticeable outliers, resulting in low imperceptibility. We think that the drawbacks can be alleviated by taking the local curvature of the point cloud into consideration. Existing approaches introduce the local geometry distance into the attack objective function. However, their definition of the local geometry distance neglects different perceptibility of distortions along different directions. In this paper, we aim to enhance the imperceptibility of adversarial attacks on 3D point cloud recognition by better preserving the local curvature of the original 3D point clouds. To this end, we propose the Curvature-Invariant Method (CIM), which directly regularizes the back-propagated gradient during the generation of adversarial point clouds based on two assumptions. Specifically, we first decompose the back-propagated gradients into the tangent plane and the normal direction. Then we directly reduce the gradient along the large curvature direction on the tangent plane and only keep the gradient along the negative normal direction. Comprehensive experimental comparisons confirm the superiority of our approach. Notably, our strategy can achieve 7.2% and 14.5% improvements in Hausdorff distance and Gaussian curvature measurements of the imperceptibility.
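A rough sketch of the gradient decomposition step, assuming unit surface normals have already been estimated for each point; the per-direction curvature weighting on the tangent plane is omitted and all names are illustrative, not the paper's code.

```python
import torch

def curvature_aware_gradient(grad, normals, tangent_damp=0.1):
    """Decompose per-point gradients into normal / tangential parts and keep mainly
    the component along the negative surface normal.

    grad, normals: (N, 3) tensors; normals are assumed to be unit length.
    """
    g_dot_n = (grad * normals).sum(dim=1, keepdim=True)        # signed normal magnitude
    g_normal = g_dot_n * normals                                # component along the normal
    g_tangent = grad - g_normal                                 # component in the tangent plane
    # keep only the part pointing along the negative normal (into the surface)
    g_normal = torch.where(g_dot_n < 0, g_normal, torch.zeros_like(g_normal))
    # strongly damp the tangential part, where distortions are most perceptible
    return g_normal + tangent_damp * g_tangent
```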



Paperid:791
Authors:Jing Zhang, Xiaoqiang Liu, Mingzhe Chen, Zhe Wang
Department of Computer Science and Engineering, East China University of Science and Technology, China, Department of Computer Science and Engineering, East China University of Science and Technology, China, Department of Computer Science and Engineering, East China University of Science and Technology, China, Department of Computer Science and Engineering, East China University of Science and Technology, China
Abstract:
Few-shot Visual Question Answering (VQA) realizes few-shot cross-modal learning, which is an emerging and challenging task in computer vision. Currently, most of the few-shot VQA methods are confined to simply extending few-shot classification methods to cross-modal tasks while ignoring the spatial distribution properties of multimodal features and cross-modal information interaction. To address this problem, we propose a novel Cross-modal feature Distribution Calibration Inference Network (CDCIN) in this paper, where a new concept named visual information entropy is proposed to realize multimodal features distribution calibration by cross-modal information interaction for more effective few-shot VQA. Visual information entropy is a statistical variable that represents the spatial distribution of visual features guided by the question, which is aligned before and after the reasoning process to mitigate redundant information and improve multi-modal features by our proposed visual information entropy calibration module. To further enhance the inference ability of cross-modal features, we additionally propose a novel pre-training method, where the reasoning sub-network of CDCIN is pretrained on the base class in a VQA classification paradigm and fine-tuned on the few-shot VQA datasets. Extensive experiments demonstrate that our proposed CDCIN achieves excellent performance on few-shot VQA and outperforms state-of-the-art methods on three widely used benchmark datasets.



Paperid:792
Authors:Jingwen Zhang, Zikun Zhou, Guangming Lu, Jiandong Tian, Wenjie Pei
Harbin Institute of Technology, Shenzhen, Peng Cheng Laboratory, Harbin Institute of Technology, Shenzhen, Shenyang Institute of Automation, Chinese Academy of Sciences, Harbin Institute of Technology, Shenzhen
Abstract:
3D single object tracking remains a challenging problem due to the sparsity and incompleteness of the point clouds. Existing algorithms attempt to address these challenges with two strategies. The first strategy is to learn dense geometric features based on the captured sparse point cloud. Nevertheless, it is quite a formidable task since the learned dense geometric features have high uncertainty in depicting the shape of the target object. The other strategy is to aggregate the sparse geometric features of multiple templates to enrich the shape information, which is a routine solution in 2D tracking. However, aggregating the coarse shape representations can hardly yield a precise shape representation. Different from 2D pixels, 3D points of different frames can be directly fused by coordinate transform, i.e., shape completion. Considering that, we propose to construct a synthetic target representation composed of dense and complete point clouds depicting the target shape precisely by shape completion for robust 3D tracking. Specifically, we design a voxelized 3D tracking framework with shape completion, in which we propose a quality-aware shape completion mechanism to alleviate the adverse effect of noisy historical predictions. It enables us to effectively construct and leverage the synthetic target representation. Besides, we also develop a voxelized relation modeling module and box refinement module to improve tracking performance. Favorable performance against state-of-the-art algorithms on three benchmarks demonstrates the effectiveness and generalization ability of our method.



Paperid:793
Authors:Jingyi Zhang, Qihong Mao, Guosheng Hu, Siqi Shen, Cheng Wang
Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China, Oosto, Belfast, UK, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China, Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China
Abstract:
3D human pose estimation (3HPE) in large-scale outdoor scenes using commercial LiDAR has attracted significant attention due to its potential for real-life applications. However, existing LiDAR-based methods for 3HPE primarily rely on recovering 3D human poses from individual point clouds, and the coherence cues present in the neighborhood are not sufficiently harnessed. In this work, we explore spatial and contextual coherence cues contained in the neighborhood that lead to great performance improvements in 3HPE. Specifically, we first deeply investigate the 3D neighbor in the background (3BN), which serves as a spatial coherence cue for inferring reliable motion since it provides physical laws to limit motion targets. Secondly, we introduce a novel 3D scanning neighbor (3SN) generated during data collection, which implies structural edge coherence cues. We use 3SN to overcome the degradation of performance and data quality caused by the sparsity-varying properties of LiDAR point clouds. In order to effectively model the complementation between these distinct cues and build consistent temporal relationships across human motions, we propose a new transformer-based module called the CoherenceFuse module. Extensive experiments conducted on publicly available datasets, namely LidarHuman26M, CIMI4D, SLOPER4D and Waymo Open Dataset v2.0, showcase the superiority and effectiveness of our proposed method. In particular, when compared with LidarCap on the LidarHuman26M dataset, our method demonstrates a reduction of 7.08mm in the average MPJPE metric, along with a decrease of 16.55mm in the MPJPE metric for distances exceeding 25 meters. The code and models are available at https://github.com/jingyi-zhang/Neighborhood-enhanced-LidarCap.



Paperid:794
Authors:Junge Zhang, Feihu Zhang, Shaochen Kuang, Li Zhang
Fudan University, University of Oxford, South China University of Technology, Fudan University
Abstract:
Labelling LiDAR point clouds for training autonomous driving is extremely expensive and difficult. LiDAR simulation aims at generating realistic LiDAR data with labels for training and verifying self-driving algorithms more efficiently. Recently, Neural Radiance Fields (NeRF) have been proposed for novel view synthesis using implicit reconstruction of 3D scenes. Inspired by this, we present NeRF-LIDAR, a novel LiDAR simulation method that leverages real-world information to generate realistic LIDAR point clouds. Different from existing LiDAR simulators, we use real images and point cloud data collected by self-driving cars to learn the 3D scene representation, point cloud generation and label rendering. We verify the effectiveness of our NeRF-LiDAR by training different 3D segmentation models on the generated LiDAR point clouds. It reveals that the trained models are able to achieve similar accuracy when compared with the same model trained on the real LiDAR data. Besides, the generated data is capable of boosting the accuracy through pre-training which helps reduce the requirements of the real labeled data. Code is available at https://github.com/fudan-zvg/NeRF-LiDAR



Paperid:795
Authors:Kaiyi Zhang, Yang Chen, Ximing Yang, Weizhong Zhang, Cheng Jin
School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China Innovation Center of Calligraphy and Painting Creation Technology, MCT, China, School of Computer Science, Fudan University, Shanghai, China Innovation Center of Calligraphy and Painting Creation Technology, MCT, China
Abstract:
Ideal part editing should guarantee the diversity of edited parts, the fidelity to the remaining parts, and the quality of the results. However, previous methods do not disentangle each part completely, which means the edited parts will affect the others, resulting in poor diversity and fidelity. In addition, some methods lack constraints between parts, which necessitates manual selection of edited results to ensure quality. Therefore, we propose a four-stage process for point cloud part editing: Segmentation, Generation, Assembly, and Selection. Based on this process, we introduce SGAS, a model for part editing that employs two strategies: feature disentanglement and constraint. By independently fitting part-level feature distributions, we realize the feature disentanglement. By explicitly modeling the transformation from object-level distribution to part-level distributions, we realize the feature constraint. Extensive experiments on different datasets demonstrate the efficiency and effectiveness of SGAS on point cloud part editing. In addition, SGAS can be pruned to realize unsupervised part-aware point cloud generation and achieves state-of-the-art results.



Paperid:796
Authors:Li Zhang, Mingliang Xu, Dong Li, Jianming Du, Rujing Wang
Hefei Institute of Physical Science, Chinese Academy of Sciences, China University of Science and Technology of China, China, University of Science and Technology of China, China, University of Science and Technology of China, China, Hefei Institute of Physical Science, Chinese Academy of Sciences, China, Hefei Institute of Physical Science, Chinese Academy of Sciences, China University of Science and Technology of China, China
Abstract:
IFL (Image Forgery Location) helps secure digital media forensics. However, many methods suffer from false detections (i.e., FPs) and inaccurate boundaries. In this paper, we propose the Catmull-Rom Splines-based Regression Network (CSR-Net), which first rethinks the IFL task from the perspective of regression to deal with this problem. Specifically speaking, we propose an adaptive Catmull-Rom spline fitting scheme for coarse localization of the tampered regions. Then, for false positive cases, we first develop a novel re-scoring mechanism, which aims to filter out samples that cannot have responses on both the classification branch and the instance branch. Later on, to further restrict the boundaries, we design a learnable texture extraction module, which refines and enhances the contour representation by decoupling the horizontal and vertical forgery features to extract a more robust contour representation, thus suppressing FPs. Compared to segmentation-based methods, our method is simple but effective because it requires no post-processing. Extensive experiments show the superiority of CSR-Net over existing state-of-the-art methods, not only on standard natural image datasets but also on social media datasets.



Paperid:797
Authors:Lijun Zhang, Kangkang Zhou, Feng Lu, Xiang-Dong Zhou, Yu Shi
Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences Chongqing School, University of Chinese Academy of Sciences, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences Chongqing School, University of Chinese Academy of Sciences, Tsinghua Shenzhen International Graduate School, Tsinghua University Peng Cheng Laboratory, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences Chongqing School, University of Chinese Academy of Sciences, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences Chongqing School, University of Chinese Academy of Sciences
Abstract:
Most Graph Convolutional Network-based 3D human pose estimation (HPE) methods have focused on single-view 3D HPE and utilized certain spatial graphs, suffering from key problems such as depth ambiguity, insufficient feature representation, or limited receptive fields. To address these issues, we propose a multi-view 3D HPE framework based on a deep semantic graph transformer, which adaptively learns and fuses multi-view significant semantic features of human nodes to improve 3D HPE performance. First, we propose a deep semantic graph transformer encoder to enrich spatial feature information. It deeply mines the position, spatial structure, and skeletal edge knowledge of joints and dynamically learns their correlations. Then, we build a progressive multi-view spatial-temporal feature fusion framework to mitigate joint depth uncertainty. To enhance the pose spatial representation, deep spatial semantic features are interacted and fused across different viewpoints during monocular feature extraction. Furthermore, long-time relevant temporal dependencies are modeled and spatial-temporal information from all viewpoints is fused to intermediately supervise the depth. Extensive experiments on three 3D HPE benchmarks show that our method achieves state-of-the-art results. It can effectively enhance pose features, mitigate depth ambiguity in single-view 3D HPE, and improve 3D HPE performance without providing camera parameters. Codes and models are available at https://github.com/z0911k/SGraFormer.



Paperid:798
Authors:Lingjun Zhang, Xinyuan Chen, Yaohui Wang, Yue Lu, Yu Qiao
East China Normal University Shanghai AI Laboratory, Shanghai AI Laboratory, Shanghai AI Laboratory, East China Normal University, Shanghai AI Laboratory
Abstract:
Recently, diffusion-based image generation methods have been credited for their remarkable text-to-image generation capabilities, while still facing challenges in accurately generating multilingual scene text images. To tackle this problem, we propose Diff-Text, which is a training-free scene text generation framework for any language. Our model outputs a photo-realistic image given a text of any language along with a textual description of a scene. The model leverages rendered sketch images as priors, thus arousing the potential multilingual-generation ability of the pre-trained Stable Diffusion. Based on the observed influence of the cross-attention map on object placement in generated images, we propose a localized attention constraint in the cross-attention layer to address the unreasonable positioning problem of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms the existing method in both the accuracy of text recognition and the naturalness of foreground-background blending.



Paperid:799
Authors:Mingjin Zhang, Handi Yang, Jie Guo, Yunsong Li, Xinbo Gao, Jing Zhang
Xidian University, Xidian University, Xidian University, Xidian University, Xidian University, The University of Sydney
Abstract:
Infrared Small Target Detection (IRSTD) refers to detecting faint targets in infrared images, which has achieved notable progress with the advent of deep learning. However, the drive for improved detection accuracy has led to larger, intricate models with redundant parameters, causing storage and computation inefficiencies. In this pioneering study, we introduce the concept of utilizing network pruning to enhance the efficiency of IRSTD. Due to the challenge posed by low signal-to-noise ratios and the absence of detailed semantic information in infrared images, directly applying existing pruning techniques yields suboptimal performance. To address this, we propose a novel wavelet structure-regularized soft channel pruning method, giving rise to the efficient IRPruneDet model. Our approach involves representing the weight matrix in the wavelet domain and formulating a wavelet channel pruning strategy. We incorporate wavelet regularization to induce structural sparsity without incurring extra memory usage. Moreover, we design a soft channel reconstruction method that preserves important target information against premature pruning, thereby ensuring an optimal sparse structure while maintaining overall sparsity. Through extensive experiments on two widely-used benchmarks, our IRPruneDet method surpasses established techniques in both model complexity and accuracy. Specifically, when employing U-net as the baseline network, IRPruneDet achieves a 64.13% reduction in parameters and a 51.19% decrease in FLOPS, while improving IoU from 73.31% to 75.12% and nIoU from 70.92% to 74.30%. The code is available at https://github.com/hd0013/IRPruneDet.
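As an illustration of wavelet-domain structural regularization, a simplified stand-in (not the paper's exact regularizer) applies a single-level Haar transform to each output channel's weights and penalizes the per-channel L2 norm, which drives whole channels toward zero and marks them as soft-pruning candidates.

```python
import torch

def haar_1d(x):
    """Single-level 1-D Haar transform along the last dim (assumes even length)."""
    even, odd = x[..., 0::2], x[..., 1::2]
    return torch.cat([(even + odd), (even - odd)], dim=-1) / (2.0 ** 0.5)

def wavelet_group_sparsity(conv_weight):
    """Group-lasso (L2,1) penalty on each output channel, computed in the wavelet domain.

    conv_weight: (C_out, C_in, k, k). Channels whose wavelet-domain energy is driven
    toward zero become candidates for (soft) pruning.
    """
    w = conv_weight.flatten(1)                       # (C_out, C_in*k*k)
    if w.shape[1] % 2 == 1:                          # pad to even length for the Haar step
        w = torch.nn.functional.pad(w, (0, 1))
    w_wave = haar_1d(w)
    return w_wave.norm(dim=1).sum()                  # sum of per-channel L2 norms
```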



Paperid:800
Authors:Ning Zhang, Hiuyi Cheng, Jiayu Chen, Zongyuan Jiang, Jun Huang, Yang Xue, Lianwen Jin
South China University of Technology Alibaba Group, South China University of Technology, Alibaba Group, South China University of Technology, Alibaba Group, South China University of Technology, South China University of Technology SCUT-Zhuhai Institute of Modern Industrial Innovation, Zhuhai, China
Abstract:
Document layout analysis is a crucial step for intelligent document understanding. However, many existing methods primarily focus on the visual aspects and overlook the textual features of documents. Although document pre-trained models utilize multi-modal features during the pre-training phase, they tend to operate as a unimodal pipeline when it comes to layout analysis tasks. Furthermore, current multi-modal methods perform worse than unimodal detectors on complex layout analysis datasets. To address these limitations, we propose an effective and pluggable multi-modal fusion approach named M2Doc, which fuses visual and textual features for better layout detection. M2Doc contains two pluggable multi-modal fusion modules, early-fusion and late-fusion, which align and fuse visual and textual features at the pixel level and block level. Benefiting from the concision and effectiveness of M2Doc, it can be easily applied to various detectors for better layout detection, including two-stage and end-to-end object detectors. Our experimental results demonstrate significant performance improvements in detectors equipped with M2Doc on datasets such as DocLayNet (+11.3 mAP) and M6Doc (+1.9 mAP). Furthermore, through the integration of the DINO detector with M2Doc, we achieve state-of-the-art results on DocLayNet (89.0 mAP), M6Doc (69.9 mAP), and PubLayNet (95.5 mAP). The code will be publicly released at https://github.com/johnning2333/M2Doc.



Paperid:801
Authors:Qi Zhang, Yunfei Gong, Daijie Chen, Antoni B. Chan, Hui Huang
Shenzhen University, Shenzhen University, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) Shenzhen University, City University of Hong Kong, Shenzhen University
Abstract:
Recent deep learning-based multi-view people detection (MVD) methods have shown promising results on existing datasets. However, current methods are mainly trained and evaluated on small, single scenes with a limited number of multi-view frames and fixed camera views. As a result, these methods may not be practical for detecting people in larger, more complex scenes with severe occlusions and camera calibration errors. This paper focuses on improving multi-view people detection by developing a supervised view-wise contribution weighting approach that better fuses multi-camera information under large scenes. Besides, a large synthetic dataset is adopted to enhance the model's generalization ability and enable more practical evaluation and comparison. The model's performance on new testing scenes is further improved with a simple domain adaptation technique. Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance.
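A minimal sketch of view-wise contribution weighting for fusing per-view ground-plane feature maps; the module name, the 1x1 scoring head, and the softmax normalization are illustrative assumptions, and the supervision signal on the weights described in the abstract is omitted here.

```python
import torch
import torch.nn as nn

class ViewWeightedFusion(nn.Module):
    """Fuse per-view ground-plane feature maps with learned, view-wise contribution weights."""
    def __init__(self, channels):
        super().__init__()
        self.weight_head = nn.Conv2d(channels, 1, kernel_size=1)   # scores each view per location

    def forward(self, view_feats):
        # view_feats: (V, C, H, W) features projected onto a common ground plane
        scores = self.weight_head(view_feats)                       # (V, 1, H, W)
        weights = torch.softmax(scores, dim=0)                      # normalize across views
        return (weights * view_feats).sum(dim=0)                    # (C, H, W) fused map
```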



Paperid:802
Authors:Qingwang Zhang, Yingying Zhu
Shenzhen University, Shenzhen University
Abstract:
Cross-view geo-localization holds significant potential for various applications, but drastic differences in viewpoints and visual appearances between cross-view images make this task extremely challenging. Recent works have made notable progress in cross-view geo-localization. However, existing methods either ignore the correspondence between geometric spatial layout in cross-view images or require high costs or strict constraints to achieve such alignment. In response to these challenges, we propose a Feature Recombination Module (FRM) that explicitly establishes the geometric spatial layout correspondences between two views. Unlike existing methods, FRM aligns geometric spatial layout by directly recombining features, avoiding image preprocessing, and introducing no additional computational and parameter costs. This effectively reduces ambiguities caused by geometric misalignments between ground-level and aerial-level images. Furthermore, it is not sensitive to frameworks and applies to both CNN-based and Transformer-based architectures. Additionally, as part of the training procedure, we also introduce a novel weighted (B+1)-tuple loss (WBL) as the optimization objective. Compared to the widely used weighted soft margin ranking loss, this innovative loss enhances convergence speed and final performance. Based on the two core components (FRM and WBL), we develop an end-to-end network architecture (FRGeo) to address these limitations from a different perspective. Extensive experiments show that our proposed FRGeo not only achieves state-of-the-art performance on cross-view geo-localization benchmarks, including CVUSA, CVACT, and VIGOR, but also is significantly superior or competitive in terms of computational complexity and trainable parameters. Our project homepage is at https://zqwlearning.github.io/FRGeo.



Paperid:803
Authors:Renhong Zhang, Tianheng Cheng, Shusheng Yang, Haoyi Jiang, Shuai Zhang, Jiancheng Lyu, Xin Li, Xiaowen Ying, Dashan Gao, Wenyu Liu, Xinggang Wang
Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, Qualcomm AI Research, Qualcomm AI Research, Qualcomm AI Research, Qualcomm AI Research, Qualcomm AI Research, Huazhong University of Science and Technology, Huazhong University of Science and Technology
Abstract:
Video instance segmentation on mobile devices is an important yet very challenging edge AI problem. It mainly suffers from (1) heavy computation and memory costs for frame-by-frame pixel-level instance perception and (2) complicated heuristics for tracking objects. To address these issues, we present MobileInst, a lightweight and mobile-friendly framework for video instance segmentation on mobile devices. Firstly, MobileInst adopts a mobile vision transformer to extract multi-level semantic features and presents an efficient query-based dual-transformer instance decoder for mask kernels and a semantic-enhanced mask decoder to generate instance segmentation per frame. Secondly, MobileInst exploits simple yet effective kernel reuse and kernel association to track objects for video instance segmentation. Further, we propose temporal query passing to enhance the tracking ability for kernels. We conduct experiments on COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst and evaluate the inference latency on one single CPU core of the Snapdragon 778G Mobile Platform, without other methods of acceleration. On the COCO dataset, MobileInst achieves 31.2 mask AP and 433 ms on the mobile CPU, which reduces the latency by 50% compared to the previous SOTA. For video instance segmentation, MobileInst achieves 35.0 AP and 30.1 AP on YouTube-VIS 2019 & 2021.
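The kernel-association step can be thought of as matching instance mask kernels of consecutive frames by embedding similarity. A minimal, hypothetical greedy sketch (MobileInst's actual association rule may differ):

```python
import torch
import torch.nn.functional as F

def associate_kernels(prev_kernels, curr_kernels, sim_thresh=0.5):
    """Greedily match instance mask kernels across adjacent frames by cosine similarity.

    prev_kernels: (M, D), curr_kernels: (N, D). Returns (prev_idx, curr_idx) pairs;
    current kernels left unmatched would start new tracks.
    """
    sim = F.normalize(prev_kernels, dim=1) @ F.normalize(curr_kernels, dim=1).T  # (M, N)
    matches = []
    for _ in range(min(sim.shape)):
        flat = torch.argmax(sim).item()
        i, j = divmod(flat, sim.shape[1])
        if sim[i, j] < sim_thresh:        # remaining pairs are too dissimilar
            break
        matches.append((i, j))
        sim[i, :] = -1.0                  # remove the matched row and column
        sim[:, j] = -1.0
    return matches
```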



Paperid:804
Authors:Ruiyuan Zhang, Jiaxiang Liu, Zexi Li, Hao Dong, Jie Fu, Chao Wu
Zhejiang University, Zhejiang University, Zhejiang University, Peking University, Hong Kong University of Science and Technology, Zhejiang University
Abstract:
Geometric fracture assembly presents a challenging practical task in archaeology and 3D computer vision. Previous methods have focused solely on assembling fragments based on semantic information, which has limited the quantity of objects that can be effectively assembled. Therefore, there is a need to develop a scalable framework for geometric fracture assembly without relying on semantic information. To improve the effectiveness of assembling geometric fractures without semantic information, we propose a co-creation space comprising several assemblers capable of gradually and unambiguously assembling fractures. Additionally, we introduce a novel loss function, i.e., the geometric-based collision loss, to address collision issues during the fracture assembly process and enhance the results. Our framework exhibits better performance on both PartNet and Breaking Bad datasets compared to existing state-of-the-art frameworks. Extensive experiments and quantitative comparisons demonstrate the effectiveness of our proposed framework, which features linear computational complexity, enhanced abstraction, and improved generalization. Our code is publicly available at https://github.com/Ruiyuan-Zhang/CCS.



Paperid:805
Authors:Sheng Zhang, Muzammal Naseer, Guangyi Chen, Zhiqiang Shen, Salman Khan, Kun Zhang, Fahad Shahbaz Khan
Mohamed Bin Zayed University of Artificial Intelligence, Mohamed Bin Zayed University of Artificial Intelligence, Mohamed Bin Zayed University of Artificial Intelligence Carnegie Mellon University, Mohamed Bin Zayed University of Artificial Intelligence, Mohamed Bin Zayed University of Artificial Intelligence Australian National University, Mohamed Bin Zayed University of Artificial Intelligence Carnegie Mellon University, Mohamed Bin Zayed University of Artificial Intelligence Linkoping University
Abstract:
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite the success, most traditional VLMs-based methods are restricted by the assumption of partial source supervision or ideal target vocabularies, which rarely satisfy the open-world scenario. In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary. To address the new problem, we propose the Self Structural Semantic Alignment (S3A) framework, which extracts the structural semantic information from unlabeled data while simultaneously self-learning. Our S3A framework adopts a unique Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. Our CVPR algorithm includes iterative clustering on images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to discern confusing candidates, and realigning images and the vocabulary as structural semantic alignment. Finally, we propose to self-train the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that the S3A method substantially improves over existing VLMs-based approaches, achieving a more than 15% accuracy improvement over CLIP on average. Our codes, models, and prompts are publicly released at https://github.com/sheng-eatamath/S3A.
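The cluster-then-vote step can be sketched as grouping unlabeled image embeddings and letting each cluster vote for vocabulary candidates by similarity to class-name text embeddings. The snippet below is a minimal illustration under that reading (k-means, mean-similarity voting, and all names are assumptions, not the released S3A code).

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_vote(img_emb, text_emb, n_clusters=100, top_k=3):
    """Group unlabeled image embeddings and let each cluster vote for class candidates.

    img_emb: (N, D) L2-normalized image embeddings (e.g. from CLIP).
    text_emb: (V, D) L2-normalized embeddings of the vocabulary class names.
    Returns the cluster labels and, per cluster, the indices of the top-k candidate classes.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(img_emb)
    candidates = []
    for c in range(n_clusters):
        members = img_emb[labels == c]
        if len(members) == 0:
            candidates.append([])
            continue
        sim = members @ text_emb.T                    # (n_c, V) cosine similarities
        votes = sim.mean(axis=0)                      # aggregate the cluster's votes
        candidates.append(np.argsort(-votes)[:top_k].tolist())
    return labels, candidates
```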



Paperid:806
Authors:Shunran Zhang, Xiubo Zhang, Tsz Nam Chan, Shenghui Zhang, Leong Hou U
University of Macau Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, University of Macau, Shenzhen University, University of Macau, University of Macau
Abstract:
Learning-based point cloud completion has shown potential in various critical tasks, such as object detection, assignment, and registration. However, accurately and efficiently quantifying the shape error between the predicted point clouds generated by networks and the ground truth remains challenging. While EMD-based loss functions excel in shape detail and perceived density distribution, they can only yield results with significant discrepancies from the actual EMD within a tolerable training time. To address these challenges, we first propose an initial price for the auction algorithm, reducing the number of iterations required while ensuring the correctness of the assignment results. We then introduce an algorithm to compute the initial price through a successive shortest path and the Euclidean information between its nodes. Finally, we adopt a series of optimization strategies to speed up the algorithm and offer an EMD approximation scheme for point cloud problems that balances time loss and computational accuracy based on point cloud data characteristics. Our experimental results confirm that our algorithm achieves the smallest gap with the real EMD within an acceptable time range and yields the best results in end-to-end training.



Paperid:807
Authors:Taolin Zhang, Sunan He, Tao Dai, Zhi Wang, Bin Chen, Shu-Tao Xia
Tsinghua Shenzhen International Graduate School, Tsinghua University, Department of Computer Science and Engineering , Hong Kong University of Science and Technology, College of Computer Science and Software Engineering, Shenzhen University, Tsinghua Shenzhen International Graduate School, Tsinghua University, Harbin Institute of Technology, Shenzhen, Tsinghua Shenzhen International Graduate School, Tsinghua University Research Center of Artifcial Intelligence, Peng Cheng Laboratory
Abstract:
In recent years, vision-language pre-training frameworks have made significant progress in natural language processing and computer vision, achieving remarkable performance improvements on various downstream tasks. However, when extended to point cloud data, existing works mainly focus on building task-specific models and fail to extract universal 3D vision-language embeddings that generalize well. We carefully investigate three common tasks in semantic 3D scene understanding and derive key insights into the development of a pre-training model. Motivated by these observations, we propose a vision-language pre-training framework 3DVLP (3D vision-language pre-training with object contrastive learning), which transfers flexibly to 3D vision-language downstream tasks. 3DVLP takes visual grounding as the proxy task and introduces an Object-level IoU-guided Detection (OID) loss to obtain high-quality proposals in the scene. Moreover, we design an Object-level Cross-Contrastive alignment (OCC) task and an Object-level Self-Contrastive learning (OSC) task to align the objects with descriptions and distinguish different objects in the scene, respectively. Extensive experiments verify the excellent performance of 3DVLP on three 3D vision-language tasks, reflecting its superiority in semantic 3D scene understanding. Code is available at https://github.com/iridescentttt/3DVLP.



Paperid:808
Authors:Tianyi Zhang, Kishore Kasichainula, Yaoxin Zhuo, Baoxin Li, Jae-Sun Seo, Yu Cao
University of Minnesota, Arizona State University, Arizona State University, Arizona State University, Cornell Tech, University of Minnesota
Abstract:
Conventional super-resolution methods suffer from two drawbacks: substantial computational cost in upscaling an entire large image, and the introduction of extraneous or potentially detrimental information for downstream computer vision tasks during the refinement of the background. To solve these issues, we propose a novel transformer-based algorithm, Selective Super-Resolution (SSR), which partitions images into non-overlapping tiles, selects tiles of interest at various scales with a pyramid architecture, and exclusively reconstructs these selected tiles with deep features. Experimental results on three datasets demonstrate the efficiency and robust performance of our approach for super-resolution. Compared to the state-of-the-art methods, the FID score is reduced from 26.78 to 10.41 with 40% reduction in computation cost for the BDD100K dataset.



Paperid:809
Authors:Wenwen Zhang, Yun Hu, Hangguan Shan, Eryun Liu
Zhejiang University, ShanghaiTech University, Zhejiang University, Zhejiang University
Abstract:
One-shot object detection (OSOD) aims to detect all object instances of the category specified by a query image. Most existing studies in OSOD endeavor to establish effective cross-image correlation with limited query information, while ignoring the problems of model bias towards the base classes and generalization degradation on the novel classes. Observing this, we propose a novel algorithm, namely the Base-class Suppression with Prior Guidance (BSPG) network, to achieve bias-free OSOD. Specifically, the objects of base categories can be detected by a base-class predictor and eliminated by a base-class suppression module (BcS). Moreover, a prior guidance module (PG) is designed to calculate the correlation of high-level features in a non-parametric manner, producing a class-agnostic prior map with unbiased semantic information to guide the subsequent detection process. Equipped with the proposed two modules, we endow the model with a strong discriminative ability to distinguish the target objects from distractors belonging to the base classes. Extensive experiments show that our method outperforms the previous techniques by a large margin and achieves new state-of-the-art performance under various evaluation settings.



Paperid:810
Authors:Xin Zhang, Jinheng Xie, Yuan Yuan, Michael Bi Mi, Robby T. Tan
National University of Singapore, National University of Singapore, Huawei International Pte Ltd, Huawei International Pte Ltd, National University of Singapore
Abstract:
Unsupervised object discovery and localization aims to detect or segment objects in an image without any supervision. Recent efforts have demonstrated a notable potential to identify salient foreground objects by utilizing self-supervised transformer features. However, their scopes only build upon patch-level features within an image, neglecting region/image-level and cross-image relationships at a broader scale. Moreover, these methods cannot differentiate various semantics from multiple instances. To address these problems, we introduce a Hierarchical mErging framework via contrAstive grouPing (HEAP). Specifically, a novel lightweight head with a cross-attention mechanism is designed to adaptively group intra-image patches into semantically coherent regions based on correlation among self-supervised features. Further, to ensure the distinguishability among various regions, we introduce a region-level contrastive clustering loss to pull closer similar regions across images. Also, an image-level contrastive loss is introduced to push foreground and background representations apart, with which foreground objects and background are accordingly discovered. HEAP facilitates efficient hierarchical image decomposition, which contributes to more accurate object discovery while also enabling differentiation among objects of various classes. Extensive experimental results on semantic segmentation retrieval, unsupervised object discovery, and saliency detection tasks demonstrate that HEAP achieves state-of-the-art performance.
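For readers who want a concrete picture of the image-level contrastive objective mentioned above, a minimal InfoNCE-style sketch is given below. It assumes pooled foreground and background embeddings per image are already available and treats other images' foregrounds as positives; the pooling, the positive definition, and the temperature are illustrative assumptions rather than HEAP's exact formulation.

    import torch
    import torch.nn.functional as F

    def image_level_contrastive_loss(fg, bg, temperature=0.07):
        # fg, bg: (B, D) pooled foreground / background embeddings for a batch
        # of B images (hypothetical inputs produced by the grouping head).
        fg = F.normalize(fg, dim=-1)
        bg = F.normalize(bg, dim=-1)
        B = fg.size(0)                                   # assumes B > 1
        sim_ff = fg @ fg.t() / temperature               # foreground-foreground similarities
        sim_fb = fg @ bg.t() / temperature               # foreground-background similarities
        eye = torch.eye(B, dtype=torch.bool, device=fg.device)
        sim_ff = sim_ff.masked_fill(eye, float('-inf'))  # drop self-similarity
        logits = torch.cat([sim_ff, sim_fb], dim=1)      # (B, 2B)
        log_prob = F.log_softmax(logits, dim=1)
        # positives: foregrounds of the other images; negatives: all backgrounds
        pos_log_prob = log_prob[:, :B][~eye].view(B, B - 1).mean(dim=1)
        return -pos_log_prob.mean()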



Paperid:811
Authors:Xinliang Zhang, Lei Zhu, Hangzhou He, Lujia Jin, Yanye Lu
Peking University, Peking University, Peking University, Peking University, Peking University
Abstract:
Scribble-based weakly-supervised semantic segmentation using sparse scribble supervision is gaining traction as it reduces annotation costs when compared to fully annotated alternatives. Existing methods primarily generate pseudo-labels by diffusing labeled pixels to unlabeled ones with local cues for supervision. However, this diffusion process fails to exploit global semantics and class-specific cues, which are important for semantic segmentation. In this study, we propose a class-driven scribble promotion network, which utilizes both scribble annotations and pseudo-labels informed by image-level classes and global semantics for supervision. Directly adopting pseudo-labels might misguide the segmentation model, thus we design a localization rectification module to correct foreground representations in the feature space. To further combine the advantages of both supervisions, we also introduce a distance entropy loss for uncertainty reduction, which adapts per-pixel confidence weights according to the reliable region determined by the scribble and the pseudo-label's boundary. Experiments on the ScribbleSup dataset with scribble annotations of different qualities show that our method outperforms all previous methods, demonstrating its superiority and robustness. The code is available at https://github.com/Zxl19990529/Class-driven-Scribble-Promotion-Network.



Paperid:812
Authors:Xu Zhang, Hao Li, Mang Ye
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, Wuhan University
Abstract:
Cross-modal noise-robust learning is a challenging task since noisy correspondence is hard to recognize and rectify. Due to the cumulative and unavoidable negative impact of unresolved noise, existing methods cannot maintain stable performance when the noise increases. In this paper, we present a novel Negative Pre-aware Cross-modal (NPC) matching solution for large visual-language model fine-tuning on noisy downstream tasks. It is featured in two aspects: (1) For noise recognition and resistance, whereas previous methods usually directly filter out a noisy subset, we propose to estimate the negative impact of each sample. This does not need additional correction mechanisms that may predict unreliable correction results and lead to self-reinforcing error. We assign a confidence weight to each sample according to its negative impact in the training process, which adaptively adjusts the contribution of each sample to avoid noise accumulation. (2) For maintaining stable performance with increasing noise, we utilize the memorization effect of DNNs by maintaining a memory bank. Specifically, we apply a GMM to select high-confidence clean samples as the memory entries, which are then used to estimate the negative impact of each sample. Since clean samples are more easily distinguished by the GMM as noise increases, the memory bank can still maintain high quality at a high noise ratio. Compared to correction mechanisms focusing on noisy samples, memory bank-based estimation is more robust, which makes the model performance stable on noisy datasets. Extensive experiments demonstrate that our method significantly improves matching accuracy and performance stability at increasing noise ratios. Our approach also surpasses the state-of-the-art methods by a large margin. The code is available at: https://github.com/ZhangXu0963/NPC.
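A minimal sketch of the GMM-based clean-sample selection step described above, assuming per-sample training losses are available; the two-component setup and the 0.5 posterior threshold are common choices and not necessarily the exact configuration used by NPC.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def select_clean_samples(per_sample_loss, threshold=0.5):
        # Fit a 2-component GMM on per-sample losses and keep samples whose
        # posterior probability of belonging to the low-loss (clean) component
        # exceeds the threshold. per_sample_loss: (N,) array of training losses.
        losses = np.asarray(per_sample_loss, dtype=np.float64).reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=1e-4)
        gmm.fit(losses)
        clean_component = int(np.argmin(gmm.means_.ravel()))  # lower mean loss = clean
        clean_prob = gmm.predict_proba(losses)[:, clean_component]
        return np.where(clean_prob > threshold)[0], clean_prob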



Paperid:813
Authors:Xulu Zhang, Xiao-Yong Wei, Jinlin Wu, Tianyi Zhang, Zhaoxiang Zhang, Zhen Lei, Qing Li
Department of Computing, the Hong Kong Polytechnic University, Hong Kong Center for Artificial Intelligence and Robotics, HKISI, CAS, Hong Kong, College of Computer Science, Sichuan University, Chengdu, China Department of Computing, the Hong Kong Polytechnic University, Hong Kong, Center for Artificial Intelligence and Robotics, HKISI, CAS, Hong Kong State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China, Department of Computing, the Hong Kong Polytechnic University, Hong Kong, Center for Artificial Intelligence and Robotics, HKISI, CAS, Hong Kong State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China School of Artificial Intelligence, UCAS, Beijing, China, Center for Artificial Intelligence and Robotics, HKISI, CAS, Hong Kong State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China School of Artificial Intelligence, UCAS, Beijing, China, Department of Computing, the Hong Kong Polytechnic University, Hong Kong
Abstract:
Inversion methods, such as Textual Inversion, generate personalized images by incorporating concepts of interest provided by user images. However, existing methods often suffer from overfitting issues, where the dominant presence of inverted concepts leads to the absence of other desired concepts. It stems from the fact that during inversion, the irrelevant semantics in the user images are also encoded, forcing the inverted concepts to occupy locations far from the core distribution in the embedding space. To address this issue, we propose a method that guides the inversion process towards the core distribution for compositional embeddings. Additionally, we introduce a spatial regularization approach to balance the attention on the concepts being composed. Our method is designed as a post-training approach and can be seamlessly integrated with other inversion methods. Experimental results demonstrate the effectiveness of our proposed approach in mitigating the overfitting problem and generating more diverse and balanced compositions of concepts in the synthesized images. The source code is available at https://github.com/zhangxulu1996/Compositional-Inversion.



Paperid:814
Authors:Yachao Zhang, Runze Hu, Ronghui Li, Yanyun Qu, Yuan Xie, Xiu Li
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China, School of Information and Electronics, Beijing Institute of Technology, Beijing, 100081, China, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China, School of Informatics, Xiamen University, Xiamen, 361000, China, School of Computer Science and Technology, East China Normal University, Shanghai, 200062, China, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
Abstract:
Language-conditioned 3D object grounding aims to find the object within the 3D scene mentioned by natural language descriptions, which mainly depends on the matching between vision and natural language. Considerable improvement in grounding performance is achieved by improving the multimodal fusion mechanism or bridging the gap between detection and matching. However, several mismatches are ignored, i.e., the mismatch between local visual representations and global sentence representations, and the mismatch between the visual space and the corresponding label word space. In this paper, we propose cross-modal matching for 3D grounding from the perspective of mitigating these mismatches. Specifically, to match local visual features with the global description sentence, we propose a BEV (Bird’s-eye-view) based global information embedding module. It projects multiple object proposal features into the BEV, and the relations of different objects are accessed by a visual transformer which can model both positions and features with long-range dependencies. To circumvent the mismatch in feature spaces of different modalities, we propose cross-modal consistency learning. It performs cross-modal consistency constraints to convert the visual feature space into the label word feature space, resulting in easier matching. Besides, we introduce a label distillation loss and a global distillation loss to drive these matches to be learned in a distillation way. We evaluate our method in mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method.



Paperid:815
Authors:Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, Wanli Ouyang
University of Science and Technology of China CAS Key Laboratory of Electromagnetic Space Information, The University of Sydney, University of Science and Technology of China CAS Key Laboratory of Electromagnetic Space Information, The University of Sydney, The University of Sydney, Shanghai AI Laboratory, Shanghai AI Laboratory, University of Science and Technology of China CAS Key Laboratory of Electromagnetic Space Information, University of Science and Technology of China CAS Key Laboratory of Electromagnetic Space Information, Shanghai AI Laboratory
Abstract:
Generating realistic human motion from given action descriptions has experienced significant advancements because of the emerging requirement of digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of the control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, for generating consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction to ask the LLMs to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion by multimodal control signals, which we hope can shed light on this new direction. Visit our webpage at https://qiqiapink.github.io/MotionGPT/.



Paperid:816
Authors:Yi Zhang, Ce Zhang, Ke Yu, Yushun Tang, Zhihai He
Harbin Institute of Technology Southern University of Science and Technology, Carnegie Mellon University, Southern University of Science and Technology, Southern University of Science and Technology, Southern University of Science and Technology Pengcheng Laboratory
Abstract:
Contrastive Language-Image Pre-training (CLIP) model has exhibited remarkable efficacy in establishing cross-modal connections between texts and images, yielding impressive performance across a broad spectrum of downstream applications through fine-tuning. However, for generalization tasks, the current fine-tuning methods for CLIP, such as CoOp and CoCoOp, demonstrate relatively low performance on some fine-grained datasets. We recognize the underlying reason is that these previous methods only projected global features into the prompt, neglecting the various visual concepts, such as colors, shapes, and sizes, which are naturally transferable across domains and play a crucial role in generalization tasks. To address this issue, in this work, we propose Concept-Guided Prompt Learning (CPL) for vision-language models. Specifically, we leverage the well-learned knowledge of CLIP to create a visual concept cache to enable concept-guided prompting. In order to refine the text features, we further develop a projector that transforms multi-level visual features into text features. We observe that this concept-guided prompt learning approach is able to achieve enhanced consistency between visual and linguistic modalities. Extensive experimental results demonstrate that our CPL method significantly improves generalization capabilities compared to the current state-of-the-art methods.



Paperid:817
Authors:Yin Zhang, Yongqiang Zhang, Zian Zhang, Man Zhang, Rui Tian, Mingli Ding
Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology
Abstract:
Object detection in dark conditions has always been a great challenge due to the complex formation process of low-light images. Currently, the mainstream methods usually adopt domain adaptation with a Teacher-Student architecture to solve the dark object detection problem, and they imitate dark conditions by using non-learnable data augmentation strategies on the annotated source daytime images. Note that these methods neglect to model the intrinsic imaging process, i.e. image signal processing (ISP), which is important for camera sensors to generate low-light images. To solve the above problems, in this paper, we propose a novel method named ISP-Teacher for dark object detection by exploring the Teacher-Student architecture from a new perspective (i.e. self-supervised learning based ISP degradation). Specifically, we first design a day-to-night transformation module that is consistent with the ISP pipeline of camera sensors (ISP-DTM) to make the augmented images look more in line with natural low-light images captured by cameras, and the ISP-related parameters are learned in a self-supervised manner. Moreover, to avoid the conflict between the ISP degradation and detection tasks in a shared encoder, we propose a disentanglement regularization (DR) that minimizes the absolute value of cosine similarity to disentangle the two tasks and push the two gradient vectors to be as orthogonal as possible. Extensive experiments conducted on two benchmarks show the effectiveness of our method in dark object detection. In particular, ISP-Teacher achieves an improvement of +2.4% AP and +3.3% AP over the SOTA method on the BDD100k and SHIFT datasets, respectively. The code can be found at https://github.com/zhangyin1996/ISP-Teacher.
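A minimal sketch of the gradient-orthogonality idea behind the disentanglement regularization: take the gradients of the ISP-degradation loss and the detection loss with respect to the shared encoder and penalize the absolute cosine similarity between them. How the penalty is weighted and combined with the task losses is an assumption, not the paper's exact recipe.

    import torch

    def disentanglement_regularizer(loss_isp, loss_det, shared_params):
        # Gradients of the two task losses w.r.t. the shared encoder parameters
        # (assumes every listed parameter participates in both losses).
        params = [p for p in shared_params if p.requires_grad]
        g_isp = torch.autograd.grad(loss_isp, params, retain_graph=True, create_graph=True)
        g_det = torch.autograd.grad(loss_det, params, retain_graph=True, create_graph=True)
        v1 = torch.cat([g.reshape(-1) for g in g_isp])
        v2 = torch.cat([g.reshape(-1) for g in g_det])
        cos = torch.dot(v1, v2) / (v1.norm() * v2.norm() + 1e-12)
        # Minimizing |cos| pushes the two gradient directions toward orthogonality.
        return cos.abs()

In training, this value would typically be scaled and added to the task losses; create_graph=True keeps the computation differentiable so the penalty itself can be minimized.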



Paperid:818
Authors:Zhanjie Zhang, Quanwei Zhang, Wei Xing, Guangyuan Li, Lei Zhao, Jiakai Sun, Zehua Lan, Junsheng Luan, Yiling Huang, Huaizhong Lin
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
Artistic style transfer aims to repaint the content image with the learned artistic style. Existing artistic style transfer methods can be divided into two categories: small model-based approaches and pre-trained large-scale model-based approaches. Small model-based approaches can preserve the content structure, but fail to produce highly realistic stylized images and introduce artifacts and disharmonious patterns; pre-trained large-scale model-based approaches can generate highly realistic stylized images but struggle with preserving the content structure. To address the above issues, we propose ArtBank, a novel artistic style transfer framework, to generate highly realistic stylized images while preserving the content structure of the content images. Specifically, to sufficiently dig out the knowledge embedded in pre-trained large-scale models, an Implicit Style Prompt Bank (ISPB), a set of trainable parameter matrices, is designed to learn and store knowledge from the collection of artworks and behave as a visual prompt to guide pre-trained large-scale models to generate highly realistic stylized images while preserving content structure. Besides, to accelerate training the above ISPB, we propose a novel Spatial-Statistical-based self-Attention Module (SSAM). The qualitative and quantitative experiments demonstrate the superiority of our proposed method over state-of-the-art artistic style transfer methods. Code is available at https://github.com/Jamie-Cheung/ArtBank.



Paperid:819
Authors:Zhenfei Zhang, Mingyang Li, Ming-Ching Chang
University at Albany, SUNY, McGill University, University at Albany, SUNY
Abstract:
The ability to detect manipulation in multimedia data is vital in digital forensics. Existing Image Manipulation Detection (IMD) methods are mainly based on detecting anomalous features arising from image editing or double compression artifacts. All existing IMD techniques encounter challenges when it comes to detecting small tampered regions in a large image. Moreover, compression-based IMD approaches face difficulties in cases of double compression with identical quality factors. To investigate the State-of-The-Art (SoTA) IMD methods under those challenging conditions, we introduce a new Challenging Image Manipulation Detection (CIMD) benchmark dataset, which consists of two subsets for evaluating editing-based and compression-based IMD methods, respectively. The dataset images were manually captured and tampered with, and come with high-quality annotations. In addition, we propose a new two-branch network model based on HRNet that can better detect both the image-editing and compression artifacts under those challenging conditions. Extensive experiments on the CIMD benchmark show that our model significantly outperforms SoTA IMD methods on CIMD. The dataset is available at: https://github.com/ZhenfeiZ/CIMD.



Paperid:820
Authors:Zheyu Zhang, Gang Yang, Yueyi Zhang, Huanjing Yue, Aiping Liu, Yunwei Ou, Jian Gong, Xiaoyan Sun
University of Science and Technology of China, Hefei 230026, China, University of Science and Technology of China, Hefei 230026, China, University of Science and Technology of China, Hefei 230026, China Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei 230088, China, Tianjin University, Tianjin 300072, China, University of Science and Technology of China, Hefei 230026, China, Beijing Tiantan Hospital, Capital Medical University, Beijing 100050, China Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei 230088, China, Beijing Tiantan Hospital, Capital Medical University, Beijing 100050, China Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei 230088, China, University of Science and Technology of China, Hefei 230026, China Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei 230088, China
Abstract:
Numerous techniques excel in brain tumor segmentation using multimodal magnetic resonance imaging (MRI) sequences, delivering exceptional results. However, the prevalent absence of modalities in clinical scenarios hampers performance. Current approaches frequently resort to zero maps as substitutes for missing modalities, inadvertently introducing feature bias and redundant computations. To address these issues, we present the Token Merging transFormer (TMFormer) for robust brain tumor segmentation with missing modalities. TMFormer tackles these challenges by extracting and merging accessible modalities into more compact token sequences. The architecture comprises two core components: the Uni-modal Token Merging Block (UMB) and the Multi-modal Token Merging Block (MMB). The UMB enhances individual modality representation by adaptively consolidating spatially redundant tokens within and outside tumor-related regions, thereby refining token sequences for augmented representational capacity. Meanwhile, the MMB mitigates multi-modal feature fusion bias, exclusively leveraging tokens from present modalities and merging them into a unified multi-modal representation to accommodate varying modality combinations. Extensive experimental results on the BraTS 2018 and 2020 datasets demonstrate the superiority and efficacy of TMFormer compared to state-of-the-art methods when dealing with missing modalities.



Paperid:821
Authors:Zhongyi Zhang, Tianyi Wei, Wenbo Zhou, Hanqing Zhao, Weiming Zhang, Nenghai Yu
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China
Abstract:
With the flourishing of the Internet, sharing one's photos or automated processing of faces using computer vision technology has become an everyday occurrence. While enjoying the convenience, the concern for identity privacy is also emerging. Therefore, some efforts introduced the concept of ``password'' from traditional cryptography, such as RSA, into the face anonymization and deanonymization task to protect the facial identity without compromising the usability of the face image. However, these methods either suffer from the poor visual quality of the synthesis results or do not possess the full cryptographic properties, resulting in compromised security. In this paper, we present the first facial identity cryptography framework with full properties analogous to RSA. Our framework leverages the powerful generative capabilities of StyleGAN to achieve megapixel-level facial identity anonymization and deanonymization. Thanks to the great semantic decoupling of StyleGAN's latent space, the identity encryption and decryption processes are performed in latent space by a well-designed password mapper in the manner of editing the latent code. Meanwhile, the password-related information is imperceptibly hidden in the edited latent code owing to the redundant nature of the latent space. To make our cryptographic framework possess all the properties analogous to RSA, we propose three types of loss functions: single anonymization loss, sequential anonymization loss, and associated anonymization loss. Extensive experiments and ablation analyses demonstrate the superiority of our method in terms of the quality of synthesis results, identity-irrelevant attribute preservation, deanonymization accuracy, and completeness of properties analogous to RSA.



Paperid:822
Authors:Ziqiang Zhang, Yan Yan, Jing-Hao Xue, Hanzi Wang
Xiamen University, China, Xiamen University, China, University College London, UK, Xiamen University, China
Abstract:
Most existing GAN inversion methods either achieve accurate reconstruction but lack editability or offer strong editability at the cost of fidelity. Hence, how to balance the distortion-editability trade-off is a significant challenge for GAN inversion. To address this challenge, we introduce a novel spatial-contextual discrepancy information compensation-based GAN-inversion method (SDIC), which consists of a discrepancy information prediction network (DIPN) and a discrepancy information compensation network (DICN). SDIC follows a ``compensate-and-edit'' paradigm and successfully bridges the gap in image details between the original image and the reconstructed/edited image. On the one hand, DIPN encodes the multi-level spatial-contextual information of the original and initial reconstructed images and then predicts a spatial-contextual guided discrepancy map with two hourglass modules. In this way, a reliable discrepancy map that models the contextual relationship and captures fine-grained image details is learned. On the other hand, DICN incorporates the predicted discrepancy information into both the latent code and the GAN generator with different transformations, generating high-quality reconstructed/edited images. This effectively compensates for the loss of image details during GAN inversion. Both quantitative and qualitative experiments demonstrate that our proposed method achieves an excellent distortion-editability trade-off at a fast inference speed for both image inversion and editing tasks. Our code is available at https://github.com/ZzqLKED/SDIC.



Paperid:823
Authors:Ziyin Zhang, Ning Lu, Minghui Liao, Yongshuai Huang, Cheng Li, Min Wang, Wei Peng
Huawei Technologies Co., Ltd, Huawei Technologies Co., Ltd, Huawei Technologies Co., Ltd, Huawei Technologies Co., Ltd, Huawei Technologies Co., Ltd, Huawei Technologies Co., Ltd, Huawei Technologies Co., Ltd
Abstract:
Text recognition methods are developing rapidly. Some advanced techniques, e.g., powerful modules, language models, and un- and semi-supervised learning schemes, consecutively push the performance on public benchmarks forward. However, the problem of how to better optimize a text recognition model from the perspective of loss functions is largely overlooked. CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with accuracy degradation. This is because CTC loss emphasizes the optimization of the entire sequence target while neglecting to learn individual characters. We propose a self-distillation scheme for CTC-based models to address this issue. It incorporates a frame-wise regularization term in the CTC loss to emphasize individual supervision, and leverages the maximum-a-posteriori latent alignment to solve the inconsistency problem that arises in distillation between CTC-based models. We refer to the regularized CTC loss as Distillation Connectionist Temporal Classification (DCTC) loss. DCTC loss is module-free, requiring no extra parameters, longer inference lag, or additional training data or phases. Extensive experiments on public benchmarks demonstrate that DCTC can boost text recognition model accuracy by up to 2.6%, without any of these drawbacks.
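The abstract does not spell out the frame-wise regularization term, so the sketch below simply pairs the standard CTC loss with a frame-wise KL distillation term against a teacher's per-frame character distributions; the KL form and the weighting factor are assumptions meant only to illustrate how such a regularized CTC loss could be assembled.

    import torch
    import torch.nn.functional as F

    ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

    def dctc_style_loss(student_logits, teacher_logits, targets,
                        input_lengths, target_lengths, lam=0.1):
        # student_logits / teacher_logits: (T, B, C) per-frame character logits.
        log_probs = F.log_softmax(student_logits, dim=-1)
        loss_seq = ctc_loss(log_probs, targets, input_lengths, target_lengths)
        # Frame-wise regularization: match per-frame distributions to the teacher.
        with torch.no_grad():
            teacher_probs = F.softmax(teacher_logits, dim=-1)
        loss_frame = F.kl_div(log_probs, teacher_probs, reduction='batchmean')
        return loss_seq + lam * loss_frame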



Paperid:824
Authors:Boming Zhao, Luwei Yang, Mao Mao, Hujun Bao, Zhaopeng Cui
State Key Lab of CAD & CG, Zhejiang University, Simon Fraser University, State Key Lab of CAD & CG, Zhejiang University, State Key Lab of CAD & CG, Zhejiang University, State Key Lab of CAD & CG, Zhejiang University
Abstract:
Due to the ability to synthesize high-quality novel views, Neural Radiance Fields (NeRF) have recently been exploited to improve visual localization in a known environment. However, the existing methods mostly utilize NeRF for data augmentation to improve regression model training, and their performance on novel viewpoints and appearances is still limited due to the lack of geometric constraints. In this paper, we propose a novel visual localization framework, i.e., PNeRFLoc, based on a unified point-based representation. On the one hand, PNeRFLoc supports the initial pose estimation by matching 2D and 3D feature points as in traditional structure-based methods; on the other hand, it also enables pose refinement with novel view synthesis using rendering-based optimization. Specifically, we propose a novel feature adaption module to close the gaps between the features for visual localization and neural rendering. To improve the efficacy and efficiency of neural rendering-based optimization, we also develop an efficient rendering-based framework with a warping loss function. Extensive experiments demonstrate that PNeRFLoc performs the best on the synthetic dataset when the 3D NeRF model can be well learned, and significantly outperforms all the NeRF-boosted localization methods with on-par SOTA performance on the real-world benchmark localization datasets. Project webpage: https://zju3dv.github.io/PNeRFLoc/.



Paperid:825
Authors:Haimei Zhao, Qiming Zhang, Shanshan Zhao, Zhe Chen, Jing Zhang, Dacheng Tao
The University of Sydney, The University of Sydney, The University of Sydney, La Trobe University, The University of Sydney, The University of Sydney
Abstract:
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging and may lead to inferior performance. Although distilling precise 3D geometry knowledge from LiDAR data could help tackle this challenge, the benefits of LiDAR information could be greatly hindered by the significant modality gap between different sensory modalities. To address this issue, we propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy. Specifically, we devise multi-modal architectures for both teacher and student models, including a LiDAR-camera fusion-based teacher and a simulated fusion-based student. Owing to the ``identical'' architecture design, the student can mimic the teacher to generate multi-modal features with merely multi-view images as input, where a geometry compensation module is introduced to bridge the modality gap. Furthermore, we propose a comprehensive multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal fusion distillation simultaneously in the Bird's-eye-view space. Incorporating them together, our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment. Extensive experiments validate the effectiveness and superiority of SimDistill over state-of-the-art methods, achieving an improvement of 4.8% mAP and 4.1% NDS over the baseline detector. The source code will be released at https://github.com/ViTAE-Transformer/SimDistill.



Paperid:826
Authors:Hengrun Zhao, Yu Zeng, Huchuan Lu, Lijun Wang
Dalian University of Technology, Johns Hopkins University, Dalian University of Technology, Dalian University of Technology
Abstract:
The completion of large occluded human body images poses a unique challenge for general image completion methods. The complex shape variations of human bodies make it difficult to establish a consistent understanding of their structures. Furthermore, as human vision is highly sensitive to human bodies, even slight artifacts can significantly compromise image fidelity. To address these challenges, we propose a large occluded human image completion (LOHC) model based on a novel image-prior cooperative completion strategy. Our model leverages human segmentation maps as a prior, and completes the image and prior simultaneously. Compared to the widely adopted prior-then-image completion strategy for object completion, this cooperative completion process fosters more effective interaction between the prior and image information. Our model consists of two stages. The first stage is a transformer-based auto-regressive network that predicts the overall structure of the missing area by generating a coarse completed image at a lower resolution. The second stage is a convolutional network that refines the coarse image. As the coarse result may not always be accurate, we propose a Dynamic Fusion Module (DFM) to selectively fuse the useful features from the coarse image with the original input at spatial and channel levels. Through extensive experiments, we demonstrate our method’s superior performance compared to state-of-the-art methods.



Paperid:827
Authors:Junwei Zhao, Shiliang Zhang, Zhaofei Yu, Tiejun Huang
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University Institute for Artificial Intelligence, Peking University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University Institute for Artificial Intelligence, Peking University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University Institute for Artificial Intelligence, Peking University
Abstract:
Bio-inspired spike camera mimics the sampling principle of primate fovea. It presents high temporal resolution and dynamic range, showing great promise in fast-moving object recognition. However, the physical limit of CMOS technology in spike cameras still hinders their capability of recognizing ultra-high-speed moving objects, e.g., extremely fast motions cause blur during the imaging process of spike cameras. This paper presents the first theoretical analysis for the causes of spiking motion blur and proposes a robust representation that addresses this issue through temporal-spatial context learning. The proposed method leverages multi-span feature aggregation to capture temporal cues and employs residual deformable convolution to model spatial correlation among neighbouring pixels. Additionally, this paper contributes an original real-captured spiking recognition dataset consisting of 12,000 ultra-high-speed (equivalent speed > 500 km/h) moving objects. Experimental results show that the proposed method achieves 73.2% accuracy in recognizing 10 classes of ultra-high-speed moving objects, outperforming all existing spike-based recognition methods. Resources will be available at https://github.com/Evin-X/UHSR.



Paperid:828
Authors:Peizhi Zhao, Shiyi Zheng, Wenye Zhao, Dongsheng Xu, Pijian Li, Yi Cai, Qingbao Huang
School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China, School of Software Engineering, South China University of Technology Key Laboratory of Big Data and Intelligent Robot, School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China Guangxi Key Laboratory of Multimedia Communications and Network Technology
Abstract:
As a fundamental and challenging task in the vision and language domain, Referring Expression Comprehension (REC) has shown impressive improvements recently. However, for a complex task that couples the comprehension of abstract concepts and the localization of concrete instances, one-stage approaches are bottlenecked by computing and data resources. To obtain a low-cost solution, the prevailing two-stage approaches decouple REC into localization (region proposal) and comprehension (region-expression matching) at the region level, but solutions based on isolated regions cannot sufficiently utilize the context and are usually limited by the quality of proposals. Therefore, it is necessary to rebuild an efficient two-stage solution system. In this paper, we propose a point-based two-stage framework for REC, in which the two stages are redefined as point-based cross-modal comprehension and point-based instance localization. Specifically, we reconstruct the raw bounding box and segmentation mask into center and mass scores as soft ground-truth for measuring point-level cross-modal correlations. With the soft ground-truth, REC can be approximated as a binary classification problem, which fundamentally avoids the impact of isolated regions on the optimization process. Remarkably, the consistent metrics between center and mass scores allow our system to directly optimize grounding and segmentation by utilizing the same architecture. Experiments on multiple benchmarks show the feasibility and potential of our point-based paradigm. Our code is available at https://github.com/VILAN-Lab/PBREC-MT.



Paperid:829
Authors:Rui Zhao, Ruiqin Xiong, Jian Zhang, Xinfeng Zhang, Zhaofei Yu, Tiejun Huang
Peking University, Peking University, Peking University, University of Chinese Academy of Sciences, Peking University, Peking University
Abstract:
As an emerging neuromorphic camera with an asynchronous working mechanism, spike camera shows good potential for high-speed vision tasks. Each pixel in spike camera accumulates photons persistently and fires a spike whenever the accumulation exceeds a threshold. Such high-frequency fine-granularity photon recording facilitates the analysis and recovery of dynamic scenes with high-speed motion. This paper considers the optical flow estimation problem for spike cameras. Due to the Poisson nature of incoming photons, the occurrence of spikes is random and fluctuating, making conventional image matching inefficient. We propose a Hierarchical Spatial-Temporal (HiST) fusion module for spike representation to pursue reliable feature matching and develop a robust optical flow network, dubbed as HiST-SFlow. The HiST extracts features at multiple moments and hierarchically fuses the spatial-temporal information. We also propose an intra-moment filtering module to further extract the feature and suppress the influence of randomness in spikes. A scene loss is proposed to ensure that this hierarchical representation recovers the essential visual information in the scene. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared with the existing methods. The source codes are available at https://github.com/ruizhao26/HiST-SFlow.



Paperid:830
Authors:Ruisi Zhao, Mingming Li, Zheng Yang, Binbin Lin, Xiaohui Zhong, Xiaobo Ren, Deng Cai, Boxi Wu
State Key Lab of CAD&CG, Zhejiang University FABU Inc, State Key Lab of CAD&CG, Zhejiang University, FABU Inc, School of Software Technology, Zhejiang University, Ningbo Zhoushan Port Group Co.,Ltd., Ningbo, China, Ningbo Zhoushan Port Group Co.,Ltd., Ningbo, China, State Key Lab of CAD&CG, Zhejiang University FABU Inc, School of Software Technology, Zhejiang University
Abstract:
Human body orientation estimation (HBOE) aims to estimate the orientation of a human body relative to the camera’s frontal view. Despite recent advancements in this field, there still exist limitations in achieving fine-grained results. We identify certain defects and propose corresponding approaches as follows: 1) Existing datasets suffer from non-uniform angle distributions, resulting in sparse image data for certain angles. To provide comprehensive and high-quality data, we introduce RMOS (Rendered Model Orientation Set), a rendered dataset comprising 150K accurately labeled human instances with a wide range of orientations. 2) Directly using one-hot vectors as labels may overlook the similarity between angle labels, leading to poor supervision, and converting the predictions from radians to degrees enlarges the regression error. To enhance supervision, we employ Laplace smoothing to vectorize the label, which carries more information. For fine-grained predictions, we adopt a weighted Smooth-L1 loss to align predictions with the smoothed label, thus providing robust supervision. 3) Previous works ignore body-part-specific information, resulting in coarse predictions. By employing local-window self-attention, our model can utilize different body part information for more precise orientation estimations. We validate the effectiveness of our method on the benchmarks with extensive experiments and show that our method outperforms the state-of-the-art. Project is available at: https://github.com/Whalesong-zrs/Towards-Fine-grained-HBOE.
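A minimal sketch of the Laplace-smoothed orientation label and the weighted Smooth-L1 supervision described in point 2); the number of angle bins, the Laplace scale, and the circular wrap-around handling are illustrative assumptions, not the paper's exact settings.

    import torch
    import torch.nn.functional as F

    def laplace_smoothed_label(gt_deg, num_bins=72, scale=4.0):
        # Turn a ground-truth orientation (in degrees) into a soft label over
        # angle bins using a Laplace kernel, wrapping around the circle.
        bin_centers = torch.arange(num_bins) * (360.0 / num_bins)
        diff = (bin_centers - gt_deg).abs()
        diff = torch.minimum(diff, 360.0 - diff)      # circular distance
        label = torch.exp(-diff / scale)
        return label / label.sum()

    def weighted_smooth_l1(pred, soft_label):
        # Per-bin Smooth-L1 between predicted bin scores and the smoothed label,
        # weighted by the label so bins near the ground truth dominate the loss.
        per_bin = F.smooth_l1_loss(pred, soft_label, reduction='none')
        return (soft_label * per_bin).sum()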



Paperid:831
Authors:Tianyang Zhao, Kunwar Yashraj Singh, Srikar Appalaraju, Peng Tang, Vijay Mahadevan, R. Manmatha, Ying Nian Wu
AWS AI Labs University of California, Los Angeles, AWS AI Labs, AWS AI Labs, AWS AI Labs, AWS AI Labs, AWS AI Labs, AWS AI Labs University of California, Los Angeles
Abstract:
Knowledge distillation aims at reducing model size without compromising much performance. Recent work has applied it to large vision-language (VL) Transformers, and has shown that attention maps in the multi-head attention modules of vision-language Transformers contain extensive intra-modal and cross-modal co-reference relations to be distilled. The standard approach is to apply a one-to-one attention map distillation loss, i.e. the Teacher's first attention head instructs the Student's first head, the second teaches the second, and so forth, but this only works when the numbers of attention heads in the Teacher and Student are the same. To remove this constraint, we propose a new Attention Map Alignment Distillation (AMAD) method for Transformers with multi-head attention, which works for a Teacher and a Student with different numbers of attention heads. Specifically, we soft-align different heads in Teacher and Student attention maps using a cosine similarity weighting. The Teacher head contributes more to the Student heads for which it has a higher similarity weight. Each Teacher head contributes to all the Student heads by minimizing the divergence between the attention activation distributions for the soft-aligned heads. No head is left behind. This distillation approach operates like cross-attention. We experiment on distilling VL-T5 and BLIP, and apply AMAD loss on their T5, BERT, and ViT sub-modules. We show, under vision-language setting, that AMAD outperforms conventional distillation methods on VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned by ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training for BLIP models.
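A minimal sketch of the soft head alignment described above: cosine-similarity weights between flattened teacher and student attention maps, normalized over student heads, scale a per-head-pair KL divergence so that every teacher head contributes to every student head. Tensor shapes and the weight normalization are assumptions for illustration, not the exact AMAD implementation.

    import torch
    import torch.nn.functional as F

    def amad_style_loss(teacher_attn, student_attn):
        # teacher_attn: (Ht, L, L), student_attn: (Hs, L, L); each row of the
        # last two dimensions is an attention distribution over L tokens.
        Ht = teacher_attn.shape[0]
        Hs = student_attn.shape[0]
        # Cosine-similarity weights between flattened maps, normalized over
        # student heads (cross-attention-like soft alignment of heads).
        t_flat = F.normalize(teacher_attn.reshape(Ht, -1), dim=-1)
        s_flat = F.normalize(student_attn.reshape(Hs, -1), dim=-1)
        weights = F.softmax(t_flat @ s_flat.t(), dim=-1)         # (Ht, Hs)
        # KL(teacher_i || student_j) between per-row attention distributions.
        t = teacher_attn.unsqueeze(1).clamp_min(1e-8)            # (Ht, 1, L, L)
        s = student_attn.unsqueeze(0).clamp_min(1e-8)            # (1, Hs, L, L)
        kl = (t * (t.log() - s.log())).sum(-1).mean(-1)          # (Ht, Hs)
        return (weights * kl).sum() / Ht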



Paperid:832
Authors:Xinqiao Zhao, Feilong Tang, Xiaoyang Wang, Jimin Xiao
Xi'an Jiaotong-Liverpool University, Xi'an Jiaotong-Liverpool University, Xi’an Jiaotong-Liverpool University Metavisioncn, Xi'an Jiaotong-Liverpool University
Abstract:
Image-level weakly supervised semantic segmentation has received increasing attention due to its low annotation cost. Existing methods mainly rely on Class Activation Mapping (CAM) to obtain pseudo-labels for training semantic segmentation models. In this work, we are the first to demonstrate that long-tailed distribution in training data can cause the CAM calculated through classifier weights to be over-activated for head classes and under-activated for tail classes, due to the shared features among head and tail classes. This degrades pseudo-label quality and further influences final semantic segmentation performance. To address this issue, we propose a Shared Feature Calibration (SFC) method for CAM generation. Specifically, we leverage the class prototypes which carry positive shared features and propose a Multi-Scaled Distribution-Weighted (MSDW) consistency loss for narrowing the gap between the CAMs generated through classifier weights and class prototypes during training. The MSDW loss counterbalances over-activation and under-activation by calibrating the shared features in head-/tail-class classifier weights. Experimental results show that our SFC significantly improves CAM boundaries and achieves new state-of-the-art performance. The project is available at https://github.com/Barrett-python/SFC.



Paperid:833
Authors:Zhiwei Zhao, Bin Liu, Yan Lu, Qi Chu, Nenghai Yu
School of Cyber Science and Technology, University of Science and Technology of China CAS Key Laboratory of Electromagnetic Space Information, School of Cyber Science and Technology, University of Science and Technology of China CAS Key Laboratory of Electromagnetic Space Information, Shanghai AI Laboratory, School of Cyber Science and Technology, University of Science and Technology of China CAS Key Laboratory of Electromagnetic Space Information, School of Cyber Science and Technology, University of Science and Technology of China CAS Key Laboratory of Electromagnetic Space Information
Abstract:
Text-to-Image person re-identification (TI-ReID) aims to retrieve the images of the target identity according to the given textual description. The existing methods in TI-ReID focus on aligning the visual and textual modalities through contrastive feature alignment or reconstructive masked language modeling (MLM). However, these methods parameterize image/text instances as deterministic embeddings and do not explicitly consider the inherent uncertainty in pedestrian images and their textual descriptions, leading to limited image-text relationship expression and semantic alignment. To address the above problem, in this paper, we propose a novel method that unifies multi-modal uncertainty modeling and semantic alignment for TI-ReID. Specifically, we model the image and textual feature vectors of pedestrians as Gaussian distributions, where the multi-granularity uncertainty of the distribution is estimated by incorporating batch-level and identity-level feature variances for each modality. The multi-modal uncertainty modeling acts as a feature augmentation and provides a richer image-text semantic relationship. Then we present a bi-directional cross-modal circle loss to more effectively align the probabilistic features between image and text in a self-paced manner. To further promote more comprehensive image-text semantic alignment, we design a task that complements the masked language modeling, focusing on the cross-modality semantic recovery of the global masked token after cross-modal interaction. Extensive experiments conducted on three TI-ReID datasets highlight the effectiveness and superiority of our method over the state-of-the-art.



Paperid:834
Authors:Zihao Zhao, Sheng Wang, Qian Wang, Dinggang Shen
School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China, School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai United Imaging Intelligence Co., Ltd., Shanghai, China, School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China Shanghai Clinical Research and Trial Center, Shanghai, China, School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China Shanghai United Imaging Intelligence Co., Ltd., Shanghai, China Shanghai Clinical Research and Trial Center, Shanghai, China
Abstract:
Obtaining large-scale radiology reports can be difficult for medical images due to ethical concerns, limiting the effectiveness of contrastive pre-training in the medical image domain and underscoring the need for alternative methods. In this paper, we propose eye-tracking as an alternative to text reports, as it allows for the passive collection of gaze signals without ethical issues. By tracking the gaze of radiologists as they read and diagnose medical images, we can understand their visual attention and clinical reasoning. When a radiologist has similar gazes for two medical images, it may indicate semantic similarity for diagnosis, and these images should be treated as positive pairs when pre-training a computer-assisted diagnosis (CAD) network through contrastive learning. Accordingly, we introduce the Medical contrastive Gaze Image Pre-training (McGIP) as a plug-and-play module for contrastive learning frameworks. McGIP uses radiologist gaze to guide contrastive pre-training. We evaluate our method using two representative types of medical images and two common types of gaze data. The experimental results demonstrate the practicality of McGIP, indicating its high potential for various clinical scenarios and applications.
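A minimal sketch of how gaze similarity could define positive pairs for contrastive pre-training as described; representing gaze as a density heatmap, using cosine similarity, and the 0.8 threshold are all illustrative assumptions rather than McGIP's exact construction.

    import torch
    import torch.nn.functional as F

    def gaze_positive_mask(gaze_heatmaps, threshold=0.8):
        # gaze_heatmaps: (N, H, W) gaze density maps, one per image. Images
        # whose flattened, normalized heatmaps have cosine similarity above the
        # threshold are treated as positive pairs.
        flat = F.normalize(gaze_heatmaps.reshape(gaze_heatmaps.size(0), -1), dim=-1)
        mask = (flat @ flat.t()) > threshold
        mask.fill_diagonal_(False)                   # exclude self-pairs
        return mask                                  # (N, N) boolean mask

    def gaze_guided_contrastive_loss(features, positive_mask, temperature=0.1):
        # Supervised-contrastive-style loss where positives come from gaze similarity.
        z = F.normalize(features, dim=-1)
        eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
        logits = (z @ z.t() / temperature).masked_fill(eye, float('-inf'))
        log_prob = F.log_softmax(logits, dim=1)
        pos_counts = positive_mask.sum(1).clamp_min(1)
        loss = -log_prob.masked_fill(~positive_mask, 0.0).sum(1) / pos_counts
        return loss[positive_mask.any(1)].mean()     # anchors with at least one positive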



Paperid:835
Authors:Bolun Zheng, Haoran Li, Quan Chen, Tingyu Wang, Xiaofei Zhou, Zhenghui Hu, Chenggang Yan
Hangzhou Dianzi University, Hangzhou Dianzi University, Hangzhou Dianzi University, Hangzhou Dianzi University Lishui Institute of Hangzhou Dianzi University, Hangzhou Dianzi University, Hangzhou Innovation Institute, Beihang University, Hangzhou Dianzi University
Abstract:
The recent imaging technology Quad Bayer CFA brings better imaging PSNR and higher visual quality compared to the traditional Bayer CFA, but also poses serious challenges for demosaicing and denoising in the ISP pipeline. In this paper, we propose a novel dual-encoder network, namely DRNet, to achieve joint demosaicing and denoising for Quad Bayer CFA. The dual encoders are carefully designed: one is mainly constructed with a joint residual block to estimate the residuals for demosaicing and denoising separately, while the other starts with a pixel modulation block specially designed to match the characteristics of the Quad Bayer pattern for better feature extraction. We demonstrate the effectiveness of each proposed component through detailed ablation investigations. The comparison results on public benchmarks illustrate that our DRNet achieves a clear performance gain (0.38 dB over the second-best) compared with state-of-the-art methods and balances performance and efficiency well. The experiments on real-world images show that the proposed method can enhance the reconstruction quality over the native ISP algorithm.



Paperid:836
Authors:Huiming Zheng, Wei Gao
School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China Peng Cheng Laboratory, China, School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China Peng Cheng Laboratory, China
Abstract:
As a kind of 3D data, RGB-D images have been extensively used in object tracking, 3D reconstruction, remote sensing mapping, and other tasks. In the realm of computer vision, the significance of RGB-D images is progressively growing. However, the existing learning-based image compression methods usually process RGB images and depth images separately, which cannot fully exploit the redundant information between the modalities, limiting further improvement of the rate-distortion performance. To overcome this defect, in this paper, we propose a learning-based dual-branch RGB-D image compression framework. Compared with the traditional RGB-domain compression scheme, a YUV-domain compression scheme is presented for spatial redundancy removal. In addition, Intra-Modality Attention (IMA) and Cross-Modality Attention (CMA) are introduced for modal redundancy removal. To benefit from cross-modal prior information, a Context Prediction Module (CPM) and a Context Fusion Module (CFM) are introduced in the conditional entropy model, which makes the context probability prediction more accurate. The experimental results demonstrate that our method outperforms existing image compression methods on two RGB-D image datasets. Compared with BPG, our proposed framework can achieve up to 15% bit rate saving for RGB images.



Paperid:837
Authors:Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu
Northwestern Polytechnical University, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Northwestern Polytechnical University, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab
Abstract:
Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue primarily stems from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is unfeasible, as it would require an immense number of text-image pairs and entail substantial computational expenses. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed HD images of any size, while minimizing the need for high-memory GPU resources. Specifically, the initial stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the subsequent stage. This method allows for the rapid enlargement of the ASD output to any high-resolution size, avoiding seaming artifacts or memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2× compared to the traditional tiled algorithm. The source code is available at https://github.com/ProAirVerse/Any-Size-Diffusion.
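As a loose, hypothetical illustration of the seamless-tiling idea behind FSTD (not the authors' implementation), the sketch below splits an input into overlapping tiles, applies an arbitrary per-tile function, and averages the overlaps so that no seams remain; the tile size, overlap, and `fn` callable are assumptions.

```python
# Sketch: overlapping-tile processing with overlap averaging to avoid visible seams.
import torch

def tiled_apply(image: torch.Tensor, fn, tile: int = 256, overlap: int = 32) -> torch.Tensor:
    """image: (C, H, W); fn maps a (C, h, w) tile to a same-sized tile."""
    C, H, W = image.shape
    out = torch.zeros_like(image)
    weight = torch.zeros_like(image[:1])            # per-pixel overlap count
    stride = tile - overlap
    ys = list(range(0, max(H - tile, 0) + 1, stride))
    xs = list(range(0, max(W - tile, 0) + 1, stride))
    if ys[-1] + tile < H:
        ys.append(H - tile)
    if xs[-1] + tile < W:
        xs.append(W - tile)
    for y in ys:
        for x in xs:
            patch = image[:, y:y + tile, x:x + tile]
            out[:, y:y + tile, x:x + tile] += fn(patch)
            weight[:, y:y + tile, x:x + tile] += 1.0
    return out / weight.clamp(min=1.0)

# Usage (hypothetical): hi_res = tiled_apply(upsampled_latent, one_denoising_step)
```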



Paperid:838
Authors:Yaolin Zheng, Hongbo Huang, Xiuying Wang, Xiaoxu Yan, Longfei Xu
Beijing Information Science and Technology University, Beijing Information Science and Technology University, Beijing Information Science and Technology University, Beijing Information Science and Technology University, Beijing Information Science and Technology University
Abstract:
Graph Convolutional Networks (GCNs) and Transformers have been widely applied to skeleton-based human action recognition, with each offering unique advantages in capturing spatial relationships and long-range dependencies. However, for most GCN methods, the construction of topological structures relies solely on the spatial information of human joints, limiting their ability to directly capture richer spatio-temporal dependencies. Additionally, the self-attention modules of many Transformer methods lack topological structure information, restricting the robustness and generalization of the models. To address these issues, we propose a Joint Trajectory Graph (JTG) that integrates spatio-temporal information into a uniform graph structure. We also present a Joint Trajectory GraphFormer (JT-GraphFormer), which directly captures the spatio-temporal relationships among all joint trajectories for human action recognition. To better integrate topological information into spatio-temporal relationships, we introduce a Spatio-Temporal Dijkstra Attention (STDA) mechanism to calculate relationship scores for all the joints in JTG. Furthermore, we incorporate the Koopman operator into the classification stage to enhance the model's representation ability and classification performance. Experiments demonstrate that JT-GraphFormer achieves outstanding performance in human action recognition tasks, outperforming state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and N-UCLA datasets.



Paperid:839
Authors:Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, Xianxian Li
Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University, Harbin Institute of Technology, Key Laboratory of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University
Abstract:
Online contextual reasoning and association across consecutive video frames are critical to perceive instances in visual tracking. However, most current top-performing trackers persistently lean on sparse temporal relationships between reference and search frames via an offline mode. Consequently, they can only interact independently within each image-pair and establish limited temporal correlations. To alleviate the above problem, we propose a simple, flexible and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames in an online token propagation manner. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discrimination features (localization information) of a target into a token sequence to achieve frame-to-frame association. This new solution brings the following benefits: 1) the purified token sequences can serve as prompts for the inference in the next video frame, whereby past information is leveraged to guide future inference; 2) the complex online update strategies are effectively avoided by the iterative propagation of token sequences, and thus we can achieve more efficient model representation and computation. ODTrack achieves a new SOTA performance on seven benchmarks, while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack.



Paperid:840
Authors:Zewen Zheng, Xuemin Zhang, Yongqiang Mou, Xiang Gao, Chengxin Li, Guoheng Huang, Chi-Man Pun, Xiaochen Yuan
X Lab, GAC R&D CENTER, Guangdong, China Guangdong University of Technology, Guangdong, China, X Lab, GAC R&D CENTER, Guangdong, China, X Lab, GAC R&D CENTER, Guangdong, China, X Lab, GAC R&D CENTER, Guangdong, China Guangdong University of Technology, Guangdong, China, X Lab, GAC R&D CENTER, Guangdong, China South China Normal University, China, Guangdong, China, Guangdong University of Technology, Guangdong, China, University of Macau, Macau, China, Macao Polytechnic University, Macao, China
Abstract:
Monocular 3D lane detection is essential for a reliable autonomous driving system and has recently been rapidly developing. Existing popular methods mainly employ a predefined 3D anchor for lane detection based on front-viewed (FV) space, aiming to mitigate the effects of view transformations. However, the perspective geometric distortion between FV and 3D space in this FV-based approach introduces extremely dense anchor designs, which ultimately leads to confusing lane representations. In this paper, we introduce a novel prior-guided perspective on lane detection and propose an end-to-end framework named PVALane, which utilizes 2D prior knowledge to achieve precise and efficient 3D lane detection. Since 2D lane predictions can provide strong priors for lane existence, PVALane exploits FV features to generate sparse prior anchors with potential lanes in 2D space. These dynamic prior anchors help PVALane to achieve distinct lane representations and effectively improve the precision of PVALane due to the reduced lane search space. Additionally, by leveraging these prior anchors and representing lanes in both FV and bird's-eye-view (BEV) spaces, we effectively align and merge semantic and geometric information from FV and BEV features. Extensive experiments conducted on the OpenLane and ONCE-3DLanes datasets demonstrate the superior performance of our method compared to existing state-of-the-art approaches and exhibit excellent robustness.



Paperid:841
Authors:Wenqi Zhong, Linzhi Yu, Chen Xia, Junwei Han, Dingwen Zhang
Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University
Abstract:
Saccadic scanpath, a data representation of human visual behavior, has received broad interest in multiple domains. Scanpath is a complex eye-tracking data modality that includes the sequences of fixation positions and fixation duration, coupled with image information. However, previous methods usually face the spatial misalignment problem of fixation features and loss of critical temporal data (including temporal correlation and fixation duration). In this study, we propose a Transformer-based scanpath model, SpFormer, to alleviate these problems. First, we propose a fixation-centric paradigm to extract aligned spatial fixation features and tokenize the scanpaths. Then, according to the visual working memory mechanism, we design a local meta attention to reduce the semantic redundancy of fixations and guide the model to focus on the meta scanpath. Finally, we progressively integrate the duration information and fuse it with the fixation features to resolve location ambiguity as the Transformer blocks go deeper. We conduct extensive experiments on four databases under three tasks. SpFormer establishes new state-of-the-art results in distinct settings, verifying its flexibility and versatility in practical applications. The code can be obtained from https://github.com/wenqizhong/SpFormer.



Paperid:842
Authors:Yicheng Zhong, Huawei Wei, Peiji Yang, Zhisheng Wang
Tencent Technology (Shenzhen) Company Limited, Tencent Technology (Shenzhen) Company Limited, Tencent Technology (Shenzhen) Company Limited, Tencent Technology (Shenzhen) Company Limited
Abstract:
The objective of stylized speech-driven facial animation is to create animations that encapsulate specific emotional expressions. Existing methods often depend on pre-established emotional labels or facial expression templates, which may limit the flexibility necessary for accurately conveying user intent. In this research, we introduce a technique that enables the control of arbitrary styles by leveraging natural language as emotion prompts. This technique offers benefits in terms of both flexibility and user-friendliness. To realize this objective, we initially construct a Text-Expression Alignment Dataset (TEAD), wherein each facial expression is paired with several prompt-like descriptions. We propose an innovative automatic annotation method, supported by ChatGPT, to expedite the dataset construction, thereby eliminating the substantial expense of manual annotation. Following this, we utilize TEAD to train a CLIP-based model, termed ExpCLIP, which encodes text and facial expressions into semantically aligned style embeddings. The embeddings are subsequently integrated into the facial animation generator to yield expressive and controllable facial animations. Given the limited diversity of facial emotions in existing speech-driven facial animation training data, we further introduce an effective Expression Prompt Augmentation (EPA) mechanism to enable the animation generator to support unprecedented richness in style control. Comprehensive experiments illustrate that our method accomplishes expressive facial animation generation and offers enhanced flexibility in effectively conveying the desired style.



Paperid:843
Authors:Yunshan Zhong, Yuyao Zhou, Yuxin Zhang, Fei Chao, Rongrong Ji
Institute of Artificial Intelligence, Xiamen University Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University Department of Artificial Intelligence, School of Informatics, Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University Department of Artificial Intelligence, School of Informatics, Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University Department of Artificial Intelligence, School of Informatics, Xiamen University, Institute of Artificial Intelligence, Xiamen University Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University Department of Artificial Intelligence, School of Informatics, Xiamen University Peng Cheng Laboratory
Abstract:
This paper focuses on addressing the issue of image demoiréing. Unlike the large volume of existing studies that rely on learning from paired real data, we attempt to learn a demoiréing model from unpaired real data, i.e., moiré images associated with irrelevant clean images. The proposed method, referred to as Unpaired Demoiréing (UnDeM), synthesizes pseudo moiré images from unpaired datasets, generating pairs with clean images for training demoiréing models. To achieve this, we divide real moiré images into patches and group them in compliance with their moiré complexity. We introduce a novel moiré generation framework to synthesize moiré images with diverse moiré features, resembling real moiré patches, and details akin to real moiré-free images. Additionally, we introduce an adaptive denoise method to eliminate the low-quality pseudo moiré images that adversely impact the learning of demoiréing models. We conduct extensive experiments on the commonly used FHDMi and UHDM datasets. The results show that our UnDeM performs better than existing methods when using existing demoiréing models such as MBCNN and ESDNet-L. Code: https://github.com/zysxmu/UnDeM.



Paperid:844
Authors:Feng Zhou, Jianqin Yin, Peiyang Li
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
The "lifting from 2D pose" method has been the dominant approach to 3D Human Pose Estimation (3DHPE) due to the powerful visual analysis ability of 2D pose estimators. Widely known, there exists a depth ambiguity problem when estimating solely from 2D pose, where one 2D pose can be mapped to multiple 3D poses. Intuitively, the rich semantic and texture information in images can contribute to a more accurate "lifting" procedure. Yet, existing research encounters two primary challenges. Firstly, the distribution of image data in 3D motion capture datasets is too narrow because of the laboratorial environment, which leads to poor generalization ability of methods trained with image information. Secondly, effective strategies for leveraging image information are lacking. In this paper, we give new insight into the cause of poor generalization problems and the effectiveness of image features. Based on that, we propose an advanced framework. Specifically, the framework consists of two stages. First, we enable the keypoints to query and select the beneficial features from all image patches. To reduce the keypoints attention to inconsequential background features, we design a novel Poseguided Transformer Layer, which adaptively limits the updates to unimportant image patches. Then, through a designed Adaptive Feature Selection Module, we prune less significant image patches from the feature map. In the second stage, we allow the keypoints to further emphasize the retained critical image features. This progressive learning approach prevents further training on insignificant image features. Experimental results show that our model achieves state-of-the-art performance on both the Human3.6M dataset and the MPI-INF-3DHP dataset.



Paperid:845
Authors:Gengze Zhou, Yicong Hong, Qi Wu
University of Adelaide, Australian National University, University of Adelaide
Abstract:
Trained with an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling. Such a trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status, and makes the decision to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from observations and actions along a path, as well as drawing accurate top-down metric trajectories given the agent's navigation history. Although the performance of NavGPT on zero-shot R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to use as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models. Code is available at: https://github.com/GengzeZhou/NavGPT.



Paperid:846
Authors:Jiaying Zhou, Yang Liu, Qingchao Chen
National Institute of Health Data Science, Peking University, Beijing, China Institute of Medical Technology, Peking University Health Science Center, Beijing, China, Wangxuan Institute of Computer Technology, Peking University, Beijing, China, National Institute of Health Data Science, Peking University, Beijing, China Institute of Medical Technology, Peking University Health Science Center, Beijing, China National Key Laboratory of General Artificial Intelligence, Beijing, China
Abstract:
Novel class discovery (NCD) aims to identify new classes undefined during the model training phase with the help of knowledge of known classes. Many methods have been proposed and have notably boosted the performance of NCD on natural images. However, there has been no work on discovering new classes based on medical images and disease categories, which is crucial for understanding and diagnosing specific diseases. Moreover, most of the existing methods only utilize information from the image modality and use labels as the only supervisory information. In this paper, we propose a multi-modal novel class discovery method based on paired images and text, inspired by the low classification accuracy of chest X-ray images and the relatively higher accuracy of the paired text. Specifically, we first pretrain the image encoder and text encoder with multi-modal contrastive learning on the entire dataset, and then we generate pseudo-labels separately on the image branch and text branch. We utilize intra-modal consistency to assess the quality of the pseudo-labels and adjust the weights of the pseudo-labels from both branches to generate the ultimate pseudo-labels for training. Experiments on eight subset splits of the MIMIC-CXR-JPG dataset show that our method improves the clustering performance on unlabeled classes by about 10% on average compared to state-of-the-art methods. Code is available at: https://github.com/zzzzzzzzjy/MMNCD-main.
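A minimal sketch of the pseudo-label fusion step, assuming per-sample consistency scores in [0, 1] and hypothetical tensor names; the paper's exact weighting scheme may differ.

```python
# Sketch: fuse image-branch and text-branch pseudo-labels, weighted by intra-modal consistency.
import torch
import torch.nn.functional as F

def fuse_pseudo_labels(img_logits, txt_logits, img_consistency, txt_consistency):
    """logits: (B, K); consistency: (B,) in [0, 1]; returns hard pseudo-labels (B,)."""
    w = torch.stack([img_consistency, txt_consistency], dim=1)   # (B, 2)
    w = w / w.sum(dim=1, keepdim=True).clamp(min=1e-8)
    probs = (w[:, 0:1] * F.softmax(img_logits, dim=1)
             + w[:, 1:2] * F.softmax(txt_logits, dim=1))
    return probs.argmax(dim=1)
```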



Paperid:847
Authors:Jingchun Zhou, Zongxin He, Kin-Man Lam, Yudong Wang, Weishi Zhang, Chunle Guo, Chongyi Li
Dalian Maritime University, Huizhou University, The Hong Kong Polytechnic University, Tianjin University, Dalian Maritime University, Nankai University, Nankai University
Abstract:
In this paper, we present a novel Amplitude-Modulated Stochastic Perturbation and Vortex Convolutional Network, AMSP-UOD, designed for underwater object detection. AMSP-UOD specifically addresses the impact of non-ideal imaging factors on detection accuracy in complex underwater environments. To mitigate the influence of noise on object detection performance, we propose AMSP Vortex Convolution (AMSP-VConv) to disrupt the noise distribution, enhance feature extraction capabilities, effectively reduce parameters, and improve network robustness. We design the Feature Association Decoupling Cross Stage Partial (FAD-CSP) module, which strengthens the association of long- and short-range features, improving the network performance in complex underwater environments. Additionally, our sophisticated post-processing method, based on non-maximum suppression with aspect-ratio similarity thresholds, optimizes detection in dense scenes, such as waterweed and schools of fish, improving object detection accuracy. Extensive experiments on the URPC and RUOD datasets demonstrate that our method outperforms existing state-of-the-art methods in terms of accuracy and noise immunity. AMSP-UOD proposes an innovative solution with the potential for real-world applications. Our code is available at https://github.com/zhoujingchun03/AMSP-UOD.



Paperid:848
Authors:Qiu Zhou, Jinming Cao, Hanchao Leng, Yifang Yin, Yu Kun, Roger Zimmermann
Independent Researcher, National University of Singapore, Xiaomi Car, Institute for Infocomm Research (I2R), A*STAR, Singapore, Xiaomi Car, National University of Singapore
Abstract:
In the field of autonomous driving, accurate and comprehensive perception of the 3D environment is crucial. Bird's Eye View (BEV) based methods have emerged as a promising solution for 3D object detection using multi-view images as input. However, existing 3D object detection methods often ignore the physical context in the environment, such as sidewalks and vegetation, resulting in sub-optimal performance. In this paper, we propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection) that leverages a 3D semantic-occupancy branch to improve the accuracy of 3D object detection. In particular, the physical context modeled by semantic occupancy helps the detector perceive the scenes in a more holistic view. Our SOGDet is flexible to use and can be seamlessly integrated with most existing BEV-based methods. To evaluate its effectiveness, we apply this approach to several state-of-the-art baselines and conduct extensive experiments on the exclusive nuScenes dataset. Our results show that SOGDet consistently enhances the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP). This indicates that the combination of 3D object detection and 3D semantic occupancy leads to a more comprehensive perception of the 3D environment, thereby helping to build more robust autonomous driving systems. The codes are available at: https://github.com/zhouqiu/SOGDet.



Paperid:849
Authors:Shenglong Zhou, Zhiwei Xiong, Feng Wu
University of Science and Technology of China, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract:
Image registration plays a crucial role in histological image analysis, encompassing tasks like multi-modality fusion and disease grading. Traditional registration methods optimize objective functions for each image pair, yielding reliable accuracy but imposing heavy inference burdens. Recently, learning-based registration methods utilize networks to learn the optimization process during training and apply a one-step forward process during testing. While these methods offer promising registration performance with reduced inference time, they remain sensitive to appearance variances and local structure changes commonly encountered in histological image registration scenarios. In this paper, for the first time, we propose a novel test-time adaptation method for histological image registration, aiming to improve the generalization ability of learning-based methods. Specifically, we design two operations, style guidance and shape guidance, for the test-time adaptation process. The former leverages style representations encoded by feature statistics to address the issue of appearance variances, while the latter incorporates shape representations encoded by HOG features to improve registration accuracy in regions with structural changes. Furthermore, we consider the continuity of the model during the test-time adaptation process. Different from the previous methods initialized by a given trained model, we introduce a smoothing strategy to leverage historical models for better generalization. We conduct experiments with several representative learning-based backbones on the public histological dataset, demonstrating the superior registration performance of our test-time adaptation method.



Paperid:850
Authors:Shengzhe Zhou, Zejian Li, Shengyuan Zhang, Lefan Hou, Changyuan Yang, Guang Yang, Zhiyuan Yang, Lingyun Sun
Zhejiang University, Zhejiang University, Zhejiang University, ZheJiang University, Alibaba Group, Alibaba Group, Alibaba Group, Zhejiang University
Abstract:
Denoising Diffusion models have exhibited remarkable capabilities in image generation. However, generating high-quality samples requires a large number of iterations. Knowledge distillation for diffusion models is an effective method to address this limitation with a shortened sampling process but causes degraded generative quality. Based on our analysis with bias-variance decomposition and experimental observations, we attribute the degradation to the spatial fitting error occurring in the training of both the teacher and student model in the distillation. Accordingly, we propose Spatial Fitting-Error Reduction Distillation model (SFERD). SFERD utilizes attention guidance from the teacher model and a designed semantic gradient predictor to reduce the student's fitting error. Empirically, our proposed model facilitates high-quality sample generation in a few function evaluations. We achieve an FID of 5.31 on CIFAR-10 and 9.39 on ImageNet 64×64 with only one step, outperforming existing diffusion methods. Our study provides a new perspective on diffusion distillation by highlighting the intrinsic denoising ability of models.



Paperid:851
Authors:Shili Zhou, Ruian He, Weimin Tan, Bo Yan
School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
Abstract:
Optical Flow Estimation aims to find the 2D dense motion field between two frames. Due to the limitation of model structures and training datasets, existing methods often rely too much on local clues and ignore the integrity of objects, resulting in fragmented motion estimation. Through theoretical analysis, we find that pretrained large vision models are helpful in optical flow estimation, and we notice that the recently famous Segment Anything Model (SAM) demonstrates a strong ability to segment complete objects, which is suitable for solving the fragmentation problem. We thus propose a solution to embed the frozen SAM image encoder into FlowFormer to enhance object perception. To address the challenge of deeply utilizing SAM in non-segmentation tasks like optical flow estimation, we propose an Optical Flow Task-Specific Adaption scheme, including a Context Fusion Module to fuse the SAM encoder with the optical flow context encoder, and a Context Adaption Module to adapt the SAM features for the optical flow task with Learned Task-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10 clean/final EPE and 3.55/12.32 EPE/F1-all on the Sintel and KITTI-15 training sets, surpassing FlowFormer by 8.5%/9.9% and 13.2%/16.3%. Furthermore, our model achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks, ranking #1 among all two-frame methods on the Sintel clean pass.



Paperid:852
Authors:Yubo Zhou, Jin Lin, Fangchen Ye, Yanyun Qu, Yuan Xie
Xiamen University, Fujian, China, Xiamen University, Fujian, China, Xiamen University, Fujian, China, Xiamen University, Fujian, China, East China Normal University, Shanghai, China
Abstract:
Transformer has shown outstanding performance on image denoising, but existing Transformer methods for image denoising have large model sizes and high computational complexity, which is unfriendly to resource-constrained devices. In this paper, we propose a Lightweight Image Denoising Transformer method (LIDFormer) based on Triple Multi-Dconv Head Transposed Attention (TMDTA) to boost computational efficiency. LIDFormer first implements the Discrete Wavelet Transform (DWT), which transforms the input image into a low-frequency space, greatly reducing the computational complexity of image denoising. However, the low-frequency image lacks fine-feature information, which degrades the denoising performance. To handle this problem, we introduce the Complementary Periodic Feature Reusing (CPFR) scheme for aggregating the shallow-layer features and the deep-layer features. Furthermore, TMDTA is proposed to integrate global context along three dimensions, thereby enhancing the ability of global feature representation. Note that our method can be applied as a pipeline for both convolutional neural networks and Transformers. Extensive experiments on several benchmarks demonstrate that the proposed LIDFormer achieves a better trade-off between high performance and low computational complexity on real-world image denoising tasks.
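To show why a wavelet front end reduces cost, here is a minimal sketch, assuming a one-level orthonormal Haar transform, of extracting the low-frequency (LL) band on which a denoiser would operate at a quarter of the original spatial resolution; the paper's actual DWT configuration may differ.

```python
# Sketch: one-level Haar DWT low-frequency (LL) band via 2x2 pixel-group combination.
import torch

def haar_dwt_ll(x: torch.Tensor) -> torch.Tensor:
    """x: (B, C, H, W) with even H and W; returns the LL band of shape (B, C, H/2, W/2)."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    return (a + b + c + d) / 2.0   # LL coefficients of an orthonormal Haar transform

# A denoiser applied to haar_dwt_ll(x) sees a 4x smaller spatial grid,
# which is where the complexity saving comes from.
```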



Paperid:853
Authors:Yuxi Zhou, Xiujie Wang, Jianhua Zhang, Jiajia Wang, Jie Yu, Hao Zhou, Yi Gao, Shengyong Chen
Tianjin University of Technology Tsinghua University, Tianjin University of Technology, Tianjin University of Technology, Tianjin University of Technology, Tianjin University of Technology, Tianjin University of Technology, Tianjin University of Technology, Tianjin University of Technology
Abstract:
Human intention understanding in untrimmed videos aims to watch a natural video and predict what the person's intention is. Currently, exploration of predicting human intentions in untrimmed videos remains limited. On the one hand, untrimmed videos with mixed actions and backgrounds have a significant long-tail distribution with concept-drift characteristics. On the other hand, most methods can only perceive instantaneous intentions but cannot determine the evolution of intentions. To address the above challenges, we propose a loss based on Instance Confidence and Class Accuracy (ICCA), which aims to alleviate the prediction bias caused by the long-tail distribution with concept-drift characteristics in video streams. In addition, we propose an intention-oriented evolutionary learning method to determine the intention evolution pattern (from what action to what action) and the time of evolution (when the action evolves). We conducted extensive experiments on two untrimmed video datasets (THUMOS14 and ActivityNet v1.3), and our method achieves excellent results compared to SOTA methods. The code and supplementary materials are available at https://github.com/Jennifer123www/UntrimmedVideo.
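The exact form of the ICCA loss is not given here, so the following is only a hedged sketch of one plausible reading: per-sample cross-entropy re-weighted by instance confidence and a running per-class accuracy, so that confident samples of already well-classified head classes contribute less.

```python
# Sketch (assumed form, not the paper's definition): confidence- and class-accuracy-weighted CE.
import torch
import torch.nn.functional as F

def icca_style_loss(logits, targets, class_accuracy):
    """logits: (B, K); targets: (B,); class_accuracy: (K,) running accuracy per class."""
    ce = F.cross_entropy(logits, targets, reduction='none')              # (B,)
    confidence = F.softmax(logits, dim=1).gather(1, targets[:, None]).squeeze(1)
    weight = (1.0 - confidence.detach()) * (1.0 - class_accuracy[targets])
    return (weight * ce).mean()
```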



Paperid:854
Authors:Chendi Zhu, Lujun Li, Yuli Wu, Zhengxing Sun
State Key Laboratory for Novel Software Technology, Nanjing University, The Hong Kong University of Science and Technology, Institute of Imaging and Computer Vision, RWTH Aachen University, Aachen, Germany, State Key Laboratory for Novel Software Technology, Nanjing University
Abstract:
In this paper, we present SasWOT, the first training-free Semantic segmentation Architecture Search (SAS) framework via an auto-discovery proxy. Semantic segmentation is widely used in many real-time applications. For fast inference and memory efficiency, previous SAS methods seek the optimal segmenter by differentiable or RL search. However, the significant computational costs of these training-based SAS methods limit their practical usage. To improve the search efficiency, we explore the training-free route but empirically observe that the existing zero-cost proxies designed for the classification task are sub-optimal on the segmentation benchmark. To address this challenge, we develop a customized proxy search framework for SAS tasks to augment its predictive capabilities. Specifically, we design the proxy search space based on two observations: (1) different inputs of segmenter statistics can be well combined; (2) some basic operators can effectively improve the correlation. Thus, we build computational graphs with multiple statistics as inputs and basic arithmetic operations as the primary operations to represent candidate proxies. Then, we employ an evolutionary algorithm to crossover and mutate the superior candidates in the population based on correlation evaluation. Finally, based on the searched proxy, we perform the segmenter search without candidate training. In this way, SasWOT not only enables automated proxy optimization for SAS tasks but also achieves significant search acceleration before the retraining stage. Extensive experiments on the Cityscapes and CamVid datasets demonstrate that SasWOT achieves a superior trade-off between accuracy and speed over several state-of-the-art techniques. More remarkably, on the Cityscapes dataset, SasWOT achieves a performance of 71.3% mIoU at a speed of 162 FPS.
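A toy sketch of the proxy-search loop described above: candidate proxies are small (statistic, operator) combinations scored by rank correlation with the known accuracies of a small architecture pool, and an evolutionary loop keeps and refreshes the best candidates. The operator set, population size, and single-operator candidate form are simplifying assumptions, far simpler than SasWOT's computational graphs.

```python
# Toy sketch of evolutionary zero-cost proxy search (illustrative only).
import random
import numpy as np

OPS = {
    'log': lambda s: np.log(np.abs(s) + 1e-8),
    'sq':  lambda s: s ** 2,
    'neg': lambda s: -s,
    'id':  lambda s: s,
}

def rank_corr(a, b):
    """Spearman-style correlation via Pearson correlation of ranks (ties ignored)."""
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def evolve_proxy(stats, accs, pop=20, gens=30):
    """stats: dict name -> (N,) statistic per architecture; accs: (N,) true accuracies."""
    names = list(stats)
    def random_candidate():
        return (random.choice(names), random.choice(list(OPS)))
    def score(cand):
        name, op = cand
        return rank_corr(OPS[op](stats[name]), accs)
    population = [random_candidate() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=score, reverse=True)
        survivors = population[:pop // 2]
        population = survivors + [random_candidate() for _ in survivors]  # refresh half
    return max(population, key=score)
```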



Paperid:855
Authors:Guangming Zhu, Siyuan Wang, Tianci Wu, Liang Zhang
School of Computer Science and Technology, Xidian University, China Key Laboratory of Smart Human-Computer Interaction and Wearable Technology of Shaanxi Province Xi'an Key Laboratory of Intelligent Software Engineering, School of Computer Science and Technology, Xidian University, China, School of Computer Science and Technology, Xidian University, China, School of Computer Science and Technology, Xidian University, China Key Laboratory of Smart Human-Computer Interaction and Wearable Technology of Shaanxi Province Xi'an Key Laboratory of Intelligent Software Engineering
Abstract:
Freehand sketches are appealing for humans as a universal tool to depict the visual world. Humans can easily recognize varied sketches of a category by identifying the concurrence and layout of the intrinsic semantic components of the category, since humans draw freehand sketches based on a common consensus about which types of semantic components constitute each sketch category. For example, an airplane should at least have a fuselage and wings. Based on this analysis, a semantic component-level memory module is constructed and embedded in the proposed structured sketch recognition network in this paper. The memory keys representing the semantic components of each sketch category can be self-learned and enhance the recognition network's explainability. Our proposed networks can deal with different situations of sketch recognition, i.e., with or without semantic component labels of strokes. Experiments on the SPG and SketchIME datasets demonstrate the memory module's flexibility and the recognition network's explainability. The code and data are available at https://github.com/GuangmingZhu/SketchESC.



Paperid:856
Authors:Jiaying Zhu, Dong Li, Xueyang Fu, Gang Yang, Jie Huang, Aiping Liu, Zheng-Jun Zha
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China
Abstract:
This study introduces a new method for detecting and localizing image forgery by focusing on manipulation traces within the noise domain. We posit that nearly invisible noise in RGB images carries tampering traces, useful for distinguishing and locating forgeries. However, the advancement of tampering technology complicates the direct application of noise for forgery detection, as the noise inconsistency between forged and authentic regions is not fully exploited. To tackle this, we develop a two-step discriminative noise-guided approach to explicitly enhance the representation and use of noise inconsistencies, thereby fully exploiting noise information to improve the accuracy and robustness of forgery detection. Specifically, we first enhance the noise discriminability of forged regions compared to authentic ones using a de-noising network and a statistics-based constraint. Then, we merge a model-driven guided filtering mechanism with a data-driven attention mechanism to create a learnable and differentiable noise-guided filter. This sophisticated filter allows us to maintain the edges of forged regions learned from the noise. Comprehensive experiments on multiple datasets demonstrate that our method can reliably detect and localize forgeries, surpassing existing state-of-the-art methods.



Paperid:857
Authors:Juanjuan Zhu, Zhexiong Wan, Yuchao Dai
Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University
Abstract:
Recently, the task of Video Frame Prediction (VFP), which predicts future video frames from previous ones through extrapolation, has made remarkable progress. However, the performance of existing VFP methods is still far from satisfactory due to the fixed frame-rate video used: 1) they have difficulties in handling complex dynamic scenes; 2) they cannot predict future frames with flexible prediction time intervals. The event cameras can record the intensity changes asynchronously with a very high temporal resolution, which provides rich dynamic information about the observed scenes. In this paper, we propose to predict video frames from a single image and the following events, which can not only handle complex dynamic scenes but also predict future frames with flexible prediction time intervals. First, we introduce a symmetrical cross-modal attention augmentation module to enhance the complementary information between images and events. Second, we propose to jointly achieve optical flow estimation and frame generation by combining the motion information of events and the semantic information of the image, then inpainting the holes produced by forward warping to obtain an ideal prediction frame. Based on these, we propose a lightweight pyramidal coarse-to-fine model that can predict a 720P frame within 25 ms. Extensive experiments show that our proposed model significantly outperforms the state-of-the-art frame-based and event-based VFP methods and has the fastest runtime. Code is available at https://npucvr.github.io/VFPSIE/.
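As a rough sketch of the forward-warp-then-inpaint idea (nearest-neighbour splatting, invented function names, not the released model), the code below pushes each source pixel to its flow target and returns the mask of holes that a subsequent inpainting step would fill.

```python
# Sketch: nearest-neighbour forward warping with a hole mask for later inpainting.
import torch

def forward_warp_nearest(image: torch.Tensor, flow: torch.Tensor):
    """image: (C, H, W); flow: (2, H, W) holding (dx, dy); returns (warped, hole_mask)."""
    C, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    tx = (xs + flow[0]).round().long().clamp(0, W - 1)
    ty = (ys + flow[1]).round().long().clamp(0, H - 1)
    idx = (ty * W + tx).flatten()                                  # flat target indices
    warped = torch.zeros(C, H * W, dtype=image.dtype)
    count = torch.zeros(H * W, dtype=image.dtype)
    warped.index_add_(1, idx, image.flatten(1))                    # splat source pixels
    count.index_add_(0, idx, torch.ones(H * W, dtype=image.dtype))
    hole_mask = (count == 0).view(H, W)                            # pixels nothing mapped to
    warped = warped / count.clamp(min=1.0)                         # average collisions
    return warped.view(C, H, W), hole_mask
```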



Paperid:858
Authors:Lin Zhu, Xianzhang Chen, Xiao Wang, Hua Huang
Beijing Institute of Technology, Beijing Institute of Technology, Anhui University, Beijing Normal University
Abstract:
As a bio-inspired vision sensor, the spike camera emulates the operational principles of the fovea, a compact retinal region, by employing spike discharges to encode the accumulation of per-pixel luminance intensity. Leveraging its high temporal resolution and bio-inspired neuromorphic design, the spike camera holds significant promise for advancing computer vision applications. Saliency detection mimics the behavior of human beings and captures the most salient region from the scenes. In this paper, we investigate visual saliency in the continuous spike stream for the first time. To effectively process the binary spike stream, we propose a Recurrent Spiking Transformer (RST) framework, which is based on a full spiking neural network. Our framework enables the extraction of spatio-temporal features from the continuous spatio-temporal spike stream while maintaining low power consumption. To facilitate the training and validation of our proposed model, we build a comprehensive real-world spike-based visual saliency dataset, enriched with numerous lighting conditions. Extensive experiments demonstrate the superior performance of our Recurrent Spiking Transformer framework in comparison to other spiking neural network-based methods. Our framework exhibits a substantial margin of improvement in capturing and highlighting visual saliency in the spike stream, which not only provides a new perspective for spike-based saliency segmentation but also shows a new paradigm for full SNN-based transformer models. The code and dataset are available at https://github.com/BIT-Vision/SVS.



Paperid:859
Authors:Liuwan Zhu, Rui Ning, Jiang Li, Chunsheng Xin, Hongyi Wu
University of Hawaii at Manoa, Old Dominion University, Old Dominion University, Old Dominion University, Univesity of Arizona
Abstract:
This paper proposes SEER, a novel backdoor detection algorithm for vision-language models, addressing the gap in the literature on multi-modal backdoor detection. While backdoor detection in single-modal models has been well studied, the investigation of such defenses in multi-modal models remains limited. Existing backdoor defense mechanisms cannot be directly applied to multi-modal settings due to their increased complexity and search space explosion. In this paper, we propose to detect backdoors in vision-language models by jointly searching image triggers and malicious target texts in feature space shared by vision and language modalities. Our extensive experiments demonstrate that SEER can achieve over 92% detection rate on backdoor detection in vision-language models in various settings without accessing training data or knowledge of downstream tasks.



Paperid:860
Authors:Shipeng Zhu, Pengfei Fang, Chenjie Zhu, Zuoyan Zhao, Qiang Xu, Hui Xue
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Abstract:
Real-world text can be damaged by corrosion issues caused by environmental or human factors, which hinder the preservation of the complete styles of texts, e.g., texture and structure. These corrosion issues, such as graffiti signs and incomplete signatures, bring difficulties in understanding the texts, thereby posing significant challenges to downstream applications, e.g., scene text recognition and signature identification. Notably, current inpainting techniques often fail to adequately address this problem and have difficulties restoring accurate text images along with reasonable and consistent styles. Formulating this as an open problem of text image inpainting, this paper aims to build a benchmark to facilitate its study. In doing so, we establish two specific text inpainting datasets which contain scene text images and handwritten text images, respectively. Each of them includes images revamped by real-life and synthetic datasets, featuring pairs of original images, corrupted images, and other assistant information. On top of the datasets, we further develop a novel neural framework, Global Structure-guided Diffusion Model (GSDM), as a potential solution. Leveraging the global structure of the text as a prior, the proposed GSDM develops an efficient diffusion model to recover clean texts. The efficacy of our approach is demonstrated by thorough empirical study, including a substantial boost in both recognition accuracy and image quality. These findings not only highlight the effectiveness of our method but also underscore its potential to enhance the broader field of text image understanding and processing. Code and datasets are available at: https://github.com/blackprotoss/GSDM.



Paperid:861
Authors:Xingyu Zhu, Guanhui Ye, Xiapu Luo, Xuetao Wei
Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology, Shenzhen 518055, China Department of Computer Science and Engineering, Southern University of Science and Technology, China Department of Computing, Hong Kong Polytechnic University, Hong Kong, Department of Computer Science and Engineering, Southern University of Science and Technology, China, Department of Computing, Hong Kong Polytechnic University, Hong Kong, Department of Computer Science and Engineering, Southern University of Science and Technology, China Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology, Shenzhen 518055, China
Abstract:
The goal of 3D mesh watermarking is to embed a message in 3D meshes that can withstand various attacks imperceptibly and to reconstruct the message accurately from watermarked meshes. The watermarking algorithm is supposed to withstand multiple attacks, and its complexity should not grow significantly with the mesh size. Unfortunately, previous methods are less robust against attacks and lack adaptability. In this paper, we propose Deep3DMark, a robust and adaptable deep 3D mesh watermarking method that leverages attention-based convolutions in watermarking tasks to embed binary messages in vertex distributions without texture assistance. Furthermore, our Deep3DMark exploits the property that simplified meshes inherit similar relations from the original ones, where the relation is the offset vector directed from one vertex to its neighbor. By doing so, our method can be trained on simplified meshes but remains effective on large meshes (size adaptable) and unseen categories of meshes (geometry adaptable). Extensive experiments demonstrate that our method remains efficient and effective even when the mesh size is increased by 190×. Under mesh attacks, Deep3DMark achieves 10%∼50% higher accuracy than traditional methods, and 2× higher SNR and 8% higher accuracy than previous DNN-based methods.
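The "relation" mentioned above is simply the offset vector from a vertex to its neighbor; a minimal sketch of gathering these per-edge offsets is shown below, with assumed tensor layouts.

```python
# Sketch: per-edge offset vectors ("relations") from a vertex to its neighbor.
import torch

def vertex_relations(vertices: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
    """vertices: (V, 3) float; edges: (E, 2) long index pairs; returns (E, 3) offsets."""
    src, dst = edges[:, 0], edges[:, 1]
    return vertices[dst] - vertices[src]   # offset directed from each vertex to its neighbor
```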



Paperid:862
Authors:Xingyu Zhu, Shuo Wang, Jinda Lu, Yanbin Hao, Haifeng Liu, Xiangnan He
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, Brain-Inspired Technology Co., Ltd., University of Science and Technology of China
Abstract:
Few-shot learning (FSL) based on manifold regularization aims to improve the recognition capacity of novel objects with limited training samples by mixing two samples from different categories with a blending factor. However, this mixing operation weakens the feature representation due to the linear interpolation and the overlooking of the importance of specific channels. To solve these issues, this paper proposes attentive feature regularization (AFR) which aims to improve the feature representativeness and discriminability. In our approach, we first calculate the relations between different categories of semantic labels to pick out the related features used for regularization. Then, we design two attention-based calculations at both the instance and channel levels. These calculations enable the regularization procedure to focus on two crucial aspects: the feature complementarity through adaptive interpolation in related categories and the emphasis on specific feature channels. Finally, we combine these regularization strategies to significantly improve the classifier performance. Empirical studies on several popular FSL benchmarks demonstrate the effectiveness of AFR, which improves the recognition accuracy of novel categories without the need to retrain any feature extractor, especially in the 1-shot setting. Furthermore, the proposed AFR can seamlessly integrate into other FSL methods to improve classification performance.
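A hedged sketch of the attentive interpolation idea: mix a novel-class feature with a feature from a semantically related category using a learned per-channel attention, so only selected channels are blended. The module name, dimensions, and exact form are assumptions for illustration, not the paper's definition.

```python
# Sketch: channel-attentive interpolation between a sample feature and a related-class feature.
import torch
import torch.nn as nn

class AttentiveFeatureMix(nn.Module):
    def __init__(self, dim: int = 640):
        super().__init__()
        self.channel_attn = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, related_feat: torch.Tensor) -> torch.Tensor:
        # feat, related_feat: (B, D) features of a novel-class sample and a related category
        alpha = self.channel_attn(torch.cat([feat, related_feat], dim=1))  # (B, D) per-channel
        return feat + alpha * (related_feat - feat)                        # blend chosen channels
```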



Paperid:863
Authors:Yu Zhu, Kang Li, Lequan Yu, Pheng Ann Heng
Department of Computer Science and Engineering, The Chinese University of Hong Kong Department of Mechanical Engineering, The University of Hong Kong, Department of Computer Science and Engineering, The Chinese University of Hong Kong, Department of Statistics and Actuarial Science, The University of Hong Kong, Department of Computer Science and Engineering, The Chinese University of Hong Kong
Abstract:
Recent studies have made remarkable progress in histopathology classification. Building on current successes, contemporary works have proposed to further upgrade the model towards a more generalizable and robust direction by incrementally learning from sequentially delivered domains. Unlike previous parameter-isolation-based approaches that usually demand massive computation resources during model updating, we present a memory-efficient prompt tuning framework to cultivate the model's generalization potential at an economical memory cost. For each incoming domain, we reuse the existing parameters of the initial classification model and attach lightweight trainable prompts to it for customized tuning. Considering the domain heterogeneity, we perform decoupled prompt tuning, where we adopt a domain-specific prompt for each domain to independently investigate its distinctive characteristics, and one domain-invariant prompt shared across all domains to continually explore the common content embedding throughout time. All domain-specific prompts are appended to the prompt bank and isolated from further changes to prevent forgetting the distinctive features of early-seen domains, while the domain-invariant prompt is passed on and iteratively evolves by style-augmented prompt refining to improve model generalization capability over time. Specifically, we construct a graph with existing prompts and build a style-augmented graph attention network to guide the domain-invariant prompt in exploring the overlapped latent embedding among all delivered domains for more domain-generic representations. We have extensively evaluated our framework on two histopathology tasks, i.e., breast cancer metastasis classification and epithelium-stroma tissue classification, where our approach yields superior performance and memory efficiency over competing methods.



Paperid:864
Authors:Yun Zhu, Le Hui, Yaqi Shen, Jin Xie
Nanjing University of Science and Technology, Northwestern Polytechnical University, Nanjing University of Science and Technology, Nanjing University of Science and Technology
Abstract:
Current 3D object detection methods for indoor scenes mainly follow the voting-and-grouping strategy to generate proposals. However, most methods utilize instance-agnostic groupings, such as ball query, leading to inconsistent semantic information and inaccurate regression of the proposals. To this end, we propose a novel superpoint grouping network for indoor anchor-free one-stage 3D object detection. Specifically, we first adopt an unsupervised manner to partition raw point clouds into superpoints, areas with semantic consistency and spatial similarity. Then, we design a geometry-aware voting module that adapts to the centerness in anchor-free detection by constraining the spatial relationship between superpoints and object centers. Next, we present a superpoint-based grouping module to explore the consistent representation within proposals. This module includes a superpoint attention layer to learn feature interaction between neighboring superpoints, and a superpoint-voxel fusion layer to propagate the superpoint-level information to the voxel level. Finally, we employ effective multiple matching to capitalize on the dynamic receptive fields of proposals based on superpoints during the training. Experimental results demonstrate our method achieves state-of-the-art performance on ScanNet V2, SUN RGB-D, and S3DIS datasets in the indoor one-stage 3D object detection. Source code is available at https://github.com/zyrant/SPGroup3D.
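The core grouping operation, pooling point features into superpoint features, can be sketched as a scatter-mean over superpoint membership; this is only an illustrative fragment with assumed tensor layouts, not the proposed modules.

```python
# Sketch: average point features within each superpoint (scatter-mean pooling).
import torch

def superpoint_pool(point_feats: torch.Tensor, superpoint_ids: torch.Tensor) -> torch.Tensor:
    """point_feats: (N, C); superpoint_ids: (N,) long in [0, S); returns (S, C) mean features."""
    S = int(superpoint_ids.max().item()) + 1
    sums = torch.zeros(S, point_feats.size(1), dtype=point_feats.dtype)
    sums.index_add_(0, superpoint_ids, point_feats)
    counts = torch.zeros(S, dtype=point_feats.dtype)
    counts.index_add_(0, superpoint_ids, torch.ones(superpoint_ids.numel(), dtype=point_feats.dtype))
    return sums / counts.clamp(min=1.0).unsqueeze(1)
```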



Paperid:865
Authors:Zhifeng Zhu, Yaochen Li, Yifan Li, Jinhuo Yang, Peijun Chen, Yuehu Liu
School of Software Engineering, Xi’an Jiaotong University, School of Software Engineering, Xi’an Jiaotong University, School of Software Engineering, Xi’an Jiaotong University, School of Software Engineering, Xi’an Jiaotong University, School of Software Engineering, Xi’an Jiaotong University, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Abstract:
For the task of unsupervised image translation, transforming the image style while preserving its original structure remains challenging. In this paper, we propose an unsupervised image translation method with structural enhancement in frequency domain named SEIT. Specifically, a frequency dynamic adaptive (FDA) module is designed for image style transformation that can well transfer the image style while maintaining its overall structure by decoupling the image content and style in frequency domain. Moreover, a wavelet-based structure enhancement (WSE) module is proposed to improve the intermediate translation results by matching the high-frequency information, thus enriching the structural details. Furthermore, a multi-scale network architecture is designed to extract the domain-specific information using image-independent encoders for both the source and target domains. The extensive experimental results well demonstrate the effectiveness of the proposed method.



Paperid:866
Authors:Yiyu Zhuang, Qi Zhang, Xuan Wang, Hao Zhu, Ying Feng, Xiaoyu Li, Ying Shan, Xun Cao
Nanjing University, Tencent AI Lab, Ant Group, Nanjing University, Tencent AI Lab, Tencent AI Lab, Tencent AI Lab, Nanjing University
Abstract:
Recent advances in implicit neural representation have demonstrated the ability to recover detailed geometry and material from multi-view images. However, the use of simplified lighting models such as environment maps to represent non-distant illumination, or using a network to fit indirect light modeling without a solid basis, can lead to an undesirable decomposition between lighting and material. To address this, we propose a fully differentiable framework named Neural Illumination Fields (NeIF) that uses radiance fields as a lighting model to handle complex lighting in a physically based way. Together with integral lobe encoding for roughness-adaptive specular lobe and leveraging the pre-convolved background for accurate decomposition, the proposed method represents a significant step towards integrating physically based rendering into the NeRF representation. The experiments demonstrate the superior performance of novel-view rendering compared to previous works, and the capability to re-render objects under arbitrary NeRF-style environments opens up exciting possibilities for bridging the gap between virtual and real-world scenes.



Paperid:867
Authors:Wei Zong, Yang-Wai Chow, Willy Susilo, Joonsang Baek, Jongkil Kim, Seyit Camtepe
University of Wollongong, University of Wollongong, University of Wollongong, University of Wollongong, Ewha Womans University, CSIRO Data61
Abstract:
Training Deep Neural Networks (DNNs) can be expensive when data is difficult to obtain or labeling them requires significant domain expertise. Hence, it is crucial that the Intellectual Property (IP) of DNNs trained on valuable data be protected against IP infringement. DNN fingerprinting and watermarking are two lines of work in DNN IP protection. Recently proposed DNN fingerprinting techniques are able to detect IP infringement while preserving model performance by relying on the key assumption that the decision boundaries of independently trained models are intrinsically different from one another. In contrast, DNN watermarking embeds a watermark in a model and verifies IP infringement if an identical or similar watermark is extracted from a suspect model. The techniques deployed in fingerprinting and watermarking vary significantly because their underlying mechanisms are different. From an adversary's perspective, a successful IP removal attack should defeat both fingerprinting and watermarking. However, to the best of our knowledge, there is no work on such attacks in the literature yet. In this paper, we fill this gap by presenting an IP removal attack that can defeat both fingerprinting and watermarking. We consider the challenging data-free scenario whereby all data is inverted from the victim model. Under this setting, a stolen model only depends on the victim model. Experimental results demonstrate the success of our attack in defeating state-of-the-art DNN fingerprinting and watermarking techniques. This work reveals a novel attack surface that exploits generative model inversion attacks to bypass DNN IP defenses. This threat must be addressed by future defenses for reliable IP protection.



Paperid:868
Authors:Jiayu Zou, Kun Tian, Zheng Zhu, Yun Ye, Xingang Wang
Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, PhiGent Robotics, PhiGent Robotics, Institute of Automation, Chinese Academy of Sciences
Abstract:
BEV perception is of great importance in the field of autonomous driving, serving as the cornerstone of planning, controlling, and motion prediction. The quality of the BEV feature strongly affects the performance of BEV perception. However, given the noise in camera parameters and LiDAR scans, we usually obtain a BEV representation contaminated with harmful noise. Diffusion models naturally have the ability to denoise noisy samples toward the ideal data, which motivates us to utilize the diffusion model to obtain a better BEV representation. In this work, we propose an end-to-end framework, named DiffBEV, to exploit the potential of the diffusion model to generate a more comprehensive BEV representation. To the best of our knowledge, we are the first to apply diffusion models to BEV perception. In practice, we design three types of conditions to guide the training of the diffusion model, which denoises the coarse samples and refines the semantic feature in a progressive way. Moreover, a cross-attention module is leveraged to fuse the context of the BEV feature and the semantic content of the conditional diffusion model. DiffBEV achieves 25.9% mIoU on the nuScenes dataset, which is 6.2% higher than the best-performing existing approach. Quantitative and qualitative results on multiple benchmarks demonstrate the effectiveness of DiffBEV in BEV semantic segmentation and 3D object detection tasks.



Paperid:869
Authors:Shinan Zou, Chao Fan, Jianbo Xiong, Chuanfu Shen, Shiqi Yu, Jin Tang
School of Automation, Central South University, Department of Computer Science and Engineering, Southern University of Science and Technology Research Institute of Trustworthy Autonomous System, Southern University of Science and Technology, School of Automation, Central South University, Department of Computer Science and Engineering, Southern University of Science and Technology The University of Hong Kong, Department of Computer Science and Engineering, Southern University of Science and Technology Research Institute of Trustworthy Autonomous System, Southern University of Science and Technology, School of Automation, Central South University
Abstract:
Gait datasets are essential for gait research. However, this paper observes that present benchmarks, whether conventional constrained or emerging real-world datasets, fall short regarding covariate diversity. To bridge this gap, we undertake an arduous 20-month effort to collect a cross-covariate gait recognition (CCGR) dataset. The CCGR dataset has 970 subjects and about 1.6 million sequences; almost every subject has 33 views and 53 different covariates. Compared to existing datasets, CCGR has both population and individual-level diversity. In addition, the views and covariates are well labeled, enabling the analysis of the effects of different factors. CCGR provides multiple types of gait data, including RGB, parsing, silhouette, and pose, offering researchers a comprehensive resource for exploration. In order to delve deeper into addressing cross-covariate gait recognition, we propose parsing-based gait recognition (ParsingGait) by utilizing the newly proposed parsing data. We have conducted extensive experiments. Our main results show: 1) Cross-covariate emerges as a pivotal challenge for practical applications of gait recognition. 2) ParsingGait demonstrates remarkable potential for further advancement. 3) Alarmingly, existing SOTA methods achieve less than 43% accuracy on the CCGR, highlighting the urgency of exploring cross-covariate gait recognition. Link: https://github.com/ShinanZou/CCGR.



Paperid:870
Authors:Siyu Zou, Jiji Tang, Yiyi Zhou, Jing He, Chaoyi Zhao, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun
Xiamen University, Fuxi AI Lab, NetEase Inc., Xiamen University, Xiamen University, Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc., Xiamen University
Abstract:
Diffusion-based Image Editing (DIE) is an emerging research hot-spot, which often applies a semantic mask to control the target area for diffusion-based editing. However, most existing solutions obtain these masks via manual operations or off-line processing, greatly reducing their efficiency. In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit). In particular, InstDiffEdit aims to employ the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps. To reduce the noise of attention maps and achieve full automation, we equip InstDiffEdit with a training-free refinement scheme to adaptively aggregate the attention distributions for automatic yet accurate mask generation. Meanwhile, to supplement the existing evaluations of DIE, we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods. To validate InstDiffEdit, we also conduct extensive experiments on ImageNet and Imagen, and compare it with a range of SOTA methods. The experimental results show that InstDiffEdit not only outperforms the SOTA methods in both image quality and editing results, but also has a much faster inference speed, i.e., 5 to 6 times faster. Our code is available at https://anonymous.4open.science/r/InstDiffEdit-C306



Paperid:871
Authors:Wenbin Zou, Hongxia Gao, Tian Ye, Liang Chen, Weipeng Yang, Shasha Huang, Hongsheng Chen, Sixiang Chen
The School of Automation Science and Engineering, South China University of Technology, Guangzhou, The School of Automation Science and Engineering, South China University of Technology, Guangzhou Research Center for Brain-Computer Interface, Pazhou Laboratory, Guangzhou, The Hong Kong University of Science and Technology, Guangzhou, College of Photonic and Electronic Engineering, Fujian Normal University, Fuzhou, The School of Automation Science and Engineering, South China University of Technology, Guangzhou, The School of Automation Science and Engineering, South China University of Technology, Guangzhou, The School of Automation Science and Engineering, South China University of Technology, Guangzhou, The Hong Kong University of Science and Technology, Guangzhou
Abstract:
Night photography often struggles with challenges like low light and blurring, stemming from dark environments and prolonged exposures. Current methods either disregard priors and directly fit end-to-end networks, leading to inconsistent illumination, or rely on unreliable handcrafted priors to constrain the network, thereby introducing greater error into the final result. We believe in the strength of data-driven high-quality priors and strive to offer a reliable and consistent prior, circumventing the restrictions of manual priors. In this paper, we propose Clearer Night Image Restoration with Vector-Quantized Codebook (VQCNIR) to achieve remarkable and consistent restoration outcomes on real-world and synthetic benchmarks. To ensure the faithful restoration of details and illumination, we propose the incorporation of two essential modules: the Adaptive Illumination Enhancement Module (AIEM) and the Deformable Bi-directional Cross-Attention (DBCA) module. The AIEM leverages the inter-channel correlation of features to dynamically maintain illumination consistency between degraded features and high-quality codebook features. Meanwhile, the DBCA module effectively integrates texture and structural information through bi-directional cross-attention and deformable convolution, resulting in enhanced fine-grained detail and structural fidelity across parallel decoders. Extensive experiments validate the remarkable benefits of VQCNIR in enhancing image quality under low-light conditions, showcasing its state-of-the-art performance on both synthetic and real-world datasets. The code is available at https://github.com/AlexZou14/VQCNIR.



Paperid:872
Authors:Yang Zou, Xingyuan Li, Zhiying Jiang, Jinyuan Liu
The University of Sydney, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology
Abstract:
Neural Radiance Fields (NeRF) have made significant strides in the modeling and rendering of 3D scenes. However, due to the complexity of luminance information, existing NeRF methods often struggle to produce satisfactory renderings when dealing with high and low exposure images. To address this issue, we propose an innovative approach capable of effectively modeling and rendering images under multiple exposure conditions. Our method adaptively learns the characteristics of images under different exposure conditions through an unsupervised evaluator-simulator structure for HDR (High Dynamic Range) fusion. This approach enhances NeRF's comprehension and handling of light variations, leading to the generation of images with appropriate brightness. Simultaneously, we present a bilevel optimization method tailored for novel view synthesis, aiming to harmonize the luminance information of input images while preserving their structural and content consistency. This approach facilitates the concurrent optimization of multi-exposure correction and novel view synthesis, in an unsupervised manner. Through comprehensive experiments conducted on the LOM and LOL datasets, our approach surpasses existing methods, markedly enhancing the task of novel view synthesis for multi-exposure environments and attaining state-of-the-art results. The source code can be found at https://github.com/Archer-204/AME-NeRF.



Paperid:873
Authors:Yanmei Zou, Hongshan Yu, Zhengeng Yang, Zechuan Li, Naveed Akhtar
Hunan University, Hunan University, Hunan Normal University, Hunan University, The University of Melbourne
Abstract:
Multi-Layer Perceptron (MLP) models are the bedrock of contemporary point cloud processing. However, their complex network architectures obscure the source of their strength. We first develop an “abstraction and refinement” (ABS-REF) view for the neural modeling of point clouds. This view elucidates that whereas the early models focused on the ABS stage, the more recent techniques devise sophisticated REF stages to attain performance advantage in point cloud processing. We then borrow the concept of “positional encoding” from the transformer literature, and propose a High-dimensional Positional Encoding (HPE) module, which can be readily deployed to MLP-based architectures. We leverage our module to develop a suite of HPENet models, which are MLP networks that follow the ABS-REF paradigm, albeit with a sophisticated HPE-based REF stage. The developed technique is extensively evaluated for 3D object classification, object part segmentation, semantic segmentation and object detection. We establish new state-of-the-art results of 87.6 mAcc on ScanObjectNN for object classification, 85.5 class mIoU on ShapeNetPart for object part segmentation, and 72.7 and 78.7 mIoU on Area-5 and 6-fold experiments with S3DIS for semantic segmentation. The source code for this work is available at https://github.com/zouyanmei/HPENet.
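For context, the classical transformer/NeRF-style sinusoidal encoding that the “positional encoding” concept refers to can be applied to raw 3D point coordinates as below; this is only the textbook encoding, not the HPE module proposed in the paper, and the frequency schedule is an assumption:

```python
import numpy as np

def sinusoidal_encoding(xyz, num_freqs=6):
    """Classical sinusoidal positional encoding of 3D point coordinates.

    Standard transformer/NeRF-style encoding; the paper's HPE design may
    differ. xyz: (N, 3) coordinates -> (N, 3 * 2 * num_freqs) embedding.
    """
    freqs = 2.0 ** np.arange(num_freqs)              # (F,)
    angles = xyz[:, :, None] * freqs[None, None, :]  # (N, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(xyz.shape[0], -1)

points = np.random.rand(1024, 3)
embedded = sinusoidal_encoding(points)   # shape (1024, 36)
```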



Paperid:874
Authors:Zixin Zou, Weihao Cheng, Yan-Pei Cao, Shi-Sheng Huang, Ying Shan, Song-Hai Zhang
BNRist, Tsinghua University, ARC Lab, Tencent PCG, ARC Lab, Tencent PCG, Beijing Normal University, ARC Lab, Tencent PCG, BNRist, Tsinghua University
Abstract:
Reconstructing 3D objects from extremely sparse views is a longstanding and challenging problem. While recent techniques employ image diffusion models for generating plausible images at novel viewpoints or for distilling pre-trained diffusion priors into 3D representations using score distillation sampling (SDS), these methods often struggle to simultaneously achieve high-quality, consistent, and detailed results for both novel-view synthesis (NVS) and geometry. In this work, we present Sparse3D, a novel 3D reconstruction method tailored for sparse view inputs. Our approach distills robust priors from a multiview-consistent diffusion model to refine a neural radiance field. Specifically, we employ a controller that harnesses epipolar features from input views, guiding a pre-trained diffusion model, such as Stable Diffusion, to produce novel-view images that maintain 3D consistency with the input. By tapping into 2D priors from powerful image diffusion models, our integrated model consistently delivers high-quality results, even when faced with open-world objects. To address the blurriness introduced by conventional SDS, we introduce the category-score distillation sampling (C-SDS) to enhance detail. We conduct experiments on CO3DV2 which is a multi-view dataset of real-world objects. Both quantitative and qualitative evaluations demonstrate that our approach outperforms previous state-of-the-art works on the metrics regarding NVS and geometry reconstruction.



Paperid:875
Authors:Fengyuan Zuo, Zhaolin Xiao, Haiyan Jin, Haonan Su
Xi'an University of Technology, China, 710048, Xi'an University of Technology, China, 710048 Shaanxi Key Laboratory for Network Computing and Security Technology, China, 710048, Xi'an University of Technology, China, 710048 Shaanxi Key Laboratory for Network Computing and Security Technology, China, 710048, Xi'an University of Technology, China, 710048 Shaanxi Key Laboratory for Network Computing and Security Technology, China, 710048
Abstract:
Accurately computing optical flow in low-contrast and noisy dark images is challenging, especially when contour information is degraded or difficult to extract. This paper proposes CEDFlow, a latent space contour enhancement for estimating optical flow in dark environments. By leveraging spatial frequency feature decomposition, CEDFlow effectively encodes local and global motion features. Importantly, we introduce the 2nd-order Gaussian difference operation to select salient contour features in the latent space precisely. It is specifically designed for large-scale contour components essential in dark optical flow estimation. Experimental results on the FCDN and VBOF datasets demonstrate that CEDFlow outperforms state-of-the-art methods in terms of the EPE index and produces more accurate and robust flow estimation. Our code is available at: https://github.com/xautstuzfy.
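As a simple reference for Gaussian-difference filtering (the paper's 2nd-order Gaussian difference acts in the latent space and its exact form is not spelled out here, so the sigmas and threshold below are assumptions), a difference-of-Gaussians contour selector on a 2D map looks like:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_contours(feature_map, sigma1=1.0, sigma2=2.0, threshold=0.05):
    """Difference-of-Gaussians contour selection on a 2D feature map.

    Illustrates Gaussian-difference filtering in general; the 2nd-order
    operator in the paper works on latent features and may differ in the
    choice of sigmas, order, and thresholding.
    """
    dog = gaussian_filter(feature_map, sigma1) - gaussian_filter(feature_map, sigma2)
    return np.abs(dog) > threshold  # boolean mask of salient contour responses

fmap = np.random.rand(128, 128)
mask = dog_contours(fmap)
```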



Paperid:876
Authors:Vasily Alferov, Ivan Bliznets, Kirill Brilliantov
Independent Researcher, University of Groningen, St. Petersburg Department of Steklov Mathematical Institute of the RAS
Abstract:
In the paper, we study the Maximum Satisfiability and the Partial Maximum Satisfiability problems. Using Gallai–Edmonds decomposition, we significantly improve the upper bound for the Maximum Satisfiability problem parameterized above maximum matching in the variable-clause graph. Our algorithm operates with a runtime of O*(2.83^k'), a substantial improvement compared to the previous approach requiring O*(4^k'), where k' denotes the relevant parameter. Moreover, this result immediately implies O*(1.14977^m) and O*(1.27895^m) time algorithms for (n, 3)-MaxSAT and (n, 4)-MaxSAT, where m is the overall number of clauses. These upper bounds improve on the previously known bounds of O*(1.1554^m) and O*(1.2872^m). We also adapt the algorithm so that it can handle instances of Partial Maximum Satisfiability without losing performance in some cases. Note that this is somewhat surprising, as the existence of even one hard clause can significantly increase the hardness of a problem.



Paperid:877
Authors:Dmitrii Avdiukhin, Vaggos Chatziafratis, Konstantin Makarychev, Grigory Yaroslavtsev
Northwestern University, University of California, Santa Cruz, Northwestern University, George Mason University
Abstract:
Motivated by applications to classification problems on metric data, we study the Weighted Metric Clustering problem: given a metric d over n points and a k x k symmetric matrix A with nonnegative entries, the goal is to find a k-partition of these points into clusters C1,...,Ck, while minimizing the sum of A[i,j] * d(u,v) over all pairs of clusters Ci and Cj and all pairs of points u from Ci and v from Cj. Specific choices of A lead to Weighted Metric Clustering capturing well-studied graph partitioning problems in metric spaces, such as Min-Uncut, Min-k-Sum, Min-k-Cut, and more. Our main result is that Weighted Metric Clustering admits a polynomial-time approximation scheme (PTAS). Our algorithm handles all the above problems using the Sherali-Adams linear programming relaxation. This subsumes several prior works, unifies many of the techniques for various metric clustering objectives, and yields a PTAS for several new problems, including metric clustering on manifolds and a new family of hierarchical clustering objectives. Our experiments on the hierarchical clustering objective show that it better captures the ground-truth structural information compared to Dasgupta's popular objective.
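Under one natural reading of the objective above (every unordered point pair is weighted by the A-entry of its cluster pair, including within-cluster pairs when i = j), a minimal sketch for evaluating a candidate partition is:

```python
import numpy as np

def weighted_metric_clustering_cost(d, A, labels):
    """Objective of Weighted Metric Clustering as stated in the abstract.

    d: (n, n) array of metric distances between points.
    A: (k, k) symmetric nonnegative cluster-pair weights.
    labels: length-n array, labels[u] in {0, ..., k-1} is the cluster of u.
    Sums A[labels[u], labels[v]] * d[u, v] over all unordered point pairs.
    """
    n = d.shape[0]
    cost = 0.0
    for u in range(n):
        for v in range(u + 1, n):
            cost += A[labels[u], labels[v]] * d[u, v]
    return cost

# Toy usage: two clusters, identity A penalizes only within-cluster distances.
d = np.array([[0.0, 1.0, 4.0], [1.0, 0.0, 5.0], [4.0, 5.0, 0.0]])
A = np.eye(2)
print(weighted_metric_clustering_cost(d, A, [0, 0, 1]))  # 1.0
```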



Paperid:878
Authors:Alessandro Betti, Michele Casoni, Marco Gori, Simone Marullo, Stefano Melacci, Matteo Tiezzi
Inria, Lab I3S, MAASAI, Université Côte d'Azur, University of Siena, University of Siena Inria, Lab I3S, MAASAI, Université Côte d'Azur, University of Florence University of Siena, University of Siena, University of Siena
Abstract:
Optimal control deals with optimization problems in which variables steer a dynamical system, and its outcome contributes to the objective function. Two classical approaches to solving these problems are Dynamic Programming and the Pontryagin Maximum Principle. In both approaches, Hamiltonian equations offer an interpretation of optimality through auxiliary variables known as costates. However, Hamiltonian equations are rarely used due to their reliance on forward-backward algorithms across the entire temporal domain. This paper introduces a novel neural-based approach to optimal control. Neural networks are employed not only for implementing state dynamics but also for estimating costate variables. The parameters of the latter network are determined at each time step using a newly introduced local policy referred to as the time-reversed generalized Riccati equation. This policy is inspired by a result discussed in the Linear Quadratic (LQ) problem, which we conjecture stabilizes state dynamics. We support this conjecture by discussing experimental results from a range of optimal control case studies.



Paperid:879
Authors:Olaf Beyersdorff, Benjamin Böhm, Meena Mahajan
Friedrich Schiller University Jena, Friedrich Schiller University Jena, The Institute of Mathematical Sciences IMSc (a CI of Homi Bhabha National Institute HBNI)
Abstract:
Conflict-driven clause learning (CDCL) is the dominating algorithmic paradigm for SAT solving and hugely successful in practice. In its lifted version QCDCL, it is one of the main approaches for solving quantified Boolean formulas (QBF). In both SAT and QBF, proofs can be efficiently extracted from runs of (Q)CDCL solvers. While for CDCL, it is known that the proof size in the underlying proof system propositional resolution matches the CDCL runtime up to a polynomial factor, we show that in QBF there is an exponential gap between QCDCL runtime and the size of the extracted proofs in QBF resolution systems. We demonstrate that this is not just a gap between QCDCL runtime and the size of any QBF resolution proof, but even the extracted proofs are exponentially smaller for some instances. Hence searching for a small proof via QCDCL (even with non-deterministic decision policies) will provably incur an exponential overhead for some instances.



Paperid:880
Authors:Rishiraj Bhattacharyya, Sourav Chakraborty, Yash Pote, Uddalok Sarkar, Sayantan Sen
University of Birmingham, Indian Statistical Institute, National University of Singapore CREATE, Indian Statistical Institute, National University of Singapore
Abstract:
Samplers are the backbone of the implementations of any randomized algorithm. Unfortunately, efficient algorithms for testing the correctness of samplers are very hard to find. Recently, in a series of works, testers like Barbarik, Teq, and Flash were obtained for testing particular kinds of samplers, such as CNF-samplers and Horn-samplers. However, their techniques have a significant limitation because one cannot expect to use their methods to test other samplers, such as perfect matching samplers or samplers for sampling linear extensions in posets. In this paper, we present a new testing algorithm that works for such samplers and can estimate the distance of a new sampler from a known sampler (say, the uniform sampler). Testing the identity of distributions is the heart of testing the correctness of samplers. This paper's main technical contribution is developing a new distance estimation algorithm for distributions over high-dimensional cubes using the recently proposed subcube conditioning sampling model. Given subcube conditioning access to an unknown distribution P, and a known distribution Q defined over an n-dimensional Boolean hypercube, our algorithm CubeProbeEst estimates the variation distance between P and Q within additive error using subcube conditional samples from P. Following the testing-via-learning paradigm, we also get a tester that distinguishes between the cases when P and Q are close or far in variation distance with high probability using subcube conditional samples. This estimation algorithm CubeProbeEst in the subcube conditioning sampling model helps us to design the first tester for self-reducible samplers. The correctness of the tester is formally proved. Moreover, we implement CubeProbeEst to test the quality of three samplers for sampling linear extensions in posets.



Paperid:881
Authors:Pierre Carbonnelle, Gottfried Schenner, Maurice Bruynooghe, Bart Bogaerts, Marc Denecker
KU Leuven, Siemens, KU Leuven, Vrije Universiteit Brussels, KU Leuven
Abstract:
We analyze how symmetries can be used to compress structures (also known as interpretations) onto a smaller domain without loss of information. This analysis suggests the possibility of solving satisfiability problems in the compressed domain for better performance. Thus, we propose a novel two-step method: (i) the sentence to be satisfied is automatically translated into an equisatisfiable sentence over a ``lifted'' vocabulary that allows domain compression; (ii) satisfiability of the lifted sentence is checked by growing the (initially unknown) compressed domain until a satisfying structure is found. The key issue is to ensure that this satisfying structure can always be expanded into an uncompressed structure that satisfies the original sentence to be satisfied. We present an adequate translation for sentences in typed first-order logic extended with aggregates. Our experimental evaluation shows large speedups for generative configuration problems. The method also has applications in the verification of software operating on complex data structures. Our results justify further research in automatic translation of sentences for symmetry reduction.



Paperid:882
Authors:Xingdi Chen, Yu Xiong, Kai Yang
Department of Computer Science and Technology, Tongji University, China, Department of Computer Science and Technology, Tongji University, China, Department of Computer Science and Technology, Tongji University, China Key Laboratory of Embedded System and Service Computing Ministry of Education at Tongji University Shanghai Research Institute for Intelligent Autonomous Systems
Abstract:
Utilizing inter-base-station cooperation for information processing has shown great potential in enhancing the overall quality of communication services (QoS) in wireless communication networks. Nevertheless, such cooperation requires the knowledge of channel state information (CSI) at base stations (BSs), which is assumed to be perfectly known. However, CSI errors are inevitable in practice, which necessitates beamforming techniques that can achieve robust performance in the presence of channel estimation errors. Existing approaches relax the robust beamforming design problems into semidefinite programming (SDP), which can only achieve a solution that is far from being optimal. To this end, this paper views robust beamforming design problems from a bilevel optimization perspective. In particular, we focus on maximizing the worst-case weighted sum-rate (WSR) in the downlink multi-cell multi-user multiple-input single-output (MISO) system considering bounded CSI errors. We first reformulate this problem into a bilevel optimization problem and then develop an efficient algorithm based on the cutting plane method. A distributed optimization algorithm has also been developed to facilitate the parallel processing in practical settings. Numerical results are provided to confirm the effectiveness of the proposed algorithm in terms of performance and complexity, particularly in the presence of CSI uncertainties.
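For reference, the worst-case WSR objective under a bounded CSI error takes the following textbook form in the single-cell MU-MISO case (w_k are beamformers, omega_k user weights, epsilon the error bound, P the power budget); the paper's multi-cell setting additionally includes inter-cell interference terms and per-BS power constraints:

```latex
\max_{\{\mathbf{w}_k\}}\;\; \min_{\|\Delta\mathbf{h}_k\|_2 \le \varepsilon}\;
\sum_{k} \omega_k \log_2\!\left(1 +
\frac{\bigl|(\mathbf{h}_k+\Delta\mathbf{h}_k)^{\mathsf{H}}\mathbf{w}_k\bigr|^2}
{\sigma_k^2 + \sum_{j\neq k}\bigl|(\mathbf{h}_k+\Delta\mathbf{h}_k)^{\mathsf{H}}\mathbf{w}_j\bigr|^2}\right)
\quad \text{s.t.}\;\; \sum_{k}\|\mathbf{w}_k\|_2^2 \le P.
```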



Paperid:883
Authors:Leroy Chew, Alexis de Colnet, Friedrich Slivovsky, Stefan Szeider
TU Wien, TU Wien, University of Liverpool, TU Wien
Abstract:
Parity reasoning is challenging for Conflict-Driven Clause Learning (CDCL) SAT solvers. This has been observed even for simple formulas encoding two contradictory parity constraints with different variable orders (Chew and Heule 2020). We provide an analytical explanation for their hardness by showing that they require exponential resolution refutations with high probability when the variable order is chosen at random. We obtain this result by proving that these formulas, which are known to be Tseitin formulas, have Tseitin graphs of linear treewidth with high probability. Since such Tseitin formulas require exponential resolution refutations, our result follows. We generalize this argument to a new class of formulas that capture a basic form of parity reasoning involving a sum of two random parity constraints with random orders. Even when the variable order for the sum is chosen favorably, these formulas remain hard for resolution. In contrast, we prove that they have short DRAT refutations. We show experimentally that the running time of CDCL SAT solvers on both classes of formulas grows exponentially with their treewidth.



Paperid:884
Authors:Liang Dai, Kejie Lyu, Chengcheng Zhang, Guangming Zhao, Zhonglin Zu, Liang Wang, Bo Zheng
Alibaba Group, Alibaba Group, Alibaba Group, Alibaba Group, Alibaba Group, Alibaba Group, Alibaba Group
Abstract:
Guaranteed display (GD) advertising is a critical component of advertising since it provides publishers with stable revenue and enables advertisers to target specific audiences with guaranteed impressions. However, smooth pacing control for online ad delivery presents a challenge due to significant budget disparities, user arrival distribution drift, and dynamic change between supply and demand. This paper presents robust risk-constrained pacing (RCPacing) that utilizes Lagrangian dual multipliers to fine-tune probabilistic throttling through monotonic mapping functions within the percentile space of impression performance distribution. RCPacing combines distribution drift resilience and compatibility with the guaranteed allocation mechanism, enabling us to provide near-optimal online services. We also show that RCPacing achieves O(sqrt(T)) dynamic regret where T is the length of the horizon. RCPacing's effectiveness is validated through offline evaluations and online A/B testing conducted on the Taobao brand advertising platform.



Paperid:885
Authors:Yu-Wei Fan, Jie-Hong R. Jiang
National Taiwan University, National Taiwan University
Abstract:
Stochastic Boolean satisfiability (SSAT) is a natural formalism for optimization under uncertainty. Its decision version implicitly imposes a final threshold quantification on an SSAT formula. However, the single threshold quantification restricts the expressive power of SSAT. In this work, we enrich SSAT with an additional threshold quantifier, resulting in a new formalism SSAT(θ). The increased expressiveness allows SSAT(θ), which remains in the PSPACE complexity class, to subsume and encode the languages in the counting hierarchy. An SSAT(θ) solver, ClauSSat(θ), is developed. Experiments show the applicability of the solver in uniquely solving complex SSAT(θ) instances of parameter synthesis and SSAT extension.



Paperid:886
Authors:Johannes K. Fichte, Tobias Geibinger, Markus Hecher, Matthias Schlögel
Linköping University, TU Wien, Massachusetts Institute of Technology, TU Wien
Abstract:
Computational evaluations are crucial in modern problem-solving when we surpass theoretical algorithms or bounds. These experiments frequently require considerable work, and the sheer amount of needed resources makes it impossible to execute them on a single personal computer or laptop. Cluster schedulers allow for automating these tasks and scale to many computers. But, when we evaluate implementations of combinatorial algorithms, we depend on stable runtime results. Common approaches either limit parallelism or suffer from unstable runtime measurements due to interference among jobs on modern hardware. The former is inefficient and not sustainable. The latter results in unreplicable experiments. In this work, we address this issue and offer an acceptable balance between efficiency, software, hardware complexity, reliability, and replicability. We investigate effects on replicability and stability and illustrate how to efficiently use widely employed cluster resources for parallel evaluations. Furthermore, we present solutions which mitigate issues that emerge from the concurrent execution of benchmark jobs. Our experimental evaluation shows that – despite parallel execution – our approach reduces the runtime instability on the majority of instances to one second.



Paperid:887
Authors:Till Fluschnik, Leon Kellerhals, Malte Renken
TU Clausthal, TU Berlin, TU Berlin
Abstract:
We introduce the algorithmic problem of finding a locally rainbow path of length l connecting two distinguished vertices s and t in a vertex-colored directed graph. Herein, a path is locally rainbow if between any two visits of equally colored vertices, the path traverses consecutively at least r differently colored vertices. This problem generalizes the well-known problem of finding a rainbow path. It finds natural applications whenever there are different types of resources that must be protected from overuse, such as crop sequence optimization or production process scheduling. We show that the problem is computationally intractable even if r=2 or if one looks for a locally rainbow path among the shortest paths. On the positive side, if one looks for a path that takes only a short detour (i.e., it is slightly longer than the shortest path) and if r is small, the problem can be solved efficiently. Indeed, the running time of the respective algorithm is near-optimal unless the ETH fails.
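A minimal checker for the locally rainbow property, under one plausible reading of the definition above (between any two visits of the same color, the intermediate vertices must use at least r distinct colors), could look like:

```python
def is_locally_rainbow(path_colors, r):
    """Check whether a colored path is locally rainbow.

    path_colors: vertex colors along the path, in traversal order.
    One plausible reading of the definition: for any two positions with the
    same color, the vertices strictly between them use at least r distinct
    colors.
    """
    n = len(path_colors)
    for i in range(n):
        for j in range(i + 1, n):
            if path_colors[i] == path_colors[j]:
                if len(set(path_colors[i + 1:j])) < r:
                    return False
    return True

# Example with r = 2: a b c a is locally rainbow, a b a is not.
assert is_locally_rainbow(['a', 'b', 'c', 'a'], 2)
assert not is_locally_rainbow(['a', 'b', 'a'], 2)
```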



Paperid:888
Authors:Cunjing Ge
Nanjing University
Abstract:
Counting integer solutions of linear constraints has found interesting applications in various fields. It is equivalent to the problem of counting lattice points inside a polytope. However, state-of-the-art algorithms for this problem become too slow for even a modest number of variables. In this paper, we propose a new framework to approximate the lattice counts inside a polytope with a new random-walk sampling method. The counts computed by our approach are proven to satisfy an (epsilon, delta)-bound. Experiments on extensive benchmarks show that our algorithm could solve polytopes with dozens of dimensions, which significantly outperforms state-of-the-art counters.
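For contrast, the exhaustive baseline that becomes infeasible beyond a handful of dimensions, and that the proposed random-walk sampling is designed to avoid, can be sketched as follows (the integer bounding box is assumed to be given):

```python
import itertools
import numpy as np

def count_lattice_points_bruteforce(A, b, lower, upper):
    """Exact lattice-point count of {x integer : A x <= b} inside a box.

    A: (m, n) constraint matrix, b: (m,) right-hand sides.
    lower, upper: length-n integer bounds of a box containing the polytope.
    This exhaustive baseline is exponential in the dimension n, which is
    exactly the regime the sampling-based approach targets.
    """
    ranges = [range(lo, hi + 1) for lo, hi in zip(lower, upper)]
    count = 0
    for point in itertools.product(*ranges):
        if np.all(A @ np.array(point) <= b):
            count += 1
    return count

# Example: lattice points of the 2D simplex x >= 0, y >= 0, x + y <= 3.
A = np.array([[-1, 0], [0, -1], [1, 1]])
b = np.array([0, 0, 3])
print(count_lattice_points_bruteforce(A, b, [0, 0], [3, 3]))  # 10
```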



Paperid:889
Authors:Ramiz Gindullin, Nicolas Beldiceanu, Jovial Cheukam-Ngouonou, Rémi Douence, Claude-Guy Quimper
IMT Atlantique, Nantes, France LS2N, Nantes, France, IMT Atlantique, Nantes, France LS2N, Nantes, France, IMT Atlantique, Nantes, France LS2N, Nantes, France Université Laval, Quebec City, Canada, IMT Atlantique, Nantes, France LS2N, Nantes, France INRIA, Nantes, France, Université Laval, Quebec City, Canada
Abstract:
Given a table with a minimal set of input columns that functionally determines an output column, we introduce a method that tries to gradually decompose the corresponding minimal functional dependency (mfd) to acquire a formula expressing the output column in terms of the input columns. A first key element of the method is to create subproblems that are easier to solve than the original formula acquisition problem, either because it learns formulae with fewer input parameters, or because it focuses on formulae of a particular class, such as Boolean formulae; as a result, the acquired formulae can mix different learning biases such as polynomials, conditionals or Boolean expressions. A second key feature of the method is that it can be applied recursively to find formulae that combine polynomial, conditional or Boolean sub-terms in a nested manner. The method was tested on data for eight families of combinatorial objects; new conjectures were found that were previously unattainable. The method often creates conjectures that combine several formulae into one with a limited number of automatically found Boolean terms.



Paperid:890
Authors:Stephan Gocht, Ciaran McCreesh, Magnus O. Myreen, Jakob Nordström, Andy Oertel, Yong Kiam Tan
Lund University, Lund, Sweden University of Copenhagen, Copenhagen, Denmark, University of Glasgow, Glasgow, Scotland, Chalmers University of Technology, Gothenburg, Sweden, University of Copenhagen, Copenhagen, Denmark Lund University, Lund, Sweden, Lund University, Lund, Sweden University of Copenhagen, Copenhagen, Denmark, Institute for Infocomm Research (I2R), A*STAR, Singapore
Abstract:
Modern subgraph-finding algorithm implementations consist of thousands of lines of highly optimized code, and this complexity raises questions about their trustworthiness. Recently, some state-of-the-art subgraph solvers have been enhanced to output machine-verifiable proofs that their results are correct. While this significantly improves reliability, it is not a fully satisfactory solution, since end-users have to trust both the proof checking algorithms and the translation of the high-level graph problem into a low-level 0-1 integer linear program (ILP) used for the proofs. In this work, we present the first formally verified toolchain capable of full end-to-end verification for subgraph solving, which closes both of these trust gaps. We have built encoder frontends for various graph problems together with a 0-1 ILP (a.k.a. pseudo-Boolean) proof checker, all implemented and formally verified in the CakeML ecosystem. This toolchain is flexible and extensible, and we use it to build verified proof checkers for both decision and optimization graph problems, namely, subgraph isomorphism, maximum clique, and maximum common (connected) induced subgraph. Our experimental evaluation shows that end-to-end formal verification is now feasible for a wide range of hard graph problems.



Paperid:891
Authors:Mikoláš Janota, Choiwah Chow, João Araújo, Michael Codish, Petr Vojtěchovský
Czech Technical University in Prague, Czechia, Czech Technical University in Prague, Czechia, Center for Mathematics and Applications (NOVA Math), Portugal Department of Mathematics, NOVA FCT, Portugal, Ben-Gurion University of the Negev, Beer-Sheva, Israel, University of Denver, USA
Abstract:
This paper proposes SAT-based techniques to calculate a specific normal form of a given finite mathematical structure (model). The normal form is obtained by permuting the domain elements so that the representation of the structure is the lexicographically smallest possible. Such a normal form is of interest to mathematicians as it enables easy cataloging of algebraic structures. In particular, two structures are isomorphic precisely when their normal forms are the same. This form is also natural to inspect as mathematicians have been using it routinely for many decades. We develop a novel approach where a SAT solver is used in a black-box fashion to compute the smallest representative. The approach constructs the representative gradually and searches the space of possible isomorphisms, requiring a small number of variables. However, the approach may lead to a large number of SAT calls and therefore we devise propagation techniques to reduce this number. The paper focuses on finite structures with a single binary operation (encompassing groups, semigroups, etc.). However, the approach is generalizable to arbitrary finite structures. We provide an implementation of the proposed algorithm and evaluate it on a variety of algebraic structures.
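For small domains, the normal form described above can be computed by brute force over all domain permutations; the sketch below (not the paper's SAT-based procedure) illustrates the definition on a single binary operation:

```python
import itertools

def lex_smallest_form(table):
    """Brute-force lexicographically smallest form of a binary operation.

    table: n x n list of lists over domain {0, ..., n-1} (a Cayley table).
    Tries every permutation of the domain and keeps the relabelled table
    whose row-major flattening is lexicographically smallest. This naive
    baseline is factorial in n; the paper replaces the search with SAT calls.
    """
    n = len(table)
    best = None
    for perm in itertools.permutations(range(n)):
        inv = [0] * n
        for old, new in enumerate(perm):
            inv[new] = old
        relabelled = tuple(
            perm[table[inv[i]][inv[j]]] for i in range(n) for j in range(n)
        )
        if best is None or relabelled < best:
            best = relabelled
    return best

# Two isomorphic constant semigroups get the same normal form (0, 0, 0, 0).
print(lex_smallest_form([[0, 0], [0, 0]]))
print(lex_smallest_form([[1, 1], [1, 1]]))
```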



Paperid:892
Authors:Javier Larrosa, Conrado Martínez, Emma Rollon
Universitat Politècnica de Catalunya, Universitat Politècnica de Catalunya, Universitat Politècnica de Catalunya
Abstract:
The Implicit Hitting Set (HS) approach has proven very effective for MaxSAT solving. However, only preliminary promising results have been obtained for the very similar Weighted CSP framework. In this paper we contribute towards both a better theoretical understanding of the HS approach and more effective HS-based solvers for WCSP. First, we bound the minimum number of iterations of HS thanks to what we call distinguished cores. Then, we show a source of inefficiency by introducing two simple problems where HS is infeasible. Next, we propose two reformulation methods that merge cost-functions to overcome the problem. We provide a theoretical analysis that quantifies the magnitude of the improvement of each method with respect to the number of iterations of the algorithm. In particular, we show that the reformulations can bring an exponential number of iterations down to a constant number in our working examples. Finally, we complement our theoretical analysis with two sets of experiments. First, we show that our results are aligned with real executions. Second, and most importantly, we conduct experiments on typical benchmark problems and show that cost-function merging may be heuristically applied and it may accelerate HS algorithms by several orders of magnitude. In some cases, it even outperforms state-of-the-art solvers.



Paperid:893
Authors:Kevin Leo, Grame Gange, Maria Garcia de la Banda, Mark Wallace
Department of Data Science & AI (DSAI), Monash University, Australia, Department of Data Science & AI (DSAI), Monash University, Australia, Department of Data Science & AI (DSAI), Monash University, Australia ARC Training Centre in Optimisation Technologies, Integrated Methodologies, and Applications (OPTIMA), Australia, Department of Data Science & AI (DSAI), Monash University, Australia
Abstract:
SAT and propagation solvers often underperform for optimisation models whose objective sums many single-variable terms. MaxSAT solvers avoid this by detecting and exploiting cores: subsets of these terms that cannot collectively take their lower bounds. Previous work has shown manual analysis of cores can help define model reformulations likely to speed up solving for many model instances. This paper presents a method to automate this process. For each selected core the method identifies the instance constraints that caused it; infers the model constraints and parameters that explain how these instance constraints were formed; and learns the conditions that made those model constraint instances generate cores, while others did not. It then uses this information to reformulate the objective. The empirical evaluation shows this method can produce useful reformulations. Importantly, the method can be useful in many other situations that require explaining a set of constraints.



Paperid:894
Authors:Tianhao Liu, Shanwen Pu, Dongdong Ge, Yinyu Ye
Shanghai University of Finance and Economics, Shanghai University of Finance and Economics, Shanghai University of Finance and Economics, Stanford University
Abstract:
Linear programming has been practically solved mainly by simplex and interior point methods. Compared with the weakly polynomial complexity obtained by the interior point methods, the existence of strongly polynomial bounds for the length of the pivot path generated by the simplex methods remains a mystery. In this paper, we propose two novel pivot experts that leverage both global and local information of the linear programming instances for the primal simplex method and show their excellent performance numerically. The experts can be regarded as a benchmark to evaluate the performance of classical pivot rules, although they are hard to directly implement. To tackle this challenge, we employ a graph convolutional neural network model, trained via imitation learning, to mimic the behavior of the pivot expert. Our pivot rule, learned empirically, displays a significant advantage over conventional methods in various linear programming problems, as demonstrated through a series of rigorous experiments.



Paperid:895
Authors:Mohsen Nafar, Michael Römer
Bielefeld University, Bielefeld University
Abstract:
Offering a generic approach to obtaining both upper and lower bounds, decision diagrams (DDs) are becoming an increasingly important tool for solving discrete optimization problems. In particular, they provide a powerful and often complementary alternative to other well-known generic bounding mechanisms such as the LP relaxation. A standard approach to employ DDs for discrete optimization is to formulate the problem as a Dynamic Program and use that formulation to compile a DD top-down in a layer-by-layer fashion. To limit the size of the resulting DD and to obtain bounds, one typically imposes a maximum width for each layer which is then enforced by either merging nodes (resulting in a so-called relaxed DD that provides a dual bound) or by dropping nodes (resulting in a so-called restricted DD that provides a primal bound). The quality of the DD bounds obtained from this top-down compilation process heavily depends on the heuristics used for the selection of the nodes to merge or drop. While it is sometimes possible to engineer problem-specific heuristics for this selection problem, the most generic approach relies on sorting the layer’s nodes based on objective function information. In this paper, we propose a generic and problem-agnostic approach that relies on clustering nodes based on the state information associated with each node. In a set of computational experiments with different knapsack and scheduling problems, we show that our approach generally outperforms the classical generic approach, and often achieves drastically better bounds both with respect to the size of the DD and the time used for compiling the DD.
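A minimal sketch of restricted-DD compilation for the 0/1 knapsack problem, using the classical objective-based node selection that the paper's state-clustering heuristic is compared against (the instance and the width below are illustrative assumptions):

```python
def restricted_dd_knapsack(values, weights, capacity, max_width):
    """Restricted decision diagram for 0/1 knapsack (gives a primal bound).

    Compiles the DD top-down, layer by layer; each node is a state
    (remaining capacity) keeping the best accumulated value. When a layer
    exceeds max_width, the worst nodes are dropped after sorting on the
    value accumulated so far (the classical objective-based heuristic).
    """
    layer = {capacity: 0}                       # state -> best value so far
    for v, w in zip(values, weights):
        nxt = {}
        for cap, val in layer.items():
            for take in (0, 1):
                if take and w > cap:
                    continue
                ncap, nval = cap - take * w, val + take * v
                if nval > nxt.get(ncap, -1):
                    nxt[ncap] = nval
        if len(nxt) > max_width:                # restrict: drop worst nodes
            kept = sorted(nxt.items(), key=lambda kv: kv[1], reverse=True)
            nxt = dict(kept[:max_width])
        layer = nxt
    return max(layer.values())                  # feasible, hence a primal bound

print(restricted_dd_knapsack([6, 5, 4], [5, 4, 3], 8, max_width=2))  # 10
```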



Paperid:896
Authors:Anh Duc Nguyen, Tuan Dung Nguyen, Quang Minh Nguyen, Hoang H. Nguyen, Lam M. Nguyen, Kim-Chuan Toh
National University of Singapore, University of Pennsylvania, Massachusetts Institute of Technology, Georgia Tech, IBM Research, Thomas J. Watson Research Center, National University of Singapore Institute of Operations Research and Analytics, National University of Singapore
Abstract:
This paper studies the Partial Optimal Transport (POT) problem between two unbalanced measures with at most n supports and its applications in various AI tasks such as color transfer or domain adaptation. There is hence a need for fast approximations of POT with increasingly large problem sizes in arising applications. We first theoretically and experimentally investigate the infeasibility of the state-of-the-art Sinkhorn algorithm for POT, which consequently degrades its qualitative performance in real-world applications like point-cloud registration. To this end, we propose a novel rounding algorithm for POT, and then provide a feasible Sinkhorn procedure with a revised computation complexity of O(n^2/epsilon^4). Our rounding algorithm also permits the development of two first-order methods to approximate the POT problem. The first algorithm, Adaptive Primal-Dual Accelerated Gradient Descent (APDAGD), finds an epsilon-approximate solution to the POT problem in O(n^2.5/epsilon) time. The second method, Dual Extrapolation, achieves the computation complexity of O(n^2/epsilon), thereby being the best in the literature. We further demonstrate the flexibility of POT compared to standard OT as well as the practicality of our algorithms on real applications where two marginal distributions are unbalanced.
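For background, the standard balanced Sinkhorn scaling that the POT analysis builds on is sketched below; the partial-transport setting and the paper's rounding step are not included, and the regularization strength and iteration count are assumptions:

```python
import numpy as np

def sinkhorn(a, b, C, epsilon=0.05, num_iters=500):
    """Standard Sinkhorn scaling for balanced entropic optimal transport.

    a, b: source/target marginals (each summing to 1); C: (n, m) cost matrix.
    The POT setting transports only part of the mass and requires a rounding
    step to restore feasibility; neither is shown here.
    """
    K = np.exp(-C / epsilon)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(num_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan

a = np.full(4, 0.25)
b = np.full(4, 0.25)
C = np.abs(np.subtract.outer(np.arange(4), np.arange(4))).astype(float)
P = sinkhorn(a, b, C)
print(P.sum(axis=1), P.sum(axis=0))        # both approximately equal a and b
```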



Paperid:897
Authors:Amar Shah, Federico Mora, Sanjit A. Seshia
University of California, Berkeley, University of California, Berkeley, University of California, Berkeley
Abstract:
Algebraic data types (ADTs) are a construct classically found in functional programming languages that capture data structures like enumerated types, lists, and trees. In recent years, interest in ADTs has increased. For example, popular programming languages, like Python, have added support for ADTs. Automated reasoning about ADTs can be done using satisfiability modulo theories (SMT) solving, an extension of the Boolean satisfiability problem with first-order logic and associated background theories. Unfortunately, SMT solvers that support ADTs do not scale as state-of-the-art approaches all use variations of the same lazy approach. In this paper, we present an SMT solver that takes a fundamentally different approach, an eager approach. Specifically, our solver reduces ADT queries to a simpler logical theory, uninterpreted functions (UF), and then uses an existing solver on the reduced query. We prove the soundness and completeness of our approach and demonstrate that it outperforms the state of the art on existing benchmarks, as well as a new, more challenging benchmark set from the planning domain.



Paperid:898
Authors:Arijit Shaw, Brendan Juba, Kuldeep S. Meel
Chennai Mathematical Institute, India IAI, TCG-CREST, Kolkata, India, Washington University in St Louis, USA, University of Toronto, Canada
Abstract:
One approach to probabilistic inference involves counting the number of models of a given Boolean formula. Here, we are interested in inferences involving higher-order objects, i.e., functions. We study the following task: Given a Boolean specification between a set of inputs and outputs, count the number of functions of inputs such that the specification is met. Such functions are called Skolem functions. We are motivated by the recent development of scalable approaches to Boolean function synthesis. This development stands to our problem as Boolean satisfiability stands to model counting. Yet, counting Skolem functions poses considerable new challenges. From the complexity-theoretic standpoint, counting Skolem functions is not only #P-hard; it is quite unlikely to have an FPRAS (Fully Polynomial Randomized Approximation Scheme) as the problem of synthesizing a Skolem function remains challenging, even given access to an NP oracle. The primary contribution of this work is the first algorithm, SkolemFC, that computes the number of Skolem functions. SkolemFC relies on technical connections between counting functions and propositional model counting: our algorithm makes a linear number of calls to an approximate model counter and computes an estimate of the number of Skolem functions with theoretical guarantees. Our prototype displays impressive scalability, handling benchmarks comparably to state-of-the-art Skolem function synthesis engines, even though counting all such functions ostensibly poses a greater challenge than synthesizing a single function.



Paperid:899
Authors:Jintao Song, Wenqi Lu, Yunwen Lei, Yuchao Tang, Zhenkuan Pan, Jinming Duan
School of Computer Science, University of Birmingham College of Computer Science and Technology, Qingdao University, Department of Computing and Mathematics, Manchester Metropolitan University Centre for Computational Science and Mathematical Modelling, Coventry University, Department of Mathematics, University of Hong Kong, School of Mathematics and Information Science, Guangzhou University, College of Computer Science and Technology, Qingdao University, School of Computer Science, University of Birmingham
Abstract:
The Alternating Direction Method of Multipliers (ADMM) has gained significant attention across a broad spectrum of machine learning applications. Incorporating the over-relaxation technique shows potential for enhancing the convergence rate of ADMM. However, determining optimal algorithmic parameters, including both the associated penalty and relaxation parameters, often relies on empirical approaches tailored to specific problem domains and contextual scenarios. Incorrect parameter selection can significantly hinder ADMM's convergence rate. To address this challenge, in this paper we first propose a general approach to optimize the value of the penalty parameter, followed by a novel closed-form formula to compute the optimal relaxation parameter in the context of linear quadratic problems (LQPs). We then experimentally validate our parameter selection methods through random instantiations and diverse imaging applications, encompassing diffeomorphic image registration, image deblurring, and MRI reconstruction.
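For reference, the standard over-relaxed ADMM iteration (scaled dual form) for minimizing f(x) + g(z) subject to Ax + Bz = c, with penalty rho and relaxation parameter alpha, reads as follows; the paper's closed-form choice of alpha for LQPs is not reproduced here:

```latex
\begin{aligned}
x^{k+1} &= \arg\min_{x}\; f(x) + \tfrac{\rho}{2}\,\bigl\|Ax + Bz^{k} - c + u^{k}\bigr\|_2^2,\\
\widehat{Ax}^{k+1} &= \alpha\,Ax^{k+1} - (1-\alpha)\,(Bz^{k} - c),\\
z^{k+1} &= \arg\min_{z}\; g(z) + \tfrac{\rho}{2}\,\bigl\|\widehat{Ax}^{k+1} + Bz - c + u^{k}\bigr\|_2^2,\\
u^{k+1} &= u^{k} + \widehat{Ax}^{k+1} + Bz^{k+1} - c.
\end{aligned}
```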



Paperid:900
Authors:Giuseppe Spallitta, Roberto Sebastiani, Armin Biere
University of Trento, University of Trento, University of Freiburg
Abstract:
A basic algorithm for enumerating disjoint propositional models (disjoint AllSAT) is based on adding blocking clauses incrementally, ruling out previously found models. On the one hand, blocking clauses have the potential to reduce the number of generated models exponentially, as they can handle partial models. On the other hand, the introduction of a large number of blocking clauses affects memory consumption and drastically slows down unit propagation. We propose a new approach that allows for enumerating disjoint partial models with no need for blocking clauses by integrating: Conflict-Driven Clause-Learning (CDCL), Chronological Backtracking (CB), and methods for shrinking models (Implicant Shrinking). Experiments clearly show the benefits of our novel approach.
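The classical blocking-clause enumeration loop described above can be sketched with the PySAT library (assumed available); this is the baseline whose blocking clauses the proposed CDCL/chronological-backtracking approach avoids:

```python
from pysat.solvers import Glucose3

def enumerate_models_with_blocking(clauses):
    """Classical AllSAT via blocking clauses (the baseline the paper avoids).

    After each satisfying total assignment, its negation is added as a
    blocking clause so the same model cannot be found again. Memory use and
    unit-propagation cost grow with every model found.
    """
    models = []
    with Glucose3(bootstrap_with=clauses) as solver:
        while solver.solve():
            model = solver.get_model()
            models.append(model)
            solver.add_clause([-lit for lit in model])  # block this model
    return models

# Example: (x1 or x2) has three total models over {x1, x2}.
print(enumerate_models_with_blocking([[1, 2]]))
```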



Paperid:901
Authors:Miguel Terra-Neves, José Amaral, Alexandre Lemos, Rui Quintino, Pedro Resende, Antonio Alegria
OutSystems, OutSystems, OutSystems, OutSystems, Zharta, OutSystems
Abstract:
Graph matching is a fundamental problem in pattern recognition, with many applications such as software analysis and computational biology. One well-known type of graph matching problem is graph isomorphism, which consists of deciding if two graphs are identical. Despite its usefulness, the properties that one may check using graph isomorphism are rather limited, since it only allows strict equality checks between two graphs. For example, it does not allow one to check complex structural properties such as if the target graph is an arbitrary length sequence followed by an arbitrary size loop. We propose a generalization of graph isomorphism that allows one to check such properties through a declarative specification. This specification is given in the form of a Regular Graph Pattern (ReGaP), a special type of graph, inspired by regular expressions, that may contain wildcard nodes that represent arbitrary structures such as variable-sized sequences or subgraphs. We propose a SAT-based algorithm for checking if a target graph matches a given ReGaP. We also propose a preprocessing technique for improving the performance of the algorithm and evaluate it through an extensive experimental evaluation on benchmarks from the CodeSearchNet dataset.



Paperid:902
Authors:Kerian Thuillier, Anne Siegel, Loïc Paulevé
Univ. Rennes, Inria, CNRS, IRISA, UMR6074, F-35000 Rennes, France, Univ. Rennes, Inria, CNRS, IRISA, UMR6074, F-35000 Rennes, France, Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400 Talence, France
Abstract:
Bioinformatics has always been a prolific domain for generating complex satisfiability and optimization problems. For instance, the synthesis of multiscale models of biological networks has recently been associated with the resolution of optimization problems mixing Boolean logic and universally quantified linear constraints (OPT+qLP), which can be benchmarked on real-world models. In this paper, we introduce a Counter-Example-Guided Abstraction Refinement (CEGAR) approach to solve such problems efficiently. Our CEGAR exploits monotone properties inherent to linear optimization in order to generalize counter-examples of Boolean relaxations. We implemented our approach by extending the Answer Set Programming (ASP) solver Clingo with a quantified linear constraints propagator. Our prototype enables exploiting independence of sub-formulas to further exploit the generalization of counter-examples. We evaluate the impact of refinement and partitioning on two sets of OPT+qLP problems inspired by systems biology. Additionally, we conducted a comparison with the state-of-the-art ASP solver Clingo[lpx] that handles non-quantified linear constraints, showing the advantage of our CEGAR approach for solving large problems.



Paperid:903
Authors:Dimosthenis Tsouros, Senne Berden, Tias Guns
KU Leuven, KU Leuven, KU Leuven
Abstract:
Constraint Programming (CP) has been successfully used to model and solve complex combinatorial problems. However, modeling is often not trivial and requires expertise, which is a bottleneck to wider adoption. In Constraint Acquisition (CA), the goal is to assist the user by automatically learning the model. In (inter)active CA, this is done by interactively posting queries to the user, e.g. does this partial solution satisfy your (unspecified) constraints or not. While interactive CA methods learn the constraints, the learning is related to symbolic concept learning, as the goal is to learn an exact representation. However, a large number of queries is required to learn the model, which is a major limitation. In this paper, we aim to alleviate this limitation by tightening the connection of CA and Machine Learning (ML), by, for the first time in interactive CA, exploiting statistical ML methods. We propose to use probabilistic classification models to guide interactive CA queries to the most promising parts. We discuss how to train classifiers to predict whether a candidate expression from the bias is a constraint of the problem or not, using both relation-based and scope-based features. We then show how the predictions can be used in all layers of interactive CA: the query generation, the scope finding, and the lowest-level constraint finding. We experimentally evaluate our proposed methods using different classifiers and show that our methods greatly outperform the state of the art, decreasing the number of queries needed to converge by up to 72%.



Paperid:904
Authors:Chaoyun Wang, Jingmin Xin, Nanning Zheng, Caigui Jiang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center for Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center for Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center for Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence National Engineering Research Center for Visual Information and Applications Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Abstract:
In the context of surface representations, we find a natural structural similarity between grid surfaces and image data. Motivated by this inspiration, we propose a novel approach: encoding grid surfaces as geometric images and using image processing methods to address surface optimization-related problems. As a result, we have created the first dataset for grid surface optimization and devised a learning-based grid surface optimization network specifically tailored to geometric images, addressing the surface optimization problem through a data-driven learning of geometric constraints paradigm. We conduct extensive experiments on developable surface optimization, surface flattening, and surface denoising tasks using the designed network and datasets. The results demonstrate that our proposed method not only addresses the surface optimization problem better than traditional numerical optimization methods, especially for complex surfaces, but also boosts the optimization speed by multiple orders of magnitude. This pioneering study successfully applies deep learning methods to the field of surface optimization and provides a new solution paradigm for similar tasks, which will provide inspiration and guidance for future developments in the field of discrete surface optimization. The code and dataset are available at https://github.com/chaoyunwang/GSO-Net.



Paperid:905
Authors:Ruiwei Wang
National University of Singapore
Abstract:
Recently, the Binary Constraint Tree (BCT), a tree-structured Binary Constraint Network (BCN), has been shown to be more succinct than various ad-hoc constraints. In this paper, we investigate the modelling power of a well-known tractable hybrid class generalizing BCT, i.e., the class of BCNs satisfying the Broken Triangle Property (BTP), called BTP Networks (BTPNs). We show that the consistency checker of BTPN can be computed by a polysize monotone circuit; thus, some global constraints cannot be encoded as polysize BTPNs, such as the AllDifferent and Linear constraints. Our study then reveals that BTPN is strictly more succinct than the DNNF constraint and all 14 ad-hoc constraints discussed in (Wang and Yap 2023), such as the context-free grammar, BCT and smart table constraints. Furthermore, we also show that BTPN is as powerful as DNNF in terms of computing various operations and queries. In addition, we prove that it is NP-hard to determine the minimum-sized BTPN encoding a constraint.



Paperid:906
Authors:Boris Wiegand, Dietrich Klakow, Jilles Vreeken
SHS - Stahl-Holding-Saar Saarland University, Saarland University, CISPA Helmholtz Center for Information Security
Abstract:
Constraint programming and AI planning are powerful tools for solving assignment, optimization, and scheduling problems. They require, however, the rarely available combination of domain knowledge and mathematical modeling expertise. Learning constraints from exemplary solutions can close this gap and alleviate the effort of modeling. Existing approaches either require extensive user interaction, need exemplary invalid solutions that must be generated by experts at great expense, or show high noise sensitivity. We aim to find constraints from potentially noisy solutions, without the need for user interaction. To this end, we formalize the problem in terms of the Minimum Description Length (MDL) principle, by which we select the model with the best lossless compression of the data. Solving the problem involves model counting, which is #P-hard to approximate. We therefore propose the greedy URPILS algorithm to find high-quality constraints in practice. Extensive experiments on constraint programming and AI planning benchmark data show URPILS not only finds more accurate and succinct constraints, but also is more robust to noise, and has lower sample complexity than the state of the art.



Paperid:907
Authors:Hai Xia, Stefan Szeider
TU Wien, TU Wien
Abstract:
Solvers for propositional satisfiability (SAT) effectively tackle hard optimization problems. However, translating to SAT can cause a significant size increase, restricting its use to smaller instances. To mitigate this, frameworks using multiple local SAT calls for gradually improving a heuristic solution have been proposed. The performance of such algorithmic frameworks heavily relies on critical parameters, including the size of selected local instances and the time allocated per SAT call. This paper examines the automated configuration of the treewidth SAT-based local improvement method (TW-SLIM) framework, which uses multiple SAT calls for computing tree decompositions of small width, a fundamental problem in combinatorial optimization. We explore various TW-SLIM configuration methods, including offline learning and real-time adjustments, significantly outperforming default settings in multi-SAT scenarios with changing problems. Building upon insights gained from offline training and real-time configurations for TW-SLIM, we propose the iterative cascading policy, a novel hybrid technique that combines both. The iterative cascading policy employs a pool of 30 configurations obtained through clustering-based offline methods, deploying them in dynamic cascades across multiple rounds. In each round, the 30 configurations are tested according to the cascading ordering, and the best tree decomposition is retained for further improvement, with the option to adjust the subsequent ordering of cascades. This iterative approach enhances the performance of TW-SLIM well beyond the baseline results, even under varying global timeouts, highlighting the effectiveness of the proposed iterative cascading policy for complex algorithmic frameworks like TW-SLIM.



Paperid:908
Authors:Suwei Yang, Kuldeep S. Meel
National University of Singapore GrabTaxi Holdings Grab-NUS AI Lab, University of Toronto National University of Singapore
Abstract:
Model counting, a fundamental task in computer science, involves determining the number of satisfying assignments to a Boolean formula, typically represented in conjunctive normal form (CNF). While model counting for CNF formulas has received extensive attention with a broad range of applications, the study of model counting for Pseudo-Boolean (PB) formulas has been relatively overlooked. Pseudo-Boolean formulas, being more succinct than propositional Boolean formulas, offer greater flexibility in representing real-world problems. Consequently, there is a crucial need to investigate efficient techniques for model counting for PB formulas. In this work, we propose the first exact Pseudo-Boolean model counter, PBCount, which relies on a knowledge compilation approach via algebraic decision diagrams. Our extensive empirical evaluation shows that PBCount can compute counts for 1513 instances while the current state-of-the-art approach could only handle 1013 instances. Our work opens up several avenues for future work in the context of model counting for PB formulas, such as the development of preprocessing techniques and exploration of approaches other than knowledge compilation.
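To make the task concrete, here is a brute-force enumeration counter for a tiny pseudo-Boolean formula. This is only an illustration of what is being counted; PBCount itself relies on knowledge compilation via algebraic decision diagrams, not enumeration, and the example constraints are invented.

```python
# Brute-force pseudo-Boolean model counter for tiny formulas (illustration only;
# PBCount uses knowledge compilation, not enumeration).
from itertools import product

# A PB constraint is (coefficients, variables, bound), meaning sum(c_i * x_i) >= bound.
constraints = [
    ([2, 3, 1], ["x1", "x2", "x3"], 3),   # 2*x1 + 3*x2 + 1*x3 >= 3
    ([1, 1, 1], ["x1", "x2", "x3"], 1),   # at least one variable is true
]
variables = ["x1", "x2", "x3"]

def satisfies(assignment):
    return all(
        sum(c * assignment[v] for c, v in zip(coeffs, vs)) >= bound
        for coeffs, vs, bound in constraints
    )

count = sum(
    satisfies(dict(zip(variables, bits)))
    for bits in product([0, 1], repeat=len(variables))
)
print("model count:", count)   # 5 of the 8 assignments satisfy both constraints
```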



Paperid:909
Authors:Haofeng Yuan, Lichang Fang, Shiji Song
Department of Automation, BNRist, Tsinghua University, Department of Automation, BNRist, Tsinghua University, Department of Automation, BNRist, Tsinghua University
Abstract:
Column generation (CG) is one of the most successful approaches for solving large-scale linear programming (LP) problems. Given an LP with a prohibitively large number of variables (i.e., columns), the idea of CG is to explicitly consider only a subset of columns and iteratively add potential columns to improve the objective value. While adding the column with the most negative reduced cost can guarantee the convergence of CG, it has been shown that adding multiple columns per iteration rather than a single column can lead to faster convergence. However, it remains a challenge to design a multiple-column selection strategy to select the most promising columns from a large number of candidate columns. In this paper, we propose a novel reinforcement-learning-based (RL) multiple-column selection strategy. To the best of our knowledge, it is the first RL-based multiple-column selection strategy for CG. The effectiveness of our approach is evaluated on two sets of problems: the cutting stock problem and the graph coloring problem. Compared to several widely used single-column and multiple-column selection strategies, our RL-based multiple-column selection strategy leads to faster convergence and achieves remarkable reductions in the number of CG iterations and runtime.
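A small sketch of the selection step being discussed, under stated assumptions: candidate columns and their reduced costs are toy data, and a greedy diversity-aware rule stands in for the learned RL policy, since the paper's policy is not reproduced here.

```python
# Sketch of a multiple-column selection step in column generation.
# Toy candidates; a greedy diversity-aware rule replaces the paper's RL policy.
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.integers(0, 2, size=(20, 8)).astype(float)   # 20 candidate columns
reduced_costs = rng.uniform(-1.0, 0.5, size=20)               # negative = improving

def select_columns(cols, rcosts, k=4, diversity_weight=0.5):
    """Pick up to k improving columns, trading off reduced cost against similarity
    to already-selected columns (to avoid adding near-duplicate columns)."""
    selected = []
    improving = [i for i in range(len(cols)) if rcosts[i] < 0]
    while improving and len(selected) < k:
        def score(i):
            sim = max((cols[i] @ cols[j] /
                       (np.linalg.norm(cols[i]) * np.linalg.norm(cols[j]) + 1e-9)
                       for j in selected), default=0.0)
            return rcosts[i] + diversity_weight * sim   # lower is better
        best = min(improving, key=score)
        selected.append(best)
        improving.remove(best)
    return selected

print("selected columns:", select_columns(candidates, reduced_costs))
```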



Paperid:910
Authors:Qi Zhang, Yi Zhou, Ashley Prater-Bennette, Lixin Shen, Shaofeng Zou
University at Buffalo, the State University of New York, University of Utah, Air Force Research Laboratory, Syracuse University, University at Buffalo, the State University of New York
Abstract:
Distributionally robust optimization (DRO) is a powerful framework for training robust models against data distribution shifts. This paper focuses on constrained DRO, which has an explicit characterization of the robustness level. Existing studies on constrained DRO mostly focus on convex loss functions and exclude the practical and challenging case of non-convex loss functions, e.g., neural networks. This paper develops a stochastic algorithm and its performance analysis for non-convex constrained DRO. The computational complexity of our stochastic algorithm at each iteration is independent of the overall dataset size, and thus is suitable for large-scale applications. We focus on uncertainty sets defined by the general Cressie-Read family of divergences, which includes χ²-divergences as a special case. We prove that our algorithm finds an ε-stationary point with improved computational complexity compared to existing methods. Our method also applies to the smoothed conditional value at risk (CVaR) DRO.
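For intuition on the CVaR case mentioned at the end, the sketch below evaluates the standard CVaR dual objective on toy per-sample losses with a simple subgradient step on the auxiliary variable. This is a generic illustration of the quantity being optimized, not the paper's algorithm or its convergence guarantees.

```python
# Sketch of the CVaR dual objective used in DRO: for losses l_i and level alpha,
#   CVaR_alpha = min_eta  eta + (1/alpha) * mean(max(l_i - eta, 0)).
# Toy per-sample losses; plain subgradient descent on eta (not the paper's method).
import numpy as np

rng = np.random.default_rng(1)
losses = rng.exponential(scale=1.0, size=1000)   # stand-in per-sample losses
alpha = 0.1                                      # focus on the worst 10% of samples

eta = float(np.mean(losses))
for step in range(500):
    grad = 1.0 - np.mean(losses > eta) / alpha   # subgradient of the dual objective
    eta -= 0.01 * grad

cvar = eta + np.mean(np.maximum(losses - eta, 0.0)) / alpha
print(f"eta ~ {eta:.3f}, CVaR_{alpha} ~ {cvar:.3f}, mean loss ~ {losses.mean():.3f}")
```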



Paperid:911
Authors:Jie Cai, Xin Wang, Haoyang Li, Ziwei Zhang, Wenwu Zhu
Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University Beijing National Research Center for Information Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University Beijing National Research Center for Information Science and Technology, Tsinghua University
Abstract:
Multimodal graph neural architecture search (MGNAS) has shown great success for automatically designing the optimal multimodal graph neural network (MGNN) architecture by leveraging multimodal representation, cross-modal information and graph structure in one unified framework. However, existing MGNAS fails to handle distribution shifts that naturally exist in multimodal graph data, since the searched architectures inevitably capture spurious statistical correlations under distribution shifts. To solve this problem, we propose a novel Out-of-distribution Generalized Multimodal Graph Neural Architecture Search (OMG-NAS) method which optimizes the MGNN architecture with respect to its performance on decorrelated OOD data. Specifically, we propose a multimodal graph representation decorrelation strategy, which encourages the searched MGNN model to output representations that eliminate spurious correlations through iteratively optimizing the feature weights and controller. In addition, we propose a global sample weight estimator that facilitates the sharing of optimal sample weights learned from existing architectures. This design promotes the effective estimation of the sample weights for candidate MGNN architectures to generate decorrelated multimodal graph representations, concentrating more on the truly predictive relations between invariant features and ground-truth labels. Extensive experiments on real-world multimodal graph datasets demonstrate the superiority of our proposed method over SOTA baselines.



Paperid:912
Authors:Shilv Cai, Liqun Chen, Sheng Zhong, Luxin Yan, Jiahuan Zhou, Xu Zou
Huazhong University of Science and Technology, Wuhan, Hubei 430074, China National Key Laboratory of Multispectral Information Intelligent Processing Technology, Wuhan, Hubei 430074, China, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China National Key Laboratory of Multispectral Information Intelligent Processing Technology, Wuhan, Hubei 430074, China, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China National Key Laboratory of Multispectral Information Intelligent Processing Technology, Wuhan, Hubei 430074, China, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China National Key Laboratory of Multispectral Information Intelligent Processing Technology, Wuhan, Hubei 430074, China, Wangxuan Institute of Computer Technology, Peking University, Beijing 100871, China, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China National Key Laboratory of Multispectral Information Intelligent Processing Technology, Wuhan, Hubei 430074, China
Abstract:
Low-light images frequently occur due to unavoidable environmental influences or technical limitations, such as insufficient lighting or limited exposure time. To achieve better visibility for visual perception, low-light image enhancement is usually adopted. Besides, lossy image compression is vital for meeting the requirements of storage and transmission in computer vision applications. To meet the above two practical demands, current solutions can be categorized into two sequential manners: ``Compress before Enhance (CbE)'' or ``Enhance before Compress (EbC)''. However, neither of them is suitable since: (1) Error accumulation in the individual models plagues sequential solutions. Especially, once low-light images are compressed by existing general lossy image compression approaches, useful information (e.g., texture details) would be lost, resulting in a dramatic performance decrease in low-light image enhancement. (2) Due to the intermediate process, the sequential solution introduces an additional burden, resulting in low efficiency. We propose a novel joint solution to simultaneously achieve a high compression rate and good enhancement performance for low-light images with much lower computational cost and fewer model parameters. We design an end-to-end trainable architecture, which includes the main enhancement branch and the signal-to-noise ratio (SNR) aware branch. Experimental results show that our proposed joint solution achieves a significant improvement over different combinations of existing state-of-the-art sequential ``Compress before Enhance'' or ``Enhance before Compress'' solutions for low-light images, which would make lossy low-light image compression more meaningful. The project is publicly available at: https://github.com/CaiShilv/Joint-IC-LL.



Paperid:913
Authors:Shuzhi Cao, Jianfei Ruan, Bo Dong, Bin Shi, Qinghua Zheng
School of Computer Science and Technology, Xi'an Jiaotong University, China Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China, School of Computer Science and Technology, Xi'an Jiaotong University, China Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China, Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China School of Distance Education, Xi'an Jiaotong University, China, School of Computer Science and Technology, Xi'an Jiaotong University, China Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China, School of Computer Science and Technology, Xi'an Jiaotong University, China Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, China
Abstract:
Tax evasion, an unlawful practice in which taxpayers deliberately conceal information to avoid paying tax liabilities, poses significant challenges for tax authorities. Effective tax evasion detection is critical for assisting tax authorities in mitigating tax revenue loss. Recently, machine-learning-based methods, particularly those employing positive and unlabeled (PU) learning, have been adopted for tax evasion detection, achieving notable success. However, these methods exhibit two major practical limitations. First, their success heavily relies on the strong assumption that the label frequency (the fraction of identified taxpayers among tax evaders) is known in advance. Second, although some methods attempt to estimate label frequency using approaches like Mixture Proportion Estimation (MPE) without making any assumptions, they subsequently construct a classifier based on the error-prone label frequency obtained from the previous estimation. This two-stage approach may not be optimal, as it neglects error accumulation in classifier training resulting from the estimation bias in the first stage. To address these limitations, we propose a novel PU learning-based tax evasion detection framework called RR-PU, which can revise the bias in a two-stage synergistic manner. Specifically, RR-PU refines the label frequency initialization by leveraging a regrouping technique to fortify the MPE perspective. Subsequently, we integrate a trainable slack variable to fine-tune the initial label frequency, concurrently optimizing this variable and the classifier to eliminate latent bias in the initial stage. Experimental results on three real-world tax datasets demonstrate that RR-PU outperforms state-of-the-art methods in tax evasion detection tasks.
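To show why the label frequency matters, the following sketch computes a standard non-negative PU risk once a label frequency is assumed known or estimated. The scores, numbers, and loss are illustrative assumptions; this is the generic PU-learning ingredient, not the RR-PU framework itself.

```python
# Sketch of the non-negative PU risk that a label frequency makes computable.
# Under the SCAR assumption, label frequency c = P(labeled | positive) gives the
# class prior pi = P(labeled) / c; the nnPU-style risk below then needs only
# labeled-positive and unlabeled samples. Toy scores, not the RR-PU model.
import numpy as np

def sigmoid_loss(scores, label):          # smooth 0-1 surrogate, label in {+1, -1}
    return 1.0 / (1.0 + np.exp(label * scores))

rng = np.random.default_rng(0)
scores_labeled = rng.normal(1.0, 1.0, 200)      # classifier scores on labeled positives
scores_unlabeled = rng.normal(-0.2, 1.2, 800)   # classifier scores on unlabeled data
label_frequency = 0.4                            # assumed known, or estimated via MPE

p_labeled = len(scores_labeled) / (len(scores_labeled) + len(scores_unlabeled))
prior = p_labeled / label_frequency              # estimated P(y = +1)

risk_pos = prior * sigmoid_loss(scores_labeled, +1).mean()
risk_neg = (sigmoid_loss(scores_unlabeled, -1).mean()
            - prior * sigmoid_loss(scores_labeled, -1).mean())
pu_risk = risk_pos + max(0.0, risk_neg)          # non-negative correction
print(f"estimated prior: {prior:.3f}, PU risk: {pu_risk:.3f}")
```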



Paperid:914
Authors:Yuwei Cao, Hao Peng, Zhengtao Yu, Philip S. Yu
University of Illinois Chicago, Beihang University, Kunming University of Science and Technology, University of Illinois Chicago
Abstract:
As a trending approach for social event detection, graph neural network (GNN)-based methods enable a fusion of natural language semantics and the complex social network structural information, thus showing SOTA performance. However, GNN-based methods can miss useful message correlations. Moreover, they require manual labeling for training and predetermining the number of events for prediction. In this work, we address social event detection via graph structural entropy (SE) minimization. While keeping the merits of the GNN-based methods, the proposed framework, HISEvent, constructs more informative message graphs, is unsupervised, and does not require the number of events to be given a priori. Specifically, we incrementally explore the graph neighborhoods using 1-dimensional (1D) SE minimization to supplement the existing message graph with edges between semantically related messages. We then detect events from the message graph by hierarchically minimizing 2-dimensional (2D) SE. Our proposed 1D and 2D SE minimization algorithms are customized for social event detection and effectively tackle the efficiency problem of the existing SE minimization algorithms. Extensive experiments show that HISEvent consistently outperforms GNN-based methods and achieves the new SOTA for social event detection under both closed- and open-set settings while being efficient and robust.
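As a reference point for the 1D quantity mentioned above, the sketch below computes the one-dimensional structural entropy of a toy graph from its degree distribution. It only shows the measurement; how HISEvent's customized minimization explores neighborhoods is not reproduced here.

```python
# Sketch of one-dimensional structural entropy: for an undirected graph with
# degrees d_i and total edge weight m,  H1 = -sum_i (d_i / 2m) * log2(d_i / 2m).
import math
from collections import defaultdict

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]   # toy message graph

def structural_entropy_1d(edge_list):
    degree = defaultdict(float)
    for u, v in edge_list:
        degree[u] += 1.0
        degree[v] += 1.0
    vol = sum(degree.values())                 # equals 2m for an unweighted graph
    return -sum((d / vol) * math.log2(d / vol) for d in degree.values())

print("H1 =", round(structural_entropy_1d(edges), 4))
# Candidate edges between related messages can be compared by how they change
# such an entropy-based objective; this sketch only evaluates the measure.
```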



Paperid:915
Authors:Shreyas Chaudhari, David Arbour, Georgios Theocharous, Nikos Vlassis
University of Massachusetts Amherst, Adobe Research, Adobe Research, Adobe Research
Abstract:
Recommendation strategies are typically evaluated by using previously logged data, employing off-policy evaluation methods to estimate their expected performance. However, for strategies that present users with slates of multiple items, the resulting combinatorial action space renders many of these methods impractical. Prior work has developed estimators that leverage the structure in slates to estimate the expected off-policy performance, but the estimation of the entire performance distribution remains elusive. Estimating the complete distribution allows for a more comprehensive evaluation of recommendation strategies, particularly along the axes of risk and fairness that employ metrics computable from the distribution. In this paper, we propose an estimator for the complete off-policy performance distribution for slates and establish conditions under which the estimator is unbiased and consistent. This builds upon prior work on off-policy evaluation for slates and off-policy distribution estimation in reinforcement learning. We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures.
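A minimal sketch of the general idea of estimating a performance distribution (not just a mean) from logged slates via importance weighting, assuming per-slot factored logging and target policies so the slate propensity is a product over slots. This is a generic self-normalized IPS illustration on toy data, not the paper's estimator.

```python
# Sketch: off-policy estimation of a reward *distribution* for slates, assuming
# per-slot factored policies (toy logged data; not the paper's estimator).
import numpy as np

rng = np.random.default_rng(0)
n_logs, n_slots, n_items = 5000, 3, 10

# Logged behavior: uniform logging policy; reward depends on the shown items.
logged_slates = rng.integers(0, n_items, size=(n_logs, n_slots))
rewards = (logged_slates == 0).sum(axis=1) + rng.normal(0, 0.1, n_logs)

logging_prob = np.full((n_slots, n_items), 1.0 / n_items)     # uniform per slot
target_prob = np.full((n_slots, n_items), 0.5 / (n_items - 1))
target_prob[:, 0] = 0.5                                        # target favors item 0

# Per-log importance weight = product over slots of target/logging probabilities.
slot_idx = np.arange(n_slots)
weights = np.prod(target_prob[slot_idx, logged_slates]
                  / logging_prob[slot_idx, logged_slates], axis=1)
weights /= weights.mean()                                      # self-normalized IPS

# Weighted empirical CDF of reward approximates the target policy's reward distribution.
order = np.argsort(rewards)
cdf = np.cumsum(weights[order]) / weights.sum()
median_idx = np.searchsorted(cdf, 0.5)
print("estimated mean:", float(np.average(rewards, weights=weights)))
print("estimated median:", float(rewards[order][median_idx]))
```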



Paperid:916
Authors:Jiayuan Chen, Kehan Guo, Zhen Liu, Olexandr Isayev, Xiangliang Zhang
The Ohio State University, University of Notre Dame, Carnegie Mellon University, Carnegie Mellon University, University of Notre Dame
Abstract:
Predicting chemical reaction yields is pivotal for efficient chemical synthesis, an area that focuses on the creation of novel compounds for diverse uses. Yield prediction demands accurate representations of reactions for forecasting practical transformation rates. Yet, the uncertainty present in real-world situations prevents current models from excelling in this task, owing to the high sensitivity of yield activities and the uncertainty in yield measurements. Existing models often utilize single-modal feature representations, such as molecular fingerprints, SMILES sequences, or molecular graphs, which are not sufficient to capture the complex interactions and dynamic behavior of molecules in reactions. In this paper, we present an advanced Uncertainty-Aware Multimodal model (UAM) to tackle these challenges. Our approach seamlessly integrates data sources from multiple modalities by encompassing sequence representations, molecular graphs, and expert-defined chemical reaction features for a comprehensive representation of reactions. Additionally, we address both model-based and data-based uncertainty, refining the model's predictive capability. Extensive experiments on three datasets, including two high-throughput experiment (HTE) datasets and one chemist-constructed amide coupling reaction dataset, demonstrate that UAM outperforms the state-of-the-art methods. The code and used datasets are available at https://github.com/jychen229/Multimodal-reaction-yield-prediction.



Paperid:917
Authors:Junyang Chen, Guoxuan Zou, Pan Zhou, Wu Yirui, Zhenghan Chen, Houcheng Su, Huan Wang, Zhiguo Gong
Shenzhen University, Shenzhen University, Huazhong University of Science and Technology, Hohai University, Microsoft, University of Macau, Huazhong Agricultural University, University of Macau
Abstract:
Sequential recommendation plays a significant role in daily recommendation systems, such as e-commerce platforms like Amazon and Taobao. However, even with the advent of large models, these platforms often face sparsity issues in the historical browsing records of individual users due to new users joining or the introduction of new products. As a result, existing sequence recommendation algorithms may not perform well. To address this, sequence-based data augmentation methods have garnered attention. Existing sequence enhancement methods typically rely on augmenting existing data, employing techniques like cropping, masking prediction, random reordering, and random replacement of the original sequence. While these methods have shown improvements, they often overlook the exploration of the deep embedding space of the sequence. To tackle these challenges, we propose a Sparse Enhanced Network (SparseEnNet), which is a robust adversarial generation method. SparseEnNet aims to fully explore the hidden space in sequence recommendation, generating more robust enhanced items. Additionally, we adopt an adversarial generation method, allowing the model to differentiate between data augmentation categories and achieve better prediction performance for the next item in the sequence. Experiments have demonstrated that our method achieves a remarkable 4-14% improvement over existing methods when evaluated on real-world datasets. (https://github.com/junyachen/SparseEnNet)



Paperid:918
Authors:Lanlan Chen, Kai Wu, Jian Lou, Jing Liu
Guangzhou Institute of Technology, Xidian University, School of Artificial Intelligence, Xidian University, Zhejiang University, Guangzhou Institute of Technology, Xidian University
Abstract:
Modeling continuous-time dynamics constitutes a foundational challenge, and uncovering inter-component correlations within complex systems holds promise for enhancing the efficacy of dynamic modeling. The prevailing approach of integrating graph neural networks with ordinary differential equations has demonstrated promising performance. However, it disregards the crucial signed information potentially present on graphs, impeding its capacity to accurately capture real-world phenomena and leading to subpar outcomes. In response, we introduce a novel approach: a signed graph neural ordinary differential equation, adeptly addressing the limitation of miscapturing signed information. Our proposed solution boasts both flexibility and efficiency. To substantiate its effectiveness, we seamlessly integrate our devised strategies into three preeminent graph-based dynamic modeling frameworks: graph neural ordinary differential equations, graph neural controlled differential equations, and graph recurrent neural networks. Rigorous assessments encompass three intricate dynamic scenarios from physics and biology, as well as scrutiny across four authentic real-world traffic datasets. Empirical results show that our approach remarkably outperforms the three baseline frameworks, underscoring the substantial performance enhancements it facilitates. Our code can be found at https://github.com/beautyonce/SGODE.
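A minimal sketch of the general flavor of a signed graph neural ODE: node states evolve under message passing through a signed adjacency matrix, here integrated with explicit Euler steps on a toy graph. The dynamics function and sizes are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of signed graph ODE dynamics: dX/dt = tanh(A_signed @ X @ W),
# where A_signed carries +1/-1 edge signs. Euler integration on a toy graph.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim = 5, 4

A_signed = np.zeros((n_nodes, n_nodes))
A_signed[0, 1] = A_signed[1, 0] = +1.0      # cooperative / positive edge
A_signed[1, 2] = A_signed[2, 1] = -1.0      # antagonistic / negative edge
A_signed[3, 4] = A_signed[4, 3] = +1.0

W = rng.normal(scale=0.3, size=(dim, dim))
X = rng.normal(size=(n_nodes, dim))         # initial node states

def dynamics(X):
    return np.tanh(A_signed @ X @ W)        # signed message passing

dt, steps = 0.05, 200
for _ in range(steps):                      # explicit Euler integration of the ODE
    X = X + dt * dynamics(X)

print("final node states:\n", np.round(X, 3))
```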



Paperid:919
Authors:Yankai Chen, Yixiang Fang, Qiongyan Wang, Xin Cao, Irwin King
The Chinese University of Hong Kong, The Chinese University of Hong Kong, Shenzhen, University of Copenhagen, University of New South Wales, The Chinese University of Hong Kong
Abstract:
The classic problem of node importance estimation has been conventionally studied with homogeneous network topology analysis. To deal with practical network heterogeneity, a few recent methods employ graph neural models to automatically learn diverse sources of information. However, the major concern is that their fully adaptive learning process may lead to insufficient information exploration, thereby reducing the problem to isolated node value prediction with underperformance and limited interpretability. In this work, we propose a novel learning framework namely SKES. Different from previous automatic learning designs, SKES exploits heterogeneous structural knowledge to enrich the informativeness of node representations. Then, based on a sufficiently uninformative reference, SKES estimates the importance value for any input node by quantifying its informativeness disparity against the reference. This establishes an interpretable node importance computation paradigm. Furthermore, SKES dives deep into the understanding that "nodes with similar characteristics are prone to have similar importance values", whilst guaranteeing that such informativeness disparity between any different nodes is orderly reflected by the embedding distance of their associated latent features. Extensive experiments on three widely-evaluated benchmarks demonstrate the performance superiority of SKES over several recent competing methods.



Paperid:920
Authors:Zhen Chen, Dalin Zhang, Shanshan Feng, Kaixuan Chen, Lisi Chen, Peng Han, Shuo Shang
University of Electronic Science and Technology of China, Aalborg University, Denmark, Centre for Frontier AI Research, A*STAR, Singapore Institute of High-Performance Computing, A*STAR, Singapore, Aalborg University, Denmark, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Trajectory similarity computation serves as a fundamental functionality of various spatial information applications. Although existing deep learning similarity computation methods offer better efficiency and accuracy than non-learning solutions, they are still immature in trajectory embedding and suffer from poor generality and heavy preprocessing for training. Targeting these limitations, we propose a novel framework named KGTS based on knowledge graph grid embedding, prompt trajectory embedding, and unsupervised contrastive learning for improved trajectory similarity computation. Specifically, we first embed map grids with a GRot embedding method to vigorously grasp the neighbouring relations of grids. Then, a prompt trajectory embedding network incorporates the resulting grid embedding and extracts trajectory structure and point order information. It is trained by unsupervised contrastive learning, which not only alleviates the heavy preprocessing burden but also provides exceptional generality with creatively designed strategies for positive sample generation. The prompt trajectory embedding adopts a customized prompt paradigm to mitigate the gap between the grid embedding and the trajectory embedding. Extensive experiments on two real-world trajectory datasets demonstrate the superior performance of KGTS over state-of-the-art methods.



Paperid:921
Authors:Zhengyu Chen, Teng Xiao, Kun Kuang, Zheqi Lv, Min Zhang, Jinluan Yang, Chengqiang Lu, Hongxia Yang, Fei Wu
Institute of Artificial Intelligence, Zhejiang University Shanghai Institute for Advanced Study, Zhejiang University, The Pennsylvania State University, Institute of Artificial Intelligence, Zhejiang University Shanghai Institute for Advanced Study, Zhejiang University, Institute of Artificial Intelligence, Zhejiang University Shanghai Institute for Advanced Study, Zhejiang University, Institute of Artificial Intelligence, Zhejiang University Shanghai Institute for Advanced Study, Zhejiang University, Institute of Artificial Intelligence, Zhejiang University Shanghai Institute for Advanced Study, Zhejiang University, Institute of Artificial Intelligence, Zhejiang University Shanghai Institute for Advanced Study, Zhejiang University, DAMO Academy, Alibaba Group, DAMO Academy, Alibaba Group, Institute of Artificial Intelligence, Zhejiang University Shanghai Institute for Advanced Study, Zhejiang University
Abstract:
Graph Neural Networks (GNNs) show promising results for graph tasks. However, existing GNNs' generalization ability will degrade when there exist distribution shifts between testing and training graph data. The fundamental reason for the severe degeneration is that most GNNs are designed based on the I.I.D. hypothesis. In such a setting, GNNs tend to exploit subtle statistical correlations existing in the training set for predictions, even though they are spurious correlations. In this paper, we study the problem of the generalization ability of GNNs in Out-Of-Distribution (OOD) settings. To solve this problem, we propose Learning to Reweight for Generalizable Graph Neural Network (L2R-GNN) to enhance the generalization ability, achieving satisfactory performance on unseen testing graphs that have different distributions from the training graphs. We propose a novel nonlinear graph decorrelation method, which can substantially improve the out-of-distribution generalization ability and compares favorably to previous methods in restraining the over-reduced sample size. The variables of graph representation are clustered based on the stability of their correlations, and the graph decorrelation method learns weights to remove correlations between the variables of different clusters rather than any two variables. Besides, we introduce an effective stochastic algorithm based on bi-level optimization for the L2R-GNN framework, which enables simultaneously learning the optimal weights and GNN parameters, and avoids the over-fitting issue. Experiments show that L2R-GNN greatly outperforms baselines on various graph prediction benchmarks under distribution shifts.



Paperid:922
Authors:Hui Cui, Lihai Zhao, Fengling Li, Lei Zhu, Xiaohui Han, Jingjing Li
Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences) Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, University of Science and Technology Beijing, University of Technology Sydney, Tongji University, Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences) Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science Quan Cheng Laboratory, University of Electronic Science and Technology of China
Abstract:
Unsupervised domain adaptive hashing is a highly promising research direction within the field of retrieval. It aims to transfer valuable insights from the source domain to the target domain while maintaining high storage and retrieval efficiency. Despite its potential, this field remains relatively unexplored. Previous methods usually lead to unsatisfactory retrieval performance, as they frequently directly apply slightly modified domain adaptation algorithms to the hash learning framework, or pursue domain alignment within the Hamming space characterized by limited semantic information. In this paper, we propose a simple yet effective approach named Comparative Prototype Hashing (CPH) for unsupervised domain adaptive image retrieval. We establish a domain-shared unit hypersphere space through prototype contrastive learning and then obtain the Hamming hypersphere space via mapping from the shared hypersphere. This strategy achieves a cohesive synergy between learning uniformly distributed and category conflict-averse feature representations, eliminating domain discrepancies, and facilitating hash code learning. Moreover, by leveraging dual-domain information to supervise the entire hashing model training process, we can generate hash codes that retain inter-sample similarity relationships within both domains. Experimental results validate that our CPH significantly outperforms the state-of-the-art counterparts across multiple cross-domain and single-domain retrieval tasks. Notably, on the Office-Home and Office-31 datasets, CPH achieves average performance improvements of 19.29% and 13.85% on cross-domain retrieval tasks compared to the second-best results, respectively. The source codes of our method are available at: https://github.com/christinecui/CPH.
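A toy sketch of the general recipe involved: place embeddings on a unit hypersphere, pull them toward class prototypes with a contrastive loss, and binarize to obtain hash codes. The projection, temperature, and loss below are illustrative stand-ins, not CPH's actual architecture or objective.

```python
# Sketch: hypersphere embeddings + prototype-contrastive loss + sign-based hashing.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_cls, bits = 12, 16, 3, 8
feats = rng.normal(size=(n, d))
labels = np.arange(n) % n_cls                      # toy class labels
proj = rng.normal(scale=0.1, size=(d, bits))

z = feats @ proj
z /= np.linalg.norm(z, axis=1, keepdims=True)      # unit hypersphere embeddings

prototypes = np.stack([z[labels == c].mean(axis=0) for c in range(n_cls)])
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

logits = z @ prototypes.T / 0.1                    # temperature-scaled similarities
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
proto_contrastive_loss = -log_probs[np.arange(n), labels].mean()

hash_codes = (z > 0).astype(int)                   # sign-based mapping to Hamming space
print("prototype-contrastive loss:", round(float(proto_contrastive_loss), 3))
print("example hash code:", hash_codes[0])
```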



Paperid:923
Authors:Wanyun Cui, Linqiu Zhang
Shanghai University of Finance and Economics, Shanghai University of Finance and Economics
Abstract:
The ability to combine multiple pieces of existing knowledge to infer new knowledge is both crucial and challenging. In this paper, we explore how facts of various entities are combined in the context of knowledge graph completion (KGC). We use composite reasoning to unify the views from different KGC models, including translational models, tensor factorization (TF)-based models, instance-based learning models, and KGC regularizers. Moreover, our comprehensive examination of composite reasoning revealed an unexpected phenomenon: certain TF-based models learn embeddings with erroneous composite reasoning, which ultimately violates their fundamental collaborative filtering assumption and reduces their effectiveness. This motivates us to reduce their composition error. Empirical evaluations demonstrate that mitigating the composition risk not only enhances the performance of TF-based models across all tested settings, but also surpasses or is competitive with the state-of-the-art performance on two out of four benchmarks.



Paperid:924
Authors:Joscha Cüppers, Paul Krieger, Jilles Vreeken
CISPA Helmholtz Center for Information Security, Saarland University, CISPA Helmholtz Center for Information Security
Abstract:
Summarizing sequential data with serial episodes allows non-trivial insight into the data generating process. Existing methods penalize gaps in pattern occurrences equally, regardless of where in the pattern these occur. This results in a strong bias against patterns with long inter-event delays; in addition, regularity in terms of delays is neither rewarded nor discovered, even though both aspects provide key insight. In this paper we tackle both these problems by explicitly modeling inter-event delay distributions. That is, we are not only interested in discovering the patterns, but also in describing how many time steps typically occur between their individual events. We formalize the problem in terms of the Minimum Description Length principle, by which we say the best set of patterns is the one that compresses the data best. The resulting optimization problem does not lend itself to exact optimization, and hence we propose Hopper to heuristically mine high quality patterns. Extensive experiments show that Hopper efficiently recovers the ground truth, discovers meaningful patterns from real-world data, and outperforms existing methods in discovering long-delay patterns.



Paperid:925
Authors:Yiqi Dong, Dongxiao He, Xiaobao Wang, Youzhu Jin, Meng Ge, Carl Yang, Di Jin
School of New Media and Communication, Tianjin University, Tianjin, China, School of New Media and Communication, Tianjin University, Tianjin, China Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China, Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China, Beijing-Dublin International College, Beijing University of Technology, Beijing, China, Saw Swee Hock School of Public Health, National University of Singapore, Singapore, Department of Computer Science, Emory University, Georgia, USA, School of New Media and Communication, Tianjin University, Tianjin, China Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China
Abstract:
In the current Internet landscape, the rampant spread of fake news, particularly in the form of multi-modal content, poses a great social threat. While automatic multi-modal fake news detection methods have shown promising results, the lack of explainability remains a significant challenge. Existing approaches provide superficial explainability by displaying learned important components or views from well-trained networks, but they often fail to uncover the implicit deceptive patterns that reveal how fake news is fabricated. To address this limitation, we begin by predefining three typical deceptive patterns, namely image manipulation, cross-modal inconsistency, and image repurposing, which shed light on the mechanisms underlying fake news fabrication. Then, we propose a novel Neuro-Symbolic Latent Model called NSLM, which not only derives accurate judgments on the veracity of news but also uncovers the implicit deceptive patterns as explanations. Specifically, the existence of each deceptive pattern is expressed as a two-valued learnable latent variable, which is acquired through amortized variational inference and weak supervision based on symbolic logic rules. Additionally, we devise pseudo-siamese networks to capture distinct deceptive patterns effectively. Experimental results on two real-world datasets demonstrate that our NSLM achieves the best performance in fake news detection while providing insightful explanations of deceptive patterns.



Paperid:926
Authors:Yingpeng Du, Di Luo, Rui Yan, Xiaopei Wang, Hongzhi Liu, Hengshu Zhu, Yang Song, Jie Zhang
School of Computer Science and Engineering, Nanyang Technological University, Singapore School of Software and Microelectronics, Peking University, Beijing, China, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China, School of Languages and Communication Studies, Beijing Jiaotong University, Beijing, China, School of Software and Microelectronics, Peking University, Beijing, China, Career Science Lab, BOSS Zhipin, Beijing, China, NLP Center, BOSS Zhipin, Beijing, China, School of Computer Science and Engineering, Nanyang Technological University, Singapore
Abstract:
Recommending suitable jobs to users is a critical task in online recruitment platforms. However, existing job recommendation methods encounter challenges such as the low quality of users' resumes, which hampers their accuracy and practical effectiveness. With the rapid development of large language models (LLMs), utilizing the rich external knowledge encapsulated within them, as well as their powerful reasoning capabilities, is a promising way to complete users' resumes for more accurate recommendations. However, directly leveraging LLMs to enhance recommendation results is not a one-size-fits-all solution, as LLMs may suffer from fabricated generation and few-shot problems, which degrade the quality of resume completion. In this paper, we propose a novel LLM-based approach for job recommendation. To alleviate the limitation of fabricated generation for LLMs, we extract accurate and valuable information beyond users' self-descriptions, which helps the LLMs better profile users for resume completion. Specifically, we not only extract users' explicit properties (e.g., skills, interests) from their self-descriptions but also infer users' implicit characteristics from their behaviors for more accurate and meaningful resume completion. Nevertheless, some users still suffer from few-shot problems, which arise due to scarce interaction records, leading to limited guidance for high-quality resume generation. To address this issue, we propose aligning unpaired low-quality resumes with high-quality generated resumes by Generative Adversarial Networks (GANs), which can refine the resume representations for better recommendation results. Extensive experiments on three large real-world recruitment datasets demonstrate the effectiveness of our proposed method.



Paperid:927
Authors:Liang Duan, Xiang Chen, Wenjie Liu, Daliang Liu, Kun Yue, Angsheng Li
Yunnan University, Yunnan University, Yunnan University, Yunnan University, Yunnan University, Yunnan University Beihang University
Abstract:
As one of the most common tasks in graph data analysis, node classification is frequently solved by using graph structure learning (GSL) techniques to optimize graph structures and learn suitable graph neural networks. Most of the existing GSL methods focus on fusing different structural features (basic views) extracted from the graph, but very little graph semantics, such as hierarchical communities, has been incorporated. Thus, they might be insufficient when dealing with graphs containing noise from real-world complex systems. To address this issue, we propose a novel and effective GSL framework for node classification based on structural information theory. Specifically, we first prove that an encoding tree with minimal structural entropy could contain sufficient information for node classification and eliminate redundant noise via the graph's hierarchical abstraction. Then, we provide an efficient algorithm for constructing the encoding tree to enhance the basic views. Combining the community influence deduced from the encoding tree and the prediction confidence of each view, we further fuse the enhanced views to generate the optimal structure. Finally, we conduct extensive experiments on a variety of datasets. The results demonstrate that our method outperforms the state-of-the-art competitors on effectiveness and robustness.



Paperid:928
Authors:Cunhang Fan, Yujie Chen, Jun Xue, Yonghui Kong, Jianhua Tao, Zhao Lv
Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Department of Automation, Tsinghua University Beijing National Research Center for lnformation Science and Technology, Tsinghua University, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University
Abstract:
In recent years, knowledge graph completion (KGC) models based on pre-trained language models (PLMs) have shown promising results. However, the large number of parameters and high computational cost of PLMs pose challenges for their application in downstream tasks. This paper proposes a progressive distillation method based on masked generation features for the KGC task, aiming to significantly reduce the complexity of pre-trained models. Specifically, we perform pre-distillation on the PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models. However, traditional feature distillation suffers from the limitation of having a single representation of information in teacher models. To solve this problem, we propose masked generation of teacher-student features, which contain richer representation information. Furthermore, there is a significant gap in representation ability between teacher and student. Therefore, we design a progressive distillation method to distill student models at each grade level, enabling efficient knowledge transfer from teachers to students. The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods. Furthermore, in the progressive distillation stage, the model significantly reduces the model parameters while maintaining a certain level of performance. Specifically, the model parameters of the lower-grade student model are reduced by 56.7% compared to the baseline.



Paperid:929
Authors:Jinyong Fan, Yanyan Shen
Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Stock price forecasting is a fundamental yet challenging task in quantitative investment. Various researchers have developed a combination of neural network models (e.g., RNNs, GNNs, Transformers) for capturing complex indicator, temporal and stock correlations of the stock data. While complex architectures are highly expressive, they are often difficult to optimize and the performances are often compromised by the limited stock data. In this paper, we propose a simple MLP-based architecture named StockMixer which is easy to optimize and enjoys strong predictive performance. StockMixer performs indicator mixing, followed by time mixing, and finally stock mixing. Unlike the standard MLP-based mixing, we devise the time mixing to exchange multi-scale time patch information and realize the stock mixing by exploiting stock-to-market and market-to-stock influences explicitly. Extensive experiments on real stock benchmarks demonstrate our proposed StockMixer outperforms various state-of-the-art forecasting methods with a notable margin while reducing memory usage and runtime cost. Code is available at https://github.com/SJTU-Quant/StockMixer.
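A toy sketch of axis-wise MLP mixing over a (stock, time, indicator) tensor, mixing one axis at a time with residual connections. It illustrates the generic mixing pattern only; StockMixer's multi-scale time patches and market-aware stock mixing are not reproduced, and all sizes below are made up.

```python
# Toy sketch of axis-wise MLP mixing: one MLP mixes indicators, one mixes time
# steps, one mixes stocks (forward pass only, random weights).
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_steps, n_ind = 6, 16, 8
x = rng.normal(size=(n_stocks, n_steps, n_ind))

def mlp(dim_in, dim_hidden, rng):
    w1 = rng.normal(scale=0.1, size=(dim_in, dim_hidden))
    w2 = rng.normal(scale=0.1, size=(dim_hidden, dim_in))
    return lambda z: np.maximum(z @ w1, 0.0) @ w2          # Linear-ReLU-Linear

mix_indicator = mlp(n_ind, 16, rng)
mix_time = mlp(n_steps, 32, rng)
mix_stock = mlp(n_stocks, 16, rng)

h = x + mix_indicator(x)                                    # mix along indicator axis
h = h + np.swapaxes(mix_time(np.swapaxes(h, 1, 2)), 1, 2)   # mix along time axis
h = h + np.moveaxis(mix_stock(np.moveaxis(h, 0, 2)), 2, 0)  # mix along stock axis

scores = h.mean(axis=(1, 2))                                # one score per stock
print("per-stock scores:", np.round(scores, 3))
```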



Paperid:930
Authors:Dazhi Fu, Zhao Zhang, Jicong Fan
The Chinese University of Hong Kong, Shenzhen University of Electronic Science and Technology of China, Hefei University of Technology, The Chinese University of Hong Kong, Shenzhen Shenzhen Research Institute of Big Data
Abstract:
This work presents a novel method called dense projection for unsupervised anomaly detection (DPAD). The main idea is to maximize the local density of (normal) training data and then determine whether a test sample is anomalous or not by evaluating its density. Specifically, DPAD uses a deep neural network to learn locally dense representations of normal data. Since density estimation is computationally expensive, we minimize the local distances of the representations in an iterative reweighting manner, where the weights are updated adaptively and the parameters are regularized to avoid model collapse (all representations collapsing to a single point). Compared with many state-of-the-art methods of anomaly detection, our DPAD does not rely on any assumption about the distribution or spatial structure of the normal data and representations. Moreover, we provide theoretical guarantees for the effectiveness of DPAD. The experiments show that our method DPAD is effective not only in traditional one-class classification problems but also in scenarios with complex normal data composed of multiple classes.
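A sketch in the spirit of the description above: embeddings are pulled toward their neighbors with adaptively re-estimated weights, while a whitening-style regularizer prevents all points from collapsing to one point. Free embeddings, the specific weight rule, and the regularizer are illustrative choices, not DPAD's network or its exact objective.

```python
# Sketch of an iteratively reweighted local-distance objective with an
# anti-collapse regularizer (toy embeddings, manual gradients).
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 2
Z = rng.normal(size=(n, d))                      # embeddings of (normal) training data

lam, lr = 1.0, 0.01
for it in range(200):
    # Adaptive weights: closer pairs get larger weight, re-estimated every iteration.
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    W = np.exp(-dists)
    np.fill_diagonal(W, 0.0)

    # Weighted local-distance term via the graph Laplacian: sum_ij W_ij ||z_i - z_j||^2.
    L = np.diag(W.sum(axis=1)) - W
    grad_local = 4.0 * L @ Z

    # Anti-collapse regularizer: keep the embedding covariance close to identity.
    C = Z.T @ Z / n
    grad_reg = (4.0 / n) * Z @ (C - np.eye(d))

    Z -= lr * (grad_local / n + lam * grad_reg)

# At test time, a point lying in a sparse region (far from its nearest training
# embeddings) would be flagged as anomalous.
nn_dist = np.sort(np.linalg.norm(Z[:, None] - Z[None, :], axis=-1), axis=1)[:, 1]
print("mean nearest-neighbor distance:", float(nn_dist.mean()))
```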



Paperid:931
Authors:En-Hao Gao, Yu-Xuan Huang, Wen-Chao Hu, Xin-Hao Zhu, Wang-Zhou Dai
Nanjing University, Nanjing University, Nanjing University, Nanjing University, Nanjing University
Abstract:
Optical Character Recognition (OCR) of historical document images remains a challenging task because of the distorted input images, the extensive number of uncommon characters, and the scarcity of labeled data, which impede modern deep learning-based OCR techniques from achieving good recognition accuracy. Meanwhile, there exists a substantial amount of expert knowledge that can be utilized in this task. However, such knowledge is usually complicated and can only be accurately expressed with formal languages such as first-order logic (FOL), which is difficult to directly integrate into deep learning models. This paper proposes KESAR, a novel Knowledge-Enhanced Document Segmentation And Recognition method for historical document images based on the Abductive Learning (ABL) framework. The segmentation and recognition models are enhanced by incorporating background knowledge for character extraction and prediction, followed by an efficient joint optimization of both models. We validate the effectiveness of KESAR on historical document datasets. The experimental results demonstrate that our method can simultaneously utilize knowledge-driven reasoning and data-driven learning, which outperforms the current state-of-the-art methods.



Paperid:932
Authors:Weibo Gao, Qi Liu, Hao Wang, Linan Yue, Haoyang Bi, Yin Gu, Fangzhou Yao, Zheng Zhang, Xin Li, Yuanjing He
University of Science and Technology of China Anhui Province Key Laboratory of Big Data Analysis and Application & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Anhui Province Key Laboratory of Big Data Analysis and Application & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Anhui Province Key Laboratory of Big Data Analysis and Application & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Anhui Province Key Laboratory of Big Data Analysis and Application & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Anhui Province Key Laboratory of Big Data Analysis and Application & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Anhui Province Key Laboratory of Big Data Analysis and Application & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Anhui Province Key Laboratory of Big Data Analysis and Application & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Anhui Province Key Laboratory of Big Data Analysis and Application & State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China Artificial Intelligence Research Institute, iFLYTEK Co., Ltd, The Open University of China
Abstract:
Cognitive diagnosis seeks to estimate the cognitive states of students by exploring their logged practice quiz data. It plays a pivotal role in personalized learning guidance within intelligent education systems. In this paper, we focus on an important, practical, yet often underexplored task: domain-level zero-shot cognitive diagnosis (DZCD), which arises due to the absence of student practice logs in newly launched domains. Recent cross-domain diagnostic models have been demonstrated to be a promising strategy for DZCD. These methods primarily focus on how to transfer student states across domains. However, they might inadvertently incorporate non-transferable information into student representations, thereby limiting the efficacy of knowledge transfer. To tackle this, we propose Zero-1-to-3, a domain-level zero-shot cognitive diagnosis framework via one batch of early-bird students towards three diagnostic objectives. Our approach initiates with pre-training a diagnosis model with dual regularizers, which decouples student states into domain-shared and domain-specific parts. The shared cognitive signals can be transferred to the target domain, enriching the cognitive priors for the new domain, which ensures the cognitive state propagation objective. Subsequently, we devise a strategy to generate simulated practice logs for cold-start students through analyzing the behavioral patterns of early-bird students, fulfilling the domain-adaptation goal. Consequently, we refine the cognitive states of cold-start students as diagnostic outcomes via virtual data, aligning with the diagnosis-oriented goal. Finally, extensive experiments on six real-world datasets highlight the efficacy of our model for DZCD and its practical application in question recommendation. The code is publicly available at https://github.com/bigdata-ustc/Zero-1-to-3.



Paperid:933
Authors:Zhuocheng Gong, Yang Song, Tao Zhang, Ji-Rong Wen, Dongyan Zhao, Rui Yan
Wangxuan Institute of Computer Technology, Peking University, BOSS Zhipin, BOSS Zhipin, Gaoling School of Artificial Intelligence, Renmin University of China, Wangxuan Institute of Computer Technology, Peking University National Key Laboratory of General Artificial Intelligence Beijing Institute for General Artificial Intelligence, Gaoling School of Artificial Intelligence, Renmin University of China
Abstract:
We are again confronted with one of the most vexing aspects of the advancement of technology: automation and AI technology cause the devaluation of human labor, resulting in unemployment. With this background, automatic person-job fit systems are promising solutions to promote the employment rate. The purpose of person-job fit is to calculate a matching score between the job seeker's resume and the job posting, determining whether the job seeker is suitable for the position. In this paper, we propose a new approach to person-job fit that characterizes the hidden preference derived from the job seeker's career path. We categorize and utilize three types of preferences in the career path: consistency, likeness, and continuity. We prove that understanding the career path enables us to provide more appropriate career suggestions to job seekers. To demonstrate the practical value of our proposed model, we conduct extensive experiments on real-world data extracted from an online recruitment platform and then present detailed cases to show how the career path matters in person-job fit.



Paperid:934
Authors:Poonam Goyal, Arshveer Kaur, Arvind Ram, Navneet Goyal
BITS Pilani, BITS Pilani, BITS Pilani, BITS Pilani
Abstract:
Satellite data, bolstered by their increasing accessibility, are leading to many endeavors of automated monitoring of the earth's surface for various applications. Such applications demand high spatial resolution images at a temporal resolution of a few days, which entails the challenge of processing a huge volume of image time series data. To overcome this computing bottleneck, we present PatchNet, a bespoke adaptation of beam search and the attention mechanism. PatchNet is an automated patch selection neural network that requires only a partial spatial traversal of an image time series and yet achieves impressive results. Satellite systems face a trade-off between spatial and temporal resolutions due to budget/technical constraints: e.g., Landsat-8/9 and Sentinel-2 have high spatial resolution, whereas MODIS has high temporal resolution. To deal with the limitation of coarse temporal resolution, we propose FuSITSNet, a twofold feature-based generic fusion model with multimodal learning in a contrastive setting. It produces a learned representation after fusion of two satellite image time series, leveraging the finer spatial resolution of Landsat and the finer temporal resolution of MODIS. The patch alignment module of FuSITSNet aligns the PatchNet-processed patches of Landsat-8 with the corresponding MODIS regions to incorporate its finer-resolution temporal features. The untraversed patches are handled by the cross-modality attention, which highlights additional hot spot features from the two modalities. We conduct extensive experiments on more than 2,000 US counties for crop yield, snow cover, and solar energy prediction and show that even one-fourth spatial processing of image time series produces state-of-the-art results. FuSITSNet outperforms the predictions of single modality and of data obtained using existing generative fusion models, and allows for monitoring of dynamic phenomena using freely accessible images, thereby unlocking new opportunities.



Paperid:935
Authors:Zhiqiang Guo, Jianjun Li, Guohui Li, Chaoyang Wang, Si Shi, Bin Ruan
School of Computer Science and Technology, Huazhong University of Science and Technology, School of Computer Science and Technology, Huazhong University of Science and Technology, School of Software Engineering, Huazhong University of Science and Technology, Wuhan Digital Engineering Institute, Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), School of Computer Science and Technology, Huazhong University of Science and Technology
Abstract:
Multimodal recommendation has gradually become the infrastructure of online media platforms, enabling them to provide personalized service to users through joint modeling of user historical behaviors (e.g., purchases, clicks) and items' various modalities (e.g., visual and textual). The majority of existing studies typically focus on utilizing modal features or modal-related graph structure to learn users' local interests. Nevertheless, these approaches encounter two limitations: (1) Shared updates of user ID embeddings result in the consequential coupling between collaboration and multimodal signals; (2) Lack of exploration into robust global user interests to alleviate the sparse interaction problems faced by local interest modeling. To address these issues, we propose a novel Local and Global Graph Learning-guided Multimodal Recommender (LGMRec), which jointly models local and global user interests. Specifically, we present a local graph embedding module to independently learn collaborative-related and modality-related embeddings of users and items with local topological relations. Moreover, a global hypergraph embedding module is designed to capture global user and item embeddings by modeling insightful global dependency relations. The global embeddings acquired within the hypergraph embedding space can then be combined with two decoupled local embeddings to improve the accuracy and robustness of recommendations. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our LGMRec over various state-of-the-art recommendation baselines, showcasing its effectiveness in modeling both local and global user interests.



Paperid:936
Authors:Zhongxuan Han, Chaochao Chen, Xiaolin Zheng, Meng Li, Weiming Liu, Binhui Yao, Yuyuan Li, Jianwei Yin
Zhejiang University, Zhejiang University, Zhejiang University, Harbin Institute of Technology (Shenzhen), Zhejiang university, Midea, Zhejiang University, Zhejiang University
Abstract:
Recommender systems are typically biased toward a small group of users, leading to severe unfairness in recommendation performance, i.e., the User-Oriented Fairness (UOF) issue. Existing research on UOF exhibits notable limitations in two phases of recommendation models. In the training phase, current methods fail to tackle the root cause of the UOF issue, which lies in the unfair training process between advantaged and disadvantaged users. In the evaluation phase, the current UOF metric lacks the ability to comprehensively evaluate varying cases of unfairness. In this paper, we aim to address the aforementioned limitations and ensure that recommendation models treat user groups of varying activity levels equally. In the training phase, we propose a novel Intra- and Inter-GrOup Optimal Transport framework (II-GOOT) to alleviate the data sparsity problem for disadvantaged users and narrow the training gap between advantaged and disadvantaged users. In the evaluation phase, we introduce a novel metric called ?-UOF, which enables the identification and assessment of various cases of UOF. This helps prevent recommendation models from leading to unfavorable fairness outcomes, where both advantaged and disadvantaged users experience subpar recommendation performance. We conduct extensive experiments on three real-world datasets based on four backbone recommendation models to prove the effectiveness of ?-UOF and the efficiency of our proposed II-GOOT.
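The abstract does not spell out how the optimal transport between user groups is computed, so the following is only a minimal sketch of the standard primitive behind OT-based alignment: an entropic transport plan between embeddings of an advantaged and a disadvantaged user group obtained with Sinkhorn iterations. All names (`sinkhorn_plan`, `advantaged_emb`, `disadvantaged_emb`, `reg`) are illustrative and not from the paper.

```python
# Hedged sketch: entropic optimal transport between two user groups via
# Sinkhorn iterations; II-GOOT's actual intra-/inter-group design is richer.
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals given a cost matrix."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform group weights
    K = np.exp(-cost / reg)                            # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]                 # transport plan

# toy embeddings for an advantaged and a disadvantaged user group
rng = np.random.default_rng(0)
advantaged_emb = rng.normal(size=(8, 16))
disadvantaged_emb = rng.normal(size=(5, 16))

# squared Euclidean cost between users of the two groups
cost = ((advantaged_emb[:, None, :] - disadvantaged_emb[None, :, :]) ** 2).sum(-1)
plan = sinkhorn_plan(cost)
print(plan.shape, plan.sum())   # (8, 5); entries sum to ~1
```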



Paperid:937
Authors:Haoyang He, Jiangning Zhang, Hongxu Chen, Xuhai Chen, Zhishan Li, Xu Chen, Yabiao Wang, Chengjie Wang, Lei Xie
Zhejiang University, Tencent Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Tencent, Tencent, Tencent, Zhejiang University
Abstract:
Reconstruction-based approaches have achieved remarkable outcomes in anomaly detection. The exceptional image reconstruction capabilities of recently popular diffusion models have sparked research efforts to utilize them for enhanced reconstruction of anomalous images. Nonetheless, these methods might face challenges related to the preservation of image categories and pixel-wise structural integrity in the more practical multi-class setting. To solve the above problems, we propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection, which consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion's denoising network, and a feature-space pre-trained feature extractor. Firstly, the SG network is proposed for reconstructing anomalous regions while preserving the original image's semantic information. Secondly, we introduce a Spatial-aware Feature Fusion (SFF) block to maximize reconstruction accuracy when dealing with extensively reconstructed areas. Thirdly, the input and reconstructed images are processed by a pre-trained feature extractor to generate anomaly maps based on features extracted at different scales. Experiments on the MVTec-AD and VisA datasets demonstrate the effectiveness of our approach, which surpasses the state-of-the-art methods, e.g., achieving 96.8/52.6 and 97.2/99.0 (AUROC/AP) for localization and detection respectively on the multi-class MVTec-AD dataset. Code will be available at https://lewandofskee.github.io/projects/diad.



Paperid:938
Authors:Junwei He, Qianqian Xu, Yangbangyan Jiang, Zitai Wang, Qingming Huang
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences School of Computer Science and Technology, University of Chinese Academy of Sciences, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, School of Computer Science and Technology, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences School of Computer Science and Technology, University of Chinese Academy of Sciences Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences
Abstract:
Graph anomaly detection is crucial for identifying nodes that deviate from regular behavior within graphs, benefiting various domains such as fraud detection and social networks. Although existing reconstruction-based methods have achieved considerable success, they may face the Anomaly Overfitting and Homophily Trap problems caused by the abnormal patterns in the graph, breaking the assumption that normal nodes are often better reconstructed than abnormal ones. Our observations indicate that models trained on graphs with fewer anomalies exhibit higher detection performance. Based on this insight, we introduce a novel two-stage framework called Anomaly-Denoised Autoencoders for Graph Anomaly Detection (ADA-GAD). In the first stage, we design a learning-free anomaly-denoised augmentation method to generate graphs with reduced anomaly levels. We pretrain graph autoencoders on these augmented graphs at multiple levels, which enables the graph autoencoders to capture normal patterns. In the next stage, the decoders are retrained for detection on the original graph, benefiting from the multi-level representations learned in the previous stage. Meanwhile, we propose the node anomaly distribution regularization to further alleviate Anomaly Overfitting. We validate the effectiveness of our approach through extensive experiments on both synthetic and real-world datasets.



Paperid:939
Authors:Yuchen He, Zeqing Yuan, Yihong Wu, Liqi Cheng, Dazhen Deng, Yingcai Wu
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
The immense popularity of racket sports has fueled substantial demand for tactical analysis with broadcast videos. However, existing manual methods require laborious annotation, and recent attempts leveraging video perception models are limited to low-level annotations like ball trajectories, overlooking tactics that necessitate an understanding of stroke techniques. State-of-the-art action segmentation models also struggle with technique recognition due to frequent occlusions and motion-induced blurring in racket sports videos. To address these challenges, we propose ViSTec, a Video-based Sports Technique recognition model inspired by human cognition that synergizes sparse visual data with rich contextual insights. Our approach integrates a graph to explicitly model strategic knowledge in stroke sequences and enhance technique recognition with contextual inductive bias. A two-stage action perception model is jointly trained to align with the contextual knowledge in the graph. Experiments demonstrate that our method outperforms existing models by a significant margin. Case studies with experts from the Chinese national table tennis team validate our model's capacity to automate analysis for technical actions and tactical strategies. More details are available at: https://ViSTec2024.github.io/.



Paperid:940
Authors:Xiaobin Hong, Wenzhong Li, Chaoqun Wang, Mingkai Lin, Sanglu Lu
Nanjing University, Nanjing University, The Chinese University of Hong Kong, Shenzhen, Nanjing University, Nanjing University
Abstract:
Graph Neural Networks (GNNs) have emerged as a powerful tool for modeling graph-structured data, exhibiting remarkable potential in applications such as social networks, recommendation systems, and molecular structures. However, conventional GNNs perform node-level feature aggregation from neighbors without considering graph-label information, which leads to the misaligned embedding problem that may cause a detrimental effect on graph-level tasks such as graph classification. In this paper, we propose a novel label-attentive distillation method called LAD-GNN for graph representation learning to solve this problem. It alternately trains a teacher model and a student GNN with a distillation-based approach. In the teacher model, a label-attentive encoder is proposed to encode the label information, fusing it with the node features to generate ideal embeddings. In the student model, the ideal embedding is used as intermediate supervision to urge the student GNN to learn class-friendly node embeddings that facilitate graph-level tasks. Generally, LAD-GNN is an enhanced GNN training approach that can be incorporated with an arbitrary GNN backbone to improve performance without a significant increase in computational cost. Extensive experiments with 7 GNN backbones on 10 benchmark datasets show that LAD-GNN improves the SOTA GNNs in graph classification accuracy. The source codes of LAD-GNN are publicly available at https://github.com/XiaobinHong/LAD-GNN.



Paperid:941
Authors:Dongpeng Hou, Chao Gao, Xuelong Li, Zhen Wang
Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University
Abstract:
Propagation models in social networks are critical, with extensive applications across various fields and downstream tasks. However, existing propagation models are often oversimplified, scenario-specific, and lack real-world user social attributes. These limitations, detached from real-world analysis, lead to inaccurate representations of the propagation process in social networks. To address these issues, we propose a User Features Attention-based DAG-Aware Variational Autoencoder (DAVA) for propagation graph generation. First, nearly 1 million pieces of user attribute data are collected. Then DAVA can integrate the analysis of propagation graph topology and corresponding user attributes as prior knowledge. By leveraging a lightweight attention-based framework and a sliding window mechanism based on BFS permutations weighted by user influence, DAVA significantly enhances the ability to generate realistic, large-scale propagation data, yielding graph scales ten times greater than those produced by existing SOTA methods. Every module of DAVA is flexible and extensible, allowing for easy substitution to suit other generation tasks. Additionally, we provide a comprehensive evaluation of DAVA, with one focus being the effectiveness of generated data in improving the performance of downstream tasks. During the generation process, we discover the Credibility Erosion Effect by modifying the generation rules, revealing a social phenomenon in social network propagation.



Paperid:942
Authors:Bay-Yuan Hsu, Chih-Ya Shen, Hao Shan Yuan, Wang-Chien Lee, De-Nian Yang
National Tsing Hua University, National Tsing Hua University, National Tsing Hua University, Pennsylvania State University, USA, Academia Sinica
Abstract:
Virtual Reality (VR) has emerged due to advancements in hardware and computer graphics. During the pandemic, conferences and exhibitions leveraging VR have gained attention. However, large-scale VR conferences face a significant problem not yet studied in the literature -- displaying too many irrelevant users on the screen, which may negatively impact the user experience. To address this issue, we formulate a new research problem, Social-Aware VR Conference Group Display Configuration (SVGD). Accordingly, we design the Social Utility-Aware VR Conference Group Formation (SVC) algorithm, which is a 2-approximation algorithm for SVGD. SVC iteratively selects either the P-Configuration or S-Configuration based on their effective ratios. This ensures that in each iteration, SVC identifies and chooses the solution with the highest current effectiveness. Experiments on real metaverse datasets show that the proposed SVC outperforms 11 baselines by 75% in terms of solution quality.



Paperid:943
Authors:Teng Hu, Jiangning Zhang, Ran Yi, Yuzhen Du, Xu Chen, Liang Liu, Yabiao Wang, Chengjie Wang
Shanghai Jiao Tong University, Tencent, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Tencent, Tencent, Tencent, Shanghai Jiao Tong University Tencent
Abstract:
Anomaly inspection plays an important role in industrial manufacturing. Existing anomaly inspection methods are limited in their performance due to insufficient anomaly data. Although anomaly generation methods have been proposed to augment the anomaly data, they either suffer from poor generation authenticity or inaccurate alignment between the generated anomalies and masks. To address the above problems, we propose AnomalyDiffusion, a novel diffusion-based few-shot anomaly generation model, which utilizes the strong prior information of a latent diffusion model learned from large-scale datasets to enhance the generation authenticity under few-shot training data. Firstly, we propose Spatial Anomaly Embedding, which consists of a learnable anomaly embedding and a spatial embedding encoded from an anomaly mask, disentangling the anomaly information into anomaly appearance and location information. Moreover, to improve the alignment between the generated anomalies and the anomaly masks, we introduce a novel Adaptive Attention Re-weighting Mechanism. Based on the disparities between the generated anomaly image and the normal sample, it dynamically guides the model to focus more on the areas with less noticeable generated anomalies, enabling generation of accurately-matched anomalous image-mask pairs. Extensive experiments demonstrate that our model significantly outperforms the state-of-the-art methods in generation authenticity and diversity, and effectively improves the performance of downstream anomaly inspection tasks. The code and data are available at https://github.com/sjtuplayer/anomalydiffusion.



Paperid:944
Authors:Tianhao Huang, Xuan Pan, Xiangrui Cai, Ying Zhang, Xiaojie Yuan
College of Computer Science, Nankai University, College of Computer Science, Nankai University Tianjin Key Laboratory of Network and Data Security Technology, Tianjin, China, College of Computer Science, Nankai University Key Laboratory of Data and Intelligent System Security, Ministry of Education, China Science and Technology on Communication Networks Laboratory, Shijiazhuang, China, College of Computer Science, Nankai University Key Laboratory of Data and Intelligent System Security, Ministry of Education, China, College of Computer Science, Nankai University Tianjin Key Laboratory of Network and Data Security Technology, Tianjin, China Key Laboratory of Data and Intelligent System Security, Ministry of Education, China
Abstract:
The next Point-of-Interest (POI) recommendation task aims to provide a dynamic ranking of POIs based on users' current check-in trajectories. The recommendation performance of this task is contingent upon a comprehensive understanding of users' personalized behavioral patterns through Location-based Social Networks (LBSNs) data. While prior studies have adeptly captured sequential patterns and transitional relationships within users' check-in trajectories, a noticeable gap persists in devising a mechanism for discerning specialized behavioral patterns during distinct time slots, such as noon, afternoon, or evening. In this paper, we introduce an innovative data structure termed the ``Mobility Tree'', tailored for hierarchically describing users' check-in records. The Mobility Tree encompasses multi-granularity time slot nodes to learn user preferences across varying temporal periods. Meanwhile, we propose the Mobility Tree Network (MTNet), a multitask framework for personalized preference learning based on Mobility Trees. We develop a four-step node interaction operation to propagate feature information from the leaf nodes to the root node. Additionally, we adopt a multitask training strategy to push the model towards learning a robust representation. The comprehensive experimental results demonstrate the superiority of MTNet over eleven state-of-the-art next POI recommendation models across three real-world LBSN datasets, substantiating the efficacy of time slot preference learning facilitated by the Mobility Tree.



Paperid:945
Authors:Cheng Ji, Zixuan Huang, Qingyun Sun, Hao Peng, Xingcheng Fu, Qian Li, Jianxin Li
Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, China School of Computer Science and Engineering, Beihang University, China, School of Computer Science and Engineering, Beihang University, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, China School of Computer Science and Engineering, Beihang University, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, China School of Computer Science and Engineering, Beihang University, China, Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, China School of Computer Science and Engineering, Beihang University, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, China School of Computer Science and Engineering, Beihang University, China
Abstract:
Graph contrastive learning (GCL) has demonstrated remarkable efficacy in graph representation learning. However, previous studies have overlooked the inherent conflict that arises when employing graph neural networks (GNNs) as encoders for node-level contrastive learning. This conflict pertains to the partial incongruity between the feature aggregation mechanism of graph neural networks and the embedding distinction characteristic of contrastive learning. Theoretically, to investigate the location and extent of the conflict, we analyze the participation of message-passing from the gradient perspective of the InfoNCE loss. Different from contrastive learning in other domains, the conflict in GCL arises due to the presence of certain samples that contribute to the gradients of both positives and negatives simultaneously under message passing, which correspond to opposite optimization directions. To further address the conflict issue, we propose a practical framework called ReGCL, which utilizes theoretical findings on GCL gradients to effectively improve graph contrastive learning. Specifically, two gradient-based strategies are devised in terms of both message passing and the loss function to mitigate the conflict. Firstly, a gradient-guided structure learning method is proposed in order to acquire a structure that is adapted to contrastive learning principles. Secondly, a gradient-weighted InfoNCE loss function is designed to reduce the impact of false negative samples with high probabilities, specifically from the standpoint of the graph encoder. Extensive experiments demonstrate the superiority of the proposed method in comparison to state-of-the-art baselines across various node classification benchmarks.
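For readers unfamiliar with the objective whose gradients are analyzed above, the following is a minimal sketch of a plain InfoNCE loss over two augmented views of node embeddings; it is not the ReGCL implementation, and the gradient-weighted variant proposed in the paper would amount to reweighting individual terms in the denominator.

```python
# Hedged sketch: plain InfoNCE over two views of the same nodes (z1, z2 are
# assumed inputs); diagonal pairs are positives, off-diagonal pairs negatives.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """z1, z2: (N, d) embeddings of the same N nodes under two views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                      # (N, N) cross-view similarities
    return F.cross_entropy(sim, torch.arange(z1.size(0)))

z1 = torch.randn(32, 64, requires_grad=True)
z2 = torch.randn(32, 64)
loss = info_nce(z1, z2)
loss.backward()                                  # gradients carry both positive and negative terms
print(float(loss))
```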



Paperid:946
Authors:Pengyue Jia, Yichao Wang, Shanru Lin, Xiaopeng Li, Xiangyu Zhao, Huifeng Guo, Ruiming Tang
City University of Hong Kong, Huawei Noah's Ark Lab, City University of Hong Kong, City University of Hong Kong, City University of Hong Kong, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab
Abstract:
To enhance the efficacy of multi-scenario services in industrial recommendation systems, the emergence of multi-domain recommendation has become prominent, which entails simultaneous modeling of all domains through a unified model, effectively capturing commonalities and differences among them. However, current methods rely on manual domain partitioning, which overlooks the intricate domain relationships and the heterogeneity of different domains during joint optimization, hindering the integration of domain commonalities and differences. To address these challenges, this paper proposes a universal and flexible framework, D3, aimed at optimizing the multi-domain recommendation pipeline from three key aspects. Firstly, an attention-based domain adaptation module is introduced to automatically identify and incorporate domain-sensitive features during training. Secondly, we propose a fusion gate module that enables the seamless integration of commonalities and diversities among domains, allowing for implicit characterization of intricate domain relationships. Lastly, we tackle the issue of joint optimization by deriving loss weights from two complementary viewpoints: domain complexity and domain specificity, alleviating inconsistencies among different domains during the training phase. Experiments on three public datasets demonstrate the effectiveness and superiority of our proposed framework. In addition, D3 has been implemented on a real-life, high-traffic internet platform catering to millions of users daily.



Paperid:947
Authors:Tianrui Jia, Haoyang Li, Cheng Yang, Tao Tao, Chuan Shi
Beijing University of Posts and Telecommunications, Tsinghua University, School of Computer Science, Beijing University of Posts and Telecommunications, China Mobile Information Technology Co. Ltd., Beijing University of Posts and Telecommunications
Abstract:
Graph neural networks (GNNs) have been demonstrated to perform well in graph representation learning, but they often lack generalization capability when tackling out-of-distribution (OOD) data. Graph invariant learning methods, backed by the invariance principle among defined multiple environments, have shown effectiveness in dealing with this issue. However, existing methods heavily rely on well-predefined or accurately generated environment partitions, which are hard to obtain in practice, leading to sub-optimal OOD generalization performance. In this paper, we propose a novel graph invariant learning method based on an invariant and variant patterns co-mixup strategy, which is capable of jointly generating mixed multiple environments and capturing invariant patterns from the mixed graph data. Specifically, we first adopt a subgraph extractor to identify invariant subgraphs. Subsequently, we design one novel co-mixup strategy, i.e., jointly conducting environment mixup and invariant mixup. For the environment mixup, we mix the variant environment-related subgraphs so as to generate sufficiently diverse multiple environments, which is important to guarantee the quality of graph invariant learning. For the invariant mixup, we mix the invariant subgraphs, further encouraging the model to capture invariant patterns behind graphs while getting rid of spurious correlations for OOD generalization. We demonstrate that the proposed environment mixup and invariant mixup can mutually promote each other. Extensive experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms the state of the art under various distribution shifts.



Paperid:948
Authors:Pengfei Jiao, Hongqian Chen, Qing Bao, Wang Zhang, Huaming Wu
Hangzhou Dianzi University, Hangzhou Dianzi University, Hangzhou Dianzi University, Tianjin University, Tianjin University
Abstract:
Information diffusion prediction plays a crucial role in understanding the propagation of information in social networks, encompassing both macroscopic and microscopic prediction tasks. Macroscopic prediction estimates the overall impact of information diffusion, while microscopic prediction focuses on identifying the next user to be influenced. While prior research often concentrates on one of these aspects, only a few tackle both concurrently. These two tasks provide complementary insights into the diffusion process at different levels, revealing common traits and unique attributes. The exploration of leveraging common features across these tasks to enhance information prediction remains an underexplored avenue. In this paper, we propose an intuitive and effective model that addresses both macroscopic and microscopic prediction tasks. Our approach considers the interactions and dynamics among cascades at the macro level and incorporates the social homophily of users in social networks at the micro level. Additionally, we introduce adversarial training and orthogonality constraints to ensure the integrity of shared features. Experimental results on four datasets demonstrate that our model significantly outperforms state-of-the-art methods.
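The abstract mentions orthogonality constraints without detail; a common way to impose such a constraint is to penalize the squared Frobenius norm of the product between shared and task-specific feature matrices, pushing the two subspaces apart. The sketch below illustrates that generic penalty only; tensor names (`shared`, `macro_specific`, `micro_specific`) are placeholders, not the paper's code.

```python
# Hedged sketch of a generic orthogonality penalty between shared and
# task-specific representations, as commonly used with adversarial sharing.
import torch

def orthogonality_penalty(shared, specific):
    """shared, specific: (N, d) feature matrices for the same N cascades."""
    return torch.norm(shared.t() @ specific, p="fro") ** 2

shared = torch.randn(64, 32, requires_grad=True)
macro_specific = torch.randn(64, 32, requires_grad=True)
micro_specific = torch.randn(64, 32, requires_grad=True)

loss_orth = orthogonality_penalty(shared, macro_specific) + \
            orthogonality_penalty(shared, micro_specific)
loss_orth.backward()
print(float(loss_orth))
```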



Paperid:949
Authors:Hyunjun Ju, SeongKu Kang, Dongha Lee, Junyoung Hwang, Sanghwan Jang, Hwanjo Yu
Pohang University of Science and Technology (POSTECH), Republic of Korea, Pohang University of Science and Technology (POSTECH), Republic of Korea University of Illinois at Urban-Champaign (UIUC), United States, Yonsei University, Republic of Korea, Pohang University of Science and Technology (POSTECH), Republic of Korea, Pohang University of Science and Technology (POSTECH), Republic of Korea, Pohang University of Science and Technology (POSTECH), Republic of Korea
Abstract:
Recently, web platforms have been operating various service domains simultaneously. Targeting a platform that operates multiple service domains, we introduce a new task, Multi-Domain Recommendation to Attract Users (MDRAU), which recommends items from multiple ``unseen'' domains with which each user has not interacted yet, by using knowledge from the user's ``seen'' domains. In this paper, we point out two challenges of the MDRAU task. First, there are numerous possible combinations of mappings from seen to unseen domains because users have usually interacted with a different subset of service domains. Second, a user might have different preferences for each of the target unseen domains, which requires recommendations to reflect users' preferences on domains as well as items. To tackle these challenges, we propose the DRIP framework, which models users' preferences at two levels (i.e., domain and item) and learns various seen-unseen domain mappings in a unified way with masked domain modeling. Our extensive experiments demonstrate the effectiveness of DRIP on the MDRAU task and its ability to capture users' domain-level preferences.



Paperid:950
Authors:Soopil Kim, Sion An, Philip Chikontwe, Myeongkyun Kang, Ehsan Adeli, Kilian M. Pohl, Sang Hyun Park
DGIST Stanford University, DGIST, DGIST, DGIST Stanford University, Stanford University, Stanford University, DGIST
Abstract:
Logical anomalies (LA) refer to data violating underlying logical constraints, e.g., the quantity, arrangement, or composition of components within an image. Accurately detecting such anomalies requires models to reason about various component types through segmentation. However, curation of pixel-level annotations for semantic segmentation is both time-consuming and expensive. Although there are some prior few-shot or unsupervised co-part segmentation algorithms, they often fail on images with industrial objects. These images have components with similar textures and shapes, and precise differentiation proves challenging. In this study, we introduce a novel component segmentation model for LA detection that leverages a few labeled samples and unlabeled images sharing logical constraints. To ensure consistent segmentation across unlabeled images, we employ a histogram matching loss in conjunction with an entropy loss. As segmentation predictions play a crucial role, we propose to enhance both local and global sample validity detection by capturing key aspects from visual semantics via three memory banks: class histograms, component composition embeddings and patch-level representations. For effective LA detection, we propose an adaptive scaling strategy to standardize anomaly scores from different memory banks at inference. Extensive experiments on the public benchmark MVTec LOCO AD reveal our method achieves 98.1% AUROC in LA detection vs. 89.6% from competing methods.



Paperid:951
Authors:Taeri Kim, Jiho Heo, Hongil Kim, Kijung Shin, Sang-Wook Kim
Department of Computer Science, Hanyang University, South Korea, Department of Computer Science, Hanyang University, South Korea, Department of Artificial Intelligence, Hanyang University, South Korea, Kim Jaechul Graduate School of AI & School of Electrical Engineering, KAIST, South Korea, Department of Computer Science, Hanyang University, South Korea
Abstract:
We address the medication recommendation problem, which aims to recommend effective medications for a patient's current visit by utilizing information (e.g., diagnoses and procedures) given at the patient's current and past visits. While there exist a number of recommender systems designed for this problem, we point out that they are challenged in accurately capturing the relation (spec., the degree of relevance) between the current and each of the past visits for the patient when obtaining her current health status, which is the basis for recommending medications. To address this limitation, we propose a novel medication recommendation framework, named VITA, based on the following two novel ideas: (1) relevantVisit selectIon; (2) Target-aware Attention. Through extensive experiments using real-world datasets, we demonstrate the superiority of VITA (spec., up to 5.67% higher accuracy, in terms of Jaccard, than the best competitor) and the effectiveness of its two core ideas. The code is available at https://github.com/jhheo0123/VITA.
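The abstract describes Target-aware Attention only at a high level; the sketch below shows the general idea of letting the current visit act as the query and each past visit as a key/value, so the degree of relevance of every past visit becomes an explicit attention weight. This is not the released VITA code, and all tensor names are illustrative.

```python
# Hedged sketch: attention over past visits conditioned on the current visit.
import torch
import torch.nn.functional as F

def target_aware_attention(current_visit, past_visits):
    """current_visit: (d,), past_visits: (T, d) -> weighted history summary."""
    scores = past_visits @ current_visit / past_visits.size(-1) ** 0.5  # (T,)
    weights = F.softmax(scores, dim=0)          # relevance of each past visit
    return weights @ past_visits, weights       # summary (d,), weights (T,)

current_visit = torch.randn(128)
past_visits = torch.randn(6, 128)               # six previous visits
summary, relevance = target_aware_attention(current_visit, past_visits)
print(relevance)                                # one relevance weight per past visit
```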



Paperid:952
Authors:Aritra Konar, Nicholas D. Sidiropoulos
KU Leuven, University of Virginia
Abstract:
Dense subgraph discovery (DSD) is a key primitive in graph mining that typically deals with extracting cliques and near-cliques. In this paper, we revisit the optimal quasi-clique (OQC) formulation for DSD and establish that it is NP-hard. In addition, we reveal the hitherto unknown property that OQC can be used to explore the entire spectrum of densest subgraphs of all distinct sizes by appropriately varying a single hyperparameter, thereby forging an intimate link with the classic densest-k-subgraph problem (DkS). We corroborate these findings on real-world graphs by applying the simple greedy algorithm for OQC with improved hyperparameter tuning, to quickly generate high-quality approximations of the size-density frontier. Our findings indicate that OQC not only extracts high-quality (near-)cliques, but also large and loosely-connected subgraphs that exhibit well-defined local community structure. The latter discovery is particularly intriguing, since OQC is not explicitly geared towards community detection.
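As a point of reference, the OQC objective is usually written as f(S) = |E(S)| - alpha * |S|(|S|-1)/2, and the simple greedy algorithm repeatedly peels the minimum-degree vertex while tracking the best set seen. The sketch below follows that textbook form under an illustrative `alpha`; the paper's improved hyperparameter tuning is not reproduced here.

```python
# Hedged sketch: greedy peeling heuristic for the optimal quasi-clique objective.
import networkx as nx

def objective(H, alpha):
    n = H.number_of_nodes()
    return H.number_of_edges() - alpha * n * (n - 1) / 2

def greedy_oqc(G, alpha=1.0 / 3):
    H = G.copy()
    best_set, best_val = set(H.nodes), objective(H, alpha)
    while H.number_of_nodes() > 1:
        v = min(H.nodes, key=H.degree)        # peel the minimum-degree vertex
        H.remove_node(v)
        val = objective(H, alpha)
        if val > best_val:
            best_set, best_val = set(H.nodes), val
    return best_set, best_val

G = nx.karate_club_graph()
subgraph, value = greedy_oqc(G, alpha=1.0 / 3)
print(len(subgraph), value)
```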



Paperid:953
Authors:Dexu Kong, Anping Zhang, Yang Li
Shenzhen Key Laboratory of Ubiquitous Data Enabling, Shenzhen International Graduate School, Tsinghua University, Shenzhen Key Laboratory of Ubiquitous Data Enabling, Shenzhen International Graduate School, Tsinghua University, Shenzhen Key Laboratory of Ubiquitous Data Enabling, Shenzhen International Graduate School, Tsinghua University
Abstract:
Dynamic community detection methods often lack effective mechanisms to ensure temporal consistency, hindering the analysis of network evolution. In this paper, we propose a novel deep graph clustering framework with temporal consistency regularization on inter-community structures, inspired by the concept of minimal network topological changes within short intervals. Specifically, to address the representation collapse problem, we first introduce MFC, a matrix factorization-based deep graph clustering algorithm that preserves node embeddings. Based on static clustering results, we construct probabilistic community networks and compute their persistent homology, a robust topological measure, to assess structural similarity between them. Moreover, a novel neural network regularization, TopoReg, is introduced to ensure the preservation of topological similarity between inter-community structures over time intervals. Our approach enhances temporal consistency and clustering accuracy on real-world datasets with both fixed and varying numbers of communities. It is also a pioneering application of TDA in temporally persistent community detection, offering an insightful contribution to the field of network analysis. Code and data are available at the public git repository: https://github.com/kundtx/MFC-TopoReg.



Paperid:954
Authors:Weiyang Kong, Ziyu Guo, Yubao Liu
Sun Yat-Sen University, Sun Yat-Sen University, Sun Yat-Sen University Guangdong Key Laboratory of Big Data Analysis and Processing
Abstract:
Traffic flow forecasting is a classical spatio-temporal data mining problem with many real-world applications. Recently, various methods based on Graph Neural Networks (GNNs) have been proposed for the problem and have achieved impressive prediction performance. However, we argue that the majority of existing methods disregard the importance of certain nodes (referred to as pivotal nodes) that naturally exhibit extensive connections with multiple other nodes. Predicting on pivotal nodes poses a challenge due to their complex spatio-temporal dependencies compared to other nodes. In this paper, we propose a novel GNN-based method called Spatio-Temporal Pivotal Graph Neural Networks (STPGNN) to address the above limitation. We introduce a pivotal node identification module for identifying pivotal nodes. We propose a novel pivotal graph convolution module, enabling precise capture of spatio-temporal dependencies centered around pivotal nodes. Moreover, we propose a parallel framework capable of extracting spatio-temporal traffic features on both pivotal and non-pivotal nodes. Experiments on seven real-world traffic datasets verify our proposed method's effectiveness and efficiency compared to state-of-the-art baselines.
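One simple reading of the pivotal-node idea, used here only as an illustration, is to rank road-network nodes by their aggregate connection strength in the weighted adjacency matrix and keep the top-k as pivotal. STPGNN's actual identification module is learned, so the sketch below is an assumption-laden stand-in with illustrative names.

```python
# Hedged sketch: pick pivotal nodes as the k most strongly connected nodes.
import numpy as np

def pivotal_nodes(adj, k):
    """adj: (N, N) weighted adjacency; returns indices of the k most connected nodes."""
    strength = adj.sum(axis=0) + adj.sum(axis=1)   # in- plus out-connection strength
    return np.argsort(strength)[-k:][::-1]

rng = np.random.default_rng(0)
adj = rng.random((20, 20)) * (rng.random((20, 20)) < 0.2)  # sparse toy road graph
print(pivotal_nodes(adj, k=3))
```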



Paperid:955
Authors:Kai-Huang Lai, Zhe-Rui Yang, Pei-Yuan Lai, Chang-Dong Wang, Mohsen Guizani, Min Chen
Sun Yat-sen University, Sun Yat-sen University, South China Technology Commercialization Center, Sun Yat-sen University, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), South China University of Technology Pazhou Lab
Abstract:
Reciprocal recommender systems (RRS) have been widely used in online platforms such as online dating and recruitment. They can simultaneously fulfill the needs of both parties involved in the recommendation process. Due to the inherent nature of the task, interaction data is relatively sparse compared to other recommendation tasks. Existing works mainly address this issue through content-based recommendation methods. However, these methods often implicitly model textual information from a unified perspective, making it challenging to capture the distinct intentions held by each party, which further leads to limited performance and a lack of interpretability. In this paper, we propose a Knowledge-Aware Explainable Reciprocal Recommender System (KAERR), which models meta-paths between two parties independently, considering their respective perspectives and requirements. Various meta-paths are fused using an attention-based mechanism, where the attention weights unveil dual-perspective preferences and provide recommendation explanations for both parties. Extensive experiments on two real-world datasets from diverse scenarios demonstrate that the proposed model outperforms state-of-the-art baselines, while also delivering compelling reasons for recommendations to both parties.



Paperid:956
Authors:Daixun Li, Weiying Xie, Jiaqing Zhang, Yunsong Li
Xidian University, Xidian University, Xidian University, Xidian University
Abstract:
High-dimensional images, known for their rich semantic information, are widely applied in remote sensing and other fields. The spatial information in these images reflects the object's texture features, while the spectral information reveals the potential spectral representations across different bands. Currently, the understanding of high-dimensional images remains limited to a single-domain perspective, which degrades performance. Motivated by the masking texture effect observed in the human visual system, we present a multi-domain diffusion-driven feature learning network (MDFL), a scheme to redefine the effective information domain that the model really focuses on. This method employs diffusion-based posterior sampling to explicitly consider joint information interactions between the high-dimensional manifold structures in the spectral, spatial, and frequency domains, thereby eliminating the influence of masking texture effects in visual models. Additionally, we introduce a feature reuse mechanism to gather deep and raw features of high-dimensional data. We demonstrate that MDFL significantly improves the feature extraction performance of high-dimensional data, thereby providing a powerful aid for revealing the intrinsic patterns and structures of such data. The experimental results on three multi-modal remote sensing datasets show that MDFL reaches an average overall accuracy of 98.25%, outperforming various state-of-the-art baseline schemes. Code is available at https://github.com/LDXDU/MDFL-AAAI-24.



Paperid:957
Authors:Kexin Li, Chengjiang Long, Shengyu Zhang, Xudong Tang, Zhichao Zhai, Kun Kuang, Jun Xiao
Zhejiang University, Meta Reality Labs, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
Next set recommendation aims to predict the items that are likely to be bought in the next purchase. Central to this endeavor is the task of capturing intra-set and cross-set correlations among items. However, the modeling of cross-set correlations poses challenges due to specific issues. Primarily, these correlations are often implicit, and the prevailing approach of establishing an indiscriminate link across the entire set of objects neglects factors like purchase frequency and correlations between purchased items. Such hastily formed connections across sets introduce substantial noise. Additionally, the preeminence of high-frequency items in numerous sets could potentially overshadow and distort correlation modeling with respect to low-frequency items. Thus, we are devoted to mitigating misleading inter-set correlations. With a fresh perspective rooted in causality, we delve into the question of whether correlations between a particular item and items from other sets should be relied upon for item representation learning and set prediction. Technically, we introduce the Counterfactual Correlation Inference framework for next set recommendation, denoted as CoreRec. This framework establishes a counterfactual scenario in which the recommendation model impedes cross-set correlations to generate intervened predictions. By contrasting these intervened predictions with the original ones, we gauge the causal impact of inter-set neighbors on set prediction, essentially assessing whether they contribute to spurious correlations. During testing, we introduce a post-trained switch module that selects between set-aware item representations derived from either the original or the counterfactual scenarios. To validate our approach, we extensively experiment using three real-world datasets, affirming both the effectiveness of CoreRec and the cogency of our analytical approach.



Paperid:958
Authors:Lei Li, Jianxun Lian, Xiao Zhou, Xing Xie
Gaoling School of Artificial Intelligence, Renmin University of China, Microsoft Research Asia, Gaoling School of Artificial Intelligence, Renmin University of China, Microsoft Research Asia
Abstract:
Retrieval models aim at selecting a small set of item candidates that match the preference of a given user. They play a vital role in large-scale recommender systems since subsequent models such as rankers highly depend on the quality of item candidates. However, most existing retrieval models employ a single-round inference paradigm, which may not adequately capture the dynamic nature of user preferences and can get stuck in one area of the item space. In this paper, we propose Ada-Retrieval, an adaptive multi-round retrieval paradigm for recommender systems that iteratively refines user representations to better capture potential candidates in the full item space. Ada-Retrieval comprises two key modules: the item representation adapter and the user representation adapter, designed to inject context information into items' and users' representations. The framework maintains a model-agnostic design, allowing seamless integration with various backbone models such as RNNs or Transformers. We perform experiments on three widely used public datasets, incorporating five powerful sequential recommenders as backbone models. Our results demonstrate that Ada-Retrieval significantly enhances the performance of various base models, with consistent improvements observed across different datasets. Our code and data are publicly available at: https://github.com/ll0ruc/Ada-Retrieval.



Paperid:959
Authors:Rui Li, Liyang He, Qi Liu, Yuze Zhao, Zheng Zhang, Zhenya Huang, Yu Su, Shijin Wang
Anhui Province Key Laboratory of Big Data Analysis and Application School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, Anhui Province Key Laboratory of Big Data Analysis and Application School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, Anhui Province Key Laboratory of Big Data Analysis and Application School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, Anhui Province Key Laboratory of Big Data Analysis and Application School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, Anhui Province Key Laboratory of Big Data Analysis and Application School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, Anhui Province Key Laboratory of Big Data Analysis and Application School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, School of Computer Science and Artificial Intelligence, Hefei Normal University, State Key Laboratory of Cognitive Intelligence iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd.
Abstract:
Multilingual code retrieval aims to find code snippets relevant to a user's query from a multilingual codebase, which plays a crucial role in software development and expands application scenarios compared to classical monolingual code retrieval. Despite the performance improvements achieved by previous studies, two crucial problems are overlooked in the multilingual scenario. First, certain programming languages face data scarcity in specific domains, resulting in limited representation capabilities within those domains. Second, different programming languages can be used interchangeably within the same domain, making it challenging for multilingual models to accurately identify the intended programming language of a user's query. To address these issues, we propose the CommONalities and SpecIalties Driven Multilingual CodE Retrieval Framework (CONSIDER), which includes two modules. The first module enhances the representation of various programming languages by modeling pairwise and global commonalities among them. The second module introduces a novel contrastive learning negative sampling algorithm that leverages language confusion to automatically extract specific language features. Through our experiments, we confirm the significant benefits of our model in real-world multilingual code retrieval scenarios in various aspects. Furthermore, an evaluation demonstrates the effectiveness of our proposed CONSIDER framework in monolingual scenarios as well. Our source code is available at https://github.com/smsquirrel/consider.



Paperid:960
Authors:Xiaoxi Li, Yujia Zhou, Zhicheng Dou
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education
Abstract:
Generative information retrieval, encompassing two major tasks of Generative Document Retrieval (GDR) and Grounded Answer Generation (GAR), has gained significant attention in natural language processing. Existing methods for GDR and GAR rely on separate retrieval and reader modules, which hinder simultaneous optimization. To overcome this, we present UniGen, a Unified Generative framework for retrieval and question answering that integrates both tasks into a single generative model leveraging the capabilities of large language models. UniGen employs a shared encoder and two distinct decoders for generative retrieval and question answering. To facilitate the learning of both tasks, we introduce connectors, generated by large language models, to bridge the gaps between query inputs and generation targets, as well as between document identifiers and answers. Furthermore, we propose an iterative enhancement strategy that leverages generated answers and retrieved documents to iteratively improve both tasks. Through extensive experiments on the MS MARCO and NQ datasets, we demonstrate the effectiveness of UniGen, showcasing its superior performance in both retrieval and question answering tasks.



Paperid:961
Authors:Yangning Li, Tingwei Lu, Hai-Tao Zheng, Yinghui Li, Shulin Huang, Tianyu Yu, Jun Yuan, Rui Zhang
Shenzhen International Graduate School, Tsinghua University PengCheng Laboratory, Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University PengCheng Laboratory, Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University, Huawei' Noah's Ark Lab, Huawei' Noah's Ark Lab
Abstract:
The Entity Set Expansion (ESE) task aims to expand a handful of seed entities with new entities belonging to the same semantic class. Conventional ESE methods are based on mono-modality (i.e., the literal modality), and they struggle to deal with complex entities in the real world such as (1) Negative entities with fine-grained semantic differences. (2) Synonymous entities. (3) Polysemous entities. (4) Long-tailed entities. These challenges prompt us to propose novel Multi-modal Entity Set Expansion (MESE), where models integrate information from multiple modalities to represent entities. Intuitively, the benefits of multi-modal information for ESE are threefold: (1) Different modalities can provide complementary information. (2) Multi-modal information provides a unified signal via common visual properties for the same semantic class or entity. (3) Multi-modal information offers robust alignment signals for synonymous entities. To assess model performance in MESE, we constructed the MESED dataset, which is the first multi-modal dataset for ESE with large-scale and elaborate manual calibration. A powerful multi-modal model, MultiExpan, is proposed, which is pre-trained on four multimodal pre-training tasks. The extensive experiments and analyses on MESED demonstrate the high quality of the dataset and the effectiveness of our MultiExpan, as well as pointing out directions for future research. The benchmark and code are public at https://github.com/THUKElab/MESED.



Paperid:962
Authors:Yibo Li, Xiao Wang, Hongrui Liu, Chuan Shi
Beijing University of Posts and Telecommunications, Beihang University, Ant Group, Beijing University of Posts and Telecommunications
Abstract:
Recent studies reveal the connection between GNNs and the diffusion process, which motivates many diffusion-based GNNs to be proposed. However, since these two mechanisms are closely related, one fundamental question naturally arises: Is there a general diffusion framework that can formally unify these GNNs? The answer to this question can not only deepen our understanding of the learning process of GNNs, but also may open a new door to design a broad new class of GNNs. In this paper, we propose a general diffusion equation framework with the fidelity term, which formally establishes the relationship between the diffusion process and more GNNs. Meanwhile, with this framework, we identify one characteristic of graph diffusion networks, i.e., the current neural diffusion process only corresponds to the first-order diffusion equation. However, by an experimental investigation, we show that the labels of high-order neighbors actually exhibit the monophily property, which induces the similarity based on labels among high-order neighbors without requiring the similarity among first-order neighbors. This discovery motivates the design of a new high-order neighbor-aware diffusion equation, and we derive a new type of graph diffusion network (HiD-Net) based on the framework. With the high-order diffusion equation, HiD-Net is more robust against attacks and works on both homophily and heterophily graphs. We not only theoretically analyze the relation between HiD-Net and high-order random walk, but also provide a theoretical convergence guarantee. Extensive experimental results well demonstrate the effectiveness of HiD-Net over state-of-the-art graph diffusion networks.
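To make the "diffusion equation with a fidelity term" concrete, the sketch below writes one explicit Euler step of a generic graph diffusion in which features spread along the normalized adjacency while the fidelity term pulls them back toward the input features. This is assembled from the abstract's description, not from the HiD-Net code, and the step size and fidelity weight are illustrative.

```python
# Hedged sketch: one Euler step of x <- x + dt * ((A_norm x - x) + lam (x0 - x)).
import numpy as np

def diffusion_step(x, x0, adj, dt=0.1, lam=0.5):
    deg = adj.sum(axis=1, keepdims=True).clip(min=1e-8)
    a_norm = adj / deg                              # row-normalized adjacency
    return x + dt * ((a_norm @ x - x) + lam * (x0 - x))

rng = np.random.default_rng(0)
adj = (rng.random((10, 10)) < 0.3).astype(float)
x0 = rng.normal(size=(10, 4))                       # input node features
x = x0.copy()
for _ in range(20):                                 # run the diffusion forward
    x = diffusion_step(x, x0, adj)
print(np.abs(x - x0).mean())                        # fidelity keeps x near x0
```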



Paperid:963
Authors:Yongqi Li, Nan Yang, Liang Wang, Furu Wei, Wenjie Li
The Hong Kong Polytechnic University, Microsoft, Microsoft, Microsoft, The Hong Kong Polytechnic University
Abstract:
Generative retrieval stands out as a promising new paradigm in text retrieval that aims to generate identifier strings of relevant passages as the retrieval target. This generative paradigm taps into powerful generative language models, distinct from traditional sparse or dense retrieval methods. However, only learning to generate is insufficient for generative retrieval. Generative retrieval learns to generate identifiers of relevant passages as an intermediate goal and then converts predicted identifiers into the final passage rank list. The disconnect between the learning objective of autoregressive models and the desired passage ranking target leads to a learning gap. To bridge this gap, we propose a learning-to-rank framework for generative retrieval, dubbed LTRGR. LTRGR enables generative retrieval to learn to rank passages directly, optimizing the autoregressive model toward the final passage ranking target via a rank loss. This framework only requires an additional learning-to-rank training phase to enhance current generative retrieval systems and does not add any burden to the inference stage. We conducted experiments on three public benchmarks, and the results demonstrate that LTRGR achieves state-of-the-art performance among generative retrieval methods. The code and checkpoints are released at https://github.com/liyongqi67/LTRGR.
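The rank loss mentioned above is not specified in the abstract; a common choice in learning-to-rank is a margin-based pairwise loss over passage scores, where each score would come from the autoregressive model's likelihood of generating that passage's identifier. The sketch below shows only that generic loss, with placeholder score tensors rather than the LTRGR release.

```python
# Hedged sketch: margin-based pairwise rank loss over passage scores.
import torch
import torch.nn.functional as F

def pairwise_rank_loss(pos_scores, neg_scores, margin=1.0):
    """Encourage each relevant passage to outscore its paired negative by `margin`."""
    return F.relu(margin - (pos_scores - neg_scores)).mean()

pos_scores = torch.randn(16, requires_grad=True)   # scores of relevant passages
neg_scores = torch.randn(16, requires_grad=True)   # scores of sampled negatives
loss = pairwise_rank_loss(pos_scores, neg_scores)
loss.backward()
print(float(loss))
```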



Paperid:964
Authors:Zechen Li, Weiming Huang, Kai Zhao, Min Yang, Yongshun Gong, Meng Chen
School of Software, Shandong University, School of Computer Science and Engineering, Nanyang Technological University, Robinson College of Business, Georgia State University, School of Software, Shandong University, School of Software, Shandong University, School of Software, Shandong University Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources
Abstract:
Recently, learning urban region representations utilizing multi-modal data (information views) has become increasingly popular, for deep understanding of the distributions of various socioeconomic features in cities. However, previous methods usually blend multi-view information in a posterior stage, falling short in learning coherent and consistent representations across different views. In this paper, we form a new pipeline to learn consistent representations across varying views, and propose the multi-view Contrastive Prediction model for urban Region embedding (ReCP), which leverages the multiple information views from point-of-interest (POI) and human mobility data. Specifically, ReCP comprises two major modules, namely an intra-view learning module utilizing contrastive learning and feature reconstruction to capture the unique information from each single view, and an inter-view learning module that perceives the consistency between the two views using a contrastive prediction learning scheme. We conduct thorough experiments on two downstream tasks to assess the proposed model, i.e., land use clustering and region popularity prediction. The experimental results demonstrate that our model outperforms state-of-the-art baseline methods significantly in urban region representation learning.



Paperid:965
Authors:Ke Liang, Sihang Zhou, Meng Liu, Yue Liu, Wenxuan Tu, Yi Zhang, Liming Fang, Zhe Liu, Xinwang Liu
School of Computer, National University of Defense Technology, School of Intelligence Science and Technology, National University of Defense Technology, School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, Nanjing University of Aeronautics and Astronautics, Zhejiang Lab, School of Computer, National University of Defense Technology
Abstract:
Crime prediction is a crucial yet challenging task within urban computing, which benefits public safety and resource optimization. Over the years, various models have been proposed, and spatial-temporal hypergraph learning models have recently shown outstanding performance. However, three correlations underlying crime are ignored, thus hindering the performance of previous models. Specifically, there are two spatial correlations and one temporal correlation, i.e., (1) co-occurrence of different types of crimes (type spatial correlation), (2) the closer to the crime center, the more dangerous it is around the neighborhood area (neighbor spatial correlation), and (3) the closer between two timestamps, the more relevant events are (Hawkes temporal correlation). To this end, we propose the Hawkes-enhanced Spatial-Temporal Hypergraph Contrastive Learning framework (HCL), which mines the aforementioned correlations via two specific strategies. Concretely, contrastive learning strategies are designed for the two spatial correlations, and Hawkes process modeling is adopted for the temporal correlation. Extensive experiments demonstrate the promising capacities of HCL from four aspects, i.e., superiority, transferability, effectiveness, and sensitivity.
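The Hawkes temporal correlation above captures the idea that recent events matter more than distant ones. The sketch below shows the standard Hawkes intensity with an exponential decay kernel, which is the primitive such modeling relies on; the base rate `mu` and kernel parameters `alpha`, `beta` are illustrative, and this is not the HCL implementation.

```python
# Hedged sketch: Hawkes intensity with exponentially decaying excitation.
import numpy as np

def hawkes_intensity(t_now, event_times, mu=0.1, alpha=0.8, beta=1.0):
    """Base rate mu plus decaying excitation from past events before t_now."""
    gaps = t_now - np.asarray(event_times)
    gaps = gaps[gaps > 0]                        # only past events contribute
    return mu + alpha * np.sum(beta * np.exp(-beta * gaps))

event_times = [1.0, 3.5, 7.2, 9.8]               # timestamps of past crime events in a region
print(hawkes_intensity(10.0, event_times))       # recent events dominate the sum
```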



Paperid:966
Authors:Jiang Lin, Yaping Yan
Southeast University, Southeast University
Abstract:
Data augmentation methods are commonly integrated into the training of anomaly detection models. Previous approaches have primarily focused on replicating real-world anomalies or enhancing diversity, without considering that the standard of anomaly varies across different classes, potentially leading to a biased training distribution. This paper analyzes crucial traits of simulated anomalies that contribute to the training of reconstructive networks and condenses them into several methods, thus creating a comprehensive framework by selectively utilizing appropriate combinations. Furthermore, we integrate this framework with a reconstruction-based approach and concurrently propose a split training strategy that alleviates the overfitting issue while avoiding introducing interference to the reconstruction process. The evaluations conducted on the MVTec anomaly detection dataset demonstrate that our method outperforms the previous state-of-the-art approach, particularly in terms of object classes. We also generate a simulated dataset comprising anomalies with diverse characteristics, and experimental results demonstrate that our approach exhibits promising potential for generalizing effectively to various unseen anomalies encountered in real-world scenarios.



Paperid:967
Authors:Xinyu Lin, Wenjie Wang, Jujia Zhao, Yongqi Li, Fuli Feng, Tat-Seng Chua
National University of Singapore, National University of Singapore, National University of Singapore, The Hong Kong Polytechnic University, MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, National University of Singapore
Abstract:
Collaborative Filtering (CF) recommender models highly depend on user-item interactions to learn CF representations, thus falling short of recommending cold-start items. To address this issue, prior studies mainly introduce item features (e.g., thumbnails) for cold-start item recommendation. They learn a feature extractor on warm-start items to align feature representations with interactions, and then leverage the feature extractor to extract the feature representations of cold-start items for interaction prediction. Unfortunately, the features of cold-start items, especially the popular ones, tend to diverge from those of warm-start ones due to temporal feature shifts, preventing the feature extractor from accurately learning feature representations of cold-start items. To alleviate the impact of temporal feature shifts, we consider using Distributionally Robust Optimization (DRO) to enhance the generalization ability of the feature extractor. Nonetheless, existing DRO methods face an inconsistency issue: the worst-case warm-start items emphasized during DRO training might not align well with the cold-start item distribution. To capture the temporal feature shifts and combat this inconsistency issue, we propose a novel temporal DRO with new optimization objectives, namely, 1) to integrate a worst-case factor to improve the worst-case performance, and 2) to devise a shifting factor to capture the shifting trend of item features and enhance the optimization of the potentially popular groups in cold-start items. Substantial experiments on three real-world datasets validate the superiority of our temporal DRO in enhancing the generalization ability of cold-start recommender models.
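
The following Python sketch is only my reading of the temporal DRO idea (the grouping of items, the weighting scheme, and all names are assumptions, not the paper's exact objective): per-group losses are reweighted by a worst-case factor that emphasizes the hardest groups and a shifting factor that emphasizes groups whose features trend toward the cold-start distribution.

import torch

def temporal_dro_loss(group_losses, shift_scores, alpha=1.0, beta=1.0):
    """group_losses: (G,) mean loss per item group; shift_scores: (G,) feature-shift trend scores."""
    worst_case = torch.softmax(alpha * group_losses.detach(), dim=0)  # up-weight worst-performing groups
    shifting = torch.softmax(beta * shift_scores, dim=0)              # up-weight groups shifting toward cold-start
    weights = 0.5 * (worst_case + shifting)
    return (weights * group_losses).sum()

loss = temporal_dro_loss(torch.tensor([0.8, 0.5, 1.2]), torch.tensor([0.1, 0.3, 0.9]))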



Paperid:968
Authors:Jiajun Liu, Wenjun Ke, Peng Wang, Ziyu Shang, Jinhua Gao, Guozheng Li, Ke Ji, Yanhe Liu
School of Computer Science and Engineering, Southeast University, School of Computer Science and Engineering, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Institute of Computing Technology, Chinese Academy of Sciences, School of Computer Science and Engineering, Southeast University, School of Computer Science and Engineering, Southeast University, School of Computer Science and Engineering, Southeast University
Abstract:
Traditional knowledge graph embedding (KGE) methods typically require preserving the entire knowledge graph (KG) with significant training costs when new knowledge emerges. To address this issue, the continual knowledge graph embedding (CKGE) task has been proposed to train the KGE model by learning emerging knowledge efficiently while simultaneously preserving decent old knowledge. However, the explicit graph structure in KGs, which is critical for the above goal, has been heavily ignored by existing CKGE methods. On the one hand, existing methods usually learn new triples in a random order, destroying the inner structure of new KGs. On the other hand, old triples are preserved with equal priority, failing to alleviate catastrophic forgetting effectively. In this paper, we propose a competitive method for CKGE based on incremental distillation (IncDE), which makes full use of the explicit graph structure in KGs. First, to optimize the learning order, we introduce a hierarchical strategy, ranking new triples for layer-by-layer learning. By employing the inter- and intra-hierarchical orders together, new triples are grouped into layers based on the graph structure features. Second, to preserve the old knowledge effectively, we devise a novel incremental distillation mechanism, which facilitates the seamless transfer of entity representations from the previous layer to the next one, promoting old knowledge preservation. Finally, we adopt a two-stage training paradigm to avoid the over-corruption of old knowledge influenced by under-trained new knowledge. Experimental results demonstrate the superiority of IncDE over state-of-the-art baselines. Notably, the incremental distillation mechanism contributes to improvements of 0.2%-6.5% in the mean reciprocal rank (MRR) score. More exploratory experiments validate the effectiveness of IncDE in proficiently learning new knowledge while preserving old knowledge across all time steps.
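
The core of the incremental distillation mechanism can be illustrated with a minimal sketch (my simplification, not the released IncDE code; the names and the choice of an MSE penalty are assumptions): entity embeddings from the previous snapshot act as a frozen teacher that the current model is discouraged from drifting away from.

import torch
import torch.nn.functional as F

def distillation_loss(old_entity_emb, new_entity_emb, old_entity_ids):
    """Penalize drift of previously seen entities from their earlier representations."""
    teacher = old_entity_emb[old_entity_ids].detach()   # frozen embeddings from the previous step
    student = new_entity_emb[old_entity_ids]            # current embeddings of the same entities
    return F.mse_loss(student, teacher)

old_emb = torch.randn(1000, 200)
new_emb = torch.randn(1000, 200, requires_grad=True)
loss = distillation_loss(old_emb, new_emb, torch.arange(500))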



Paperid:969
Authors:Jing Liu, Lele Sun, Weizhi Nie, Peiguang Jing, Yuting Su
Tianjin University, Tianjin University, Tianjin University, Tianjin University, Tianjin University
Abstract:
Cross-Domain Recommendation (CDR) has been proven to effectively alleviate the data sparsity problem in Recommender Systems (RS). Recent CDR methods often disentangle user features into domain-invariant and domain-specific features for efficient cross-domain knowledge transfer. Despite showcasing robust performance, three crucial aspects remain unexplored for existing disentangled CDR approaches: i) the nuanced significance of interaction behaviors is ignored when generating disentangled features; ii) user features are disentangled without reference to the individual items to be recommended; iii) the general knowledge transfer overlooks the user's personality when interacting with diverse items. To this end, we propose a Graph Disentangled Contrastive framework for CDR (GDCCDR) with personalized transfer by meta-networks. An adaptive parameter-free filter is proposed to gauge the significance of diverse interactions, thereby facilitating more refined disentangled representations. In light of the success of Contrastive Learning (CL) in RS, we propose two CL-based constraints for item-aware disentanglement. Proximate CL ensures the coherence of domain-invariant features between domains, while eliminatory CL strives to disentangle features within each domain using mutual information between users and items. Finally, for domain-invariant features, we adopt meta-networks to achieve personalized transfer. Experimental results on four real-world datasets demonstrate the superiority of GDCCDR over state-of-the-art methods.



Paperid:970
Authors:Jintao Liu, Kaiwen Wei, Chenglong Liu
University of Chinese Academy of Sciences, University of Chinese Academy of Sciences, University of Chinese Academy of Sciences
Abstract:
Multimodal event causality reasoning aims to recognize the causal relations based on the given events and accompanying image pairs, requiring the model to have a comprehensive grasp of visual and textual information. However, existing studies fail to effectively model the relations of the objects within the image and capture the object interactions across the image pair, resulting in an insufficient understanding of visual information by the model. To address these issues, we propose a Scene Graph Enhanced Interaction Network (SEIN) in this paper, which can leverage the interactions of the generated scene graph for multimodal event causality reasoning. Specifically, the proposed method adopts a graph convolutional network to model the objects and their relations derived from the scene graph structure, empowering the model to exploit the rich structural and semantic information in the image adequately. To capture the object interactions between the two images, we design an optimal transport-based alignment strategy to match the objects across the images, which could help the model recognize changes in visual information and facilitate causality reasoning. In addition, we introduce a cross-modal fusion module to combine textual and visual features for causality prediction. Experimental results indicate that the proposed SEIN outperforms state-of-the-art methods on the Vis-Causal dataset.
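
For intuition, an optimal transport alignment between the object sets of two images can be sketched with entropy-regularized Sinkhorn iterations as below (an illustrative sketch under my assumptions of a cosine cost and uniform marginals; SEIN's actual cost and regularization may differ).

import torch
import torch.nn.functional as F

def sinkhorn_alignment(obj_a, obj_b, eps=0.1, n_iters=50):
    """obj_a: (n, d), obj_b: (m, d) object features; returns an (n, m) soft transport plan."""
    cost = 1.0 - F.normalize(obj_a, dim=-1) @ F.normalize(obj_b, dim=-1).t()
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    u = torch.full((obj_a.size(0),), 1.0 / obj_a.size(0))
    v = torch.full((obj_b.size(0),), 1.0 / obj_b.size(0))
    a, b = torch.ones_like(u), torch.ones_like(v)
    for _ in range(n_iters):                         # Sinkhorn-Knopp scaling
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)       # soft object-to-object matching

plan = sinkhorn_alignment(torch.randn(5, 256), torch.randn(7, 256))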



Paperid:971
Authors:Qi Liu, Xuyang Hou, Defu Lian, Zhe Wang, Haoran Jin, Jia Cheng, Jun Lei
University of Science and Technology of China, Meituan, University of Science and Technology of China, Meituan, University of Science and Technology of China, Meituan, Meituan
Abstract:
Click-through rate (CTR) prediction is a vital task in industrial recommendation systems. Most existing methods focus on the network architecture design of the CTR model for better accuracy and suffer from the data sparsity problem. Especially in industrial recommendation systems, the widely applied negative sample down-sampling technique, adopted due to resource limitations, worsens the problem, resulting in a decline in performance. In this paper, we propose Auxiliary Match Tasks for enhancing Click-Through Rate (AT4CTR) prediction accuracy by alleviating the data sparsity problem. Specifically, we design two match tasks inspired by collaborative filtering to enhance the relevance modeling between user and item. As the "click" action is a strong signal that directly indicates the user's preference for the item, the first match task aims to pull closer the representations of the user and the item for positive samples. Since the user's past click behaviors can also be treated as a representation of the user, we apply next-item prediction as the second match task. For both match tasks, we adopt InfoNCE as the loss function. The two match tasks can provide meaningful training signals to speed up the model's convergence and alleviate the data sparsity. We conduct extensive experiments on one public dataset and one large-scale industrial recommendation dataset. The results demonstrate the effectiveness of the proposed auxiliary match tasks. AT4CTR has been deployed in the real industrial advertising system and has gained remarkable revenue.
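
Since the abstract names InfoNCE explicitly, the first match task can be sketched as below (a minimal illustration with my own tensor shapes and names, not the production AT4CTR code): clicked user-item pairs in a batch are positives and the other items in the batch serve as in-batch negatives; the auxiliary loss would then be added to the main CTR loss with a weighting coefficient.

import torch
import torch.nn.functional as F

def user_item_infonce(user_repr, item_repr, temperature=0.07):
    """user_repr, item_repr: (batch, dim); row i corresponds to a clicked (user, item) pair."""
    u = F.normalize(user_repr, dim=-1)
    i = F.normalize(item_repr, dim=-1)
    logits = u @ i.t() / temperature
    labels = torch.arange(u.size(0))                 # the matching item for user i sits on the diagonal
    return F.cross_entropy(logits, labels)

aux_loss = user_item_infonce(torch.randn(32, 64), torch.randn(32, 64))
# total_loss = ctr_loss + lambda_aux * aux_loss      (weighting is a hypothetical choice)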



Paperid:972
Authors:Qiming Liu, Xiang Ao, Yuyao Guo, Qing He
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China. University of Chinese Academy of Sciences, Beijing 100049, China., Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China. University of Chinese Academy of Sciences, Beijing 100049, China. Institute of Intelligent Computing Technology, Suzhou., Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China. University of Chinese Academy of Sciences, Beijing 100049, China., Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China. University of Chinese Academy of Sciences, Beijing 100049, China.
Abstract:
Due to the widespread adoption of the cost-per-action (CPA) display strategy that demands real-time conversion rate (CVR) prediction, delayed feedback is becoming one of the major challenges in online advertising. As the true labels of a significant quantity of samples are only available after long delays, the observed training data are usually biased, harming the performance of models. Recent studies show that integrating models with varying waiting windows to observe true labels is beneficial, but the aggregation framework remains far from reaching a consensus. In this work, we propose the Multi-Interval Screening and Synthesizing model (MISS for short) for online CVR prediction. We first design a multi-interval screening model with various output heads to produce accurate and distinctive estimates. Then a lightweight synthesizing model with an assembled training pipeline is applied to thoroughly exploit the knowledge and relationship among heads, obtaining reliable predictions. Extensive experiments on two real-world advertising datasets validate the effectiveness of our model.



Paperid:973
Authors:Ruoqi Liu, Lingfei Wu, Ping Zhang
The Ohio State University, Anytime.AI, The Ohio State University
Abstract:
Treatment effect estimation (TEE) is the task of determining the impact of various treatments on patient outcomes. Current TEE methods fall short due to reliance on limited labeled data and challenges posed by sparse and high-dimensional observational patient data. To address the challenges, we introduce a novel pre-training and fine-tuning framework, KG-TREAT, which synergizes large-scale observational patient data with biomedical knowledge graphs (KGs) to enhance TEE. Unlike previous approaches, KG-TREAT constructs dual-focus KGs and integrates a deep bi-level attention synergy method for in-depth information fusion, enabling distinct encoding of treatment-covariate and outcome-covariate relationships. KG-TREAT also incorporates two pre-training tasks to ensure a thorough grounding and contextualization of patient data and KGs. Evaluation on four downstream TEE tasks shows KG-TREAT’s superiority over existing methods, with an average improvement of 7% in Area under the ROC Curve (AUC) and 9% in Influence Function-based Precision of Estimating Heterogeneous Effects (IF-PEHE). The effectiveness of our estimated treatment effects is further affirmed by alignment with established randomized clinical trial findings.



Paperid:974
Authors:Weiming Liu, Chaochao Chen, Xinting Liao, Mengling Hu, Yanchao Tan, Fan Wang, Xiaolin Zheng, Yew Soon Ong
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Fuzhou University, Zhejiang University, Zhejiang University, Nanyang Technological University, Nanyang View, Singapore
Abstract:
With the rapid development of Internet and Web techniques, Cross-Domain Recommendation (CDR) models have been widely explored for resolving the data-sparsity and cold-start problems. Meanwhile, most CDR models require explicit domain-shareable information (e.g., overlapped users or items) for knowledge transfer across domains. However, this assumption may not always be satisfied, since users and items are often non-overlapped in real practice. The performance of many previous works is severely impaired when such domain-shareable information is not available. To address the aforementioned issues, we propose the Joint Preference Exploration and Dynamic Embedding Transportation model (JPEDET) in this paper, which is a novel framework for solving the CDR problem when users and items are non-overlapped. JPEDET includes two main modules, i.e., a joint preference exploration module and a dynamic embedding transportation module. The joint preference exploration module aims to fuse rating and review information for modelling user preferences. The dynamic embedding transportation module is set to share knowledge via neural ordinary differential equations for dual transformation across domains. Moreover, we innovatively propose the dynamic transport flow equipped with linear interpolation guidance on the barycentric Wasserstein path for achieving accurate and bidirectional transformation. Our empirical study on Amazon datasets demonstrates that JPEDET significantly outperforms the state-of-the-art models under the CDR setting.



Paperid:975
Authors:Xiangyu Liu, Yang Liu, Wei Hu
Nanjing University, Nanjing University, Nanjing University
Abstract:
Knowledge graphs (KGs) often contain various errors. Previous works on detecting errors in KGs mainly rely on triplet embedding from graph structure. We conduct an empirical study and find that these works struggle to discriminate noise from semantically-similar correct triplets. In this paper, we propose a KG error detection model CCA to integrate both textual and graph structural information from triplet reconstruction for better distinguishing semantics. We design interactive contrastive learning to capture the differences between textual and structural patterns. Furthermore, we construct realistic datasets with semantically-similar noise and adversarial noise. Experimental results demonstrate that CCA outperforms state-of-the-art baselines, especially on semantically-similar noise and adversarial noise.



Paperid:976
Authors:Yu-An Liu, Ruqing Zhang, Mingkun Zhang, Wei Chen, Maarten de Rijke, Jiafeng Guo, Xueqi Cheng
CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, University of Amsterdam, Amsterdam, The Netherlands, CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China
Abstract:
Neural ranking models (NRMs) have shown great success in information retrieval (IR). But their predictions can easily be manipulated using adversarial examples, which are crafted by adding imperceptible perturbations to legitimate documents. This vulnerability raises significant concerns about their reliability and hinders the widespread deployment of NRMs. By incorporating adversarial examples into training data, adversarial training has become the de facto defense approach to adversarial attacks against NRMs. However, this defense mechanism is subject to a trade-off between effectiveness and adversarial robustness. In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs. We decompose the robust ranking error into two components, i.e., a natural ranking error for effectiveness evaluation and a boundary ranking error for assessing adversarial robustness. Then, we define the perturbation invariance of a ranking model and prove it to be a differentiable upper bound on the boundary ranking error for attainable computation. Informed by our theoretical analysis, we design a novel perturbation-invariant adversarial training (PIAT) method for ranking models to achieve a better effectiveness-robustness trade-off. We design a regularized surrogate loss, in which one term encourages the effectiveness to be maximized while the regularization term encourages the output to be smooth, so as to improve adversarial robustness. Experimental results on several ranking models demonstrate the superiority of PIAT compared to existing adversarial defenses.
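
A hedged sketch of what such a regularized surrogate loss could look like is given below (my own construction for illustration; the concrete effectiveness term and smoothness regularizer in PIAT may differ): a listwise ranking loss on clean documents plus a divergence term that keeps the score distribution of adversarially perturbed documents close to the clean one.

import torch
import torch.nn.functional as F

def regularized_ranking_loss(clean_scores, perturbed_scores, relevance, reg_weight=1.0):
    """clean_scores, perturbed_scores: (n_docs,) model scores; relevance: (n_docs,) graded labels."""
    # effectiveness term: ListNet-style cross-entropy on the clean ranking distribution
    effectiveness = -(F.softmax(relevance, dim=0) * F.log_softmax(clean_scores, dim=0)).sum()
    # smoothness term: discourage the perturbed ranking distribution from drifting away
    smoothness = F.kl_div(F.log_softmax(perturbed_scores, dim=0),
                          F.softmax(clean_scores, dim=0), reduction="sum")
    return effectiveness + reg_weight * smoothness

loss = regularized_ranking_loss(torch.randn(10), torch.randn(10), torch.rand(10))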



Paperid:977
Authors:Zehua Liu, Zimeng Li, Jingyuan Wang, Yue He
School of Computer Science and Engineering, Beihang University, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China School of Economics and Management, Beihang University, Beijing, China Key Laboratory of Data Intelligence and Management (Beihang University), Ministry of Industry and Information Technology, Beijing, China, Department of Computer Science and Technology, Tsinghua University, Beijing, China
Abstract:
Significance testing aims to determine whether a proposition about the population distribution is true, given the observations. However, traditional significance testing often needs to derive the distribution of the testing statistic, failing to deal with complex nonlinear relationships. In this paper, we propose to conduct Full Bayesian Significance Testing for neural networks, called nFBST, to overcome the limitation in relationship characterization of traditional approaches. A Bayesian neural network is utilized to fit the nonlinear and multidimensional relationships with small errors and avoid hard theoretical derivation by computing the evidence value. Besides, nFBST can test not only global significance but also local and instance-wise significance, which previous testing methods do not address. Moreover, nFBST is a general framework that can be extended based on the measures selected, such as Grad-nFBST, LRP-nFBST, DeepLIFT-nFBST, and LIME-nFBST. A range of experiments on both simulated and real data is conducted to show the advantages of our method.



Paperid:978
Authors:Xiao Long, Liansheng Zhuang, Aodi Li, Jiuchang Wei, Houqiang Li, Shafei Wang
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, Peng Cheng Laboratory
Abstract:
Knowledge graph embedding (KGE) is an efficient and scalable method for knowledge graph completion. However, most existing KGE methods suffer from the challenge of multiple relation semantics, which often degrades their performance. This is because most KGE methods learn fixed continuous vectors for entities (relations) and make deterministic entity predictions to complete the knowledge graph, which hardly captures multiple relation semantics. To tackle this issue, previous works try to learn complex probabilistic embeddings instead of fixed embeddings but suffer from heavy computational complexity. In contrast, this paper proposes a simple yet efficient framework, namely the Knowledge Graph Diffusion Model (KGDM), to capture multiple relation semantics in prediction. Its key idea is to cast the problem of entity prediction into conditional entity generation. Specifically, KGDM estimates the probabilistic distribution of target entities in prediction through Denoising Diffusion Probabilistic Models (DDPM). To bridge the gap between continuous diffusion models and discrete KGs, two learnable embedding functions are defined to map entities and relations to continuous vectors. To consider connectivity patterns of KGs, a Conditional Entity Denoiser model is introduced to generate target entities conditioned on given entities and relations. Extensive experiments demonstrate that KGDM significantly outperforms existing state-of-the-art methods on three benchmark datasets.



Paperid:979
Authors:Ming Lu, Zhihao Duan, Fengqing Zhu, Zhan Ma
School of Electronic Science and Engineering, Nanjing University Interdisciplinary Research Center for Future Intelligent Chips (Chip-X), Nanjing University, Elmore Family School of Electrical and Computer Engineering, Purdue University, Elmore Family School of Electrical and Computer Engineering, Purdue University, School of Electronic Science and Engineering, Nanjing University
Abstract:
Recently, probabilistic predictive coding that directly models the conditional distribution of latent features across successive frames for temporal redundancy removal has yielded promising results. Existing methods using a single-scale Variational AutoEncoder (VAE) must devise complex networks for conditional probability estimation in latent space, neglecting the multiscale characteristics of video frames. Instead, this work proposes hierarchical probabilistic predictive coding, for which hierarchical VAEs are carefully designed to characterize multiscale latent features as a family of flexible priors and posteriors to predict the probabilities of future frames. Under such a hierarchical structure, lightweight networks are sufficient for prediction. The proposed method outperforms representative learned video compression models on common testing videos and demonstrates computational friendliness with a much smaller memory footprint and faster encoding/decoding. Extensive experiments on adaptation to temporal patterns also indicate the better generalization of our hierarchical predictive mechanism. Furthermore, our solution is the first to enable progressive decoding, which is favored in networked video applications with packet loss.



Paperid:980
Authors:Haitong Luo, Xuying Meng, Suhang Wang, Hanyun Cao, Weiyao Zhang, Yequan Wang, Yujun Zhang
Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences Purple Mountain Laboratories, Pennsylvania State University, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, BAAI, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences Nanjing Institute of InforSuperBahn
Abstract:
Modeling complementary relationships greatly helps recommender systems to accurately and promptly recommend the subsequent items when one item is purchased. Unlike traditional similar relationships, items with complementary relationships may be purchased successively (such as iPhone and AirPods Pro), and they not only share relevance but also exhibit dissimilarity. Since the two attributes are opposites, modeling complementary relationships is challenging. Previous attempts to exploit these relationships have either ignored or oversimplified the dissimilarity attribute, resulting in ineffective modeling and an inability to balance the two attributes. Since Graph Neural Networks (GNNs) can capture the relevance and dissimilarity between nodes in the spectral domain, we can leverage spectral-based GNNs to effectively understand and model complementary relationships. In this study, we present a novel approach called Spectral-based Complementary Graph Neural Networks (SComGNN) that utilizes the spectral properties of complementary item graphs. We make the first observation that complementary relationships consist of low-frequency and mid-frequency components, corresponding to the relevance and dissimilarity attributes, respectively. Based on this spectral observation, we design spectral graph convolutional networks with low-pass and mid-pass filters to capture the low-frequency and mid-frequency components. Additionally, we propose a two-stage attention mechanism to adaptively integrate and balance the two attributes. Experimental results on four e-commerce datasets demonstrate the effectiveness of our model, with SComGNN significantly outperforming existing baseline models.
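
The low-pass/mid-pass intuition can be made concrete with a toy spectral-filter sketch (my own construction; the exact polynomial filters in SComGNN may differ): on the normalized Laplacian, whose eigenvalues lie in [0, 2], the response 1 - lambda/2 keeps low frequencies (relevance) while lambda - lambda^2/2 peaks at mid frequencies (dissimilarity).

import numpy as np

def spectral_filters(adj, features):
    """adj: (n, n) adjacency of the complementary-item graph; features: (n, d) item features."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt        # eigenvalues in [0, 2]
    low_pass = (np.eye(adj.shape[0]) - 0.5 * lap) @ features          # response 1 - lambda/2
    mid_pass = (lap - 0.5 * lap @ lap) @ features                     # response lambda - lambda^2/2
    return low_pass, mid_pass

adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
low, mid = spectral_filters(adj, np.random.randn(3, 8))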



Paperid:981
Authors:Haiping Ma, Changqian Wang, Hengshu Zhu, Shangshang Yang, Xiaoming Zhang, Xingyi Zhang
Department of Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, China, Department of Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, China, Career Science Lab, BOSS Zhipin, School of Artificial Intelligence, Anhui University, China, Department of Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, China, School of Computer Science and Technology, Anhui University, China
Abstract:
Cognitive diagnosis is a crucial task in computer-aided education, aimed at evaluating students' proficiency levels across various knowledge concepts through exercises. Current models, however, primarily rely on students' answered exercises, neglecting the complex and rich information contained in un-interacted exercises. While recent research has attempted to leverage the data within un-interacted exercises linked to interacted knowledge concepts, aiming to address the long-tail issue, these studies fail to fully explore the informative, un-interacted exercises related to broader knowledge concepts. This oversight results in diminished performance when these models are applied to comprehensive datasets. In response to this gap, we present the Collaborative-aware Mixed Exercise Sampling (CMES) framework, which can effectively exploit the information present in un-interacted exercises linked to un-interacted knowledge concepts. Specifically, we introduce a novel universal sampling module where the training samples comprise not merely raw data slices, but enhanced samples generated by combining weight-enhanced attention mixture techniques. Given the necessity of real response labels in cognitive diagnosis, we also propose a ranking-based pseudo feedback module to regulate students' responses on generated exercises. The versatility of the CMES framework bolsters existing models and improves their adaptability. Finally, we demonstrate the effectiveness and interpretability of our framework through comprehensive experiments on real-world datasets.



Paperid:982
Authors:Haokai Ma, Ruobing Xie, Lei Meng, Xin Chen, Xu Zhang, Leyu Lin, Zhanhui Kang
School of Software, Shandong University, China, Tencent, China, Shandong Research Institute of Industrial Technology, China School of Software, Shandong University, China, Tencent, China, Tencent, China, Tencent, China, Tencent, China
Abstract:
Pioneering efforts have verified the effectiveness of diffusion models in exploring the informative uncertainty for recommendation. Considering the difference between recommendation and image synthesis tasks, existing methods have undertaken tailored refinements to the diffusion and reverse process. However, these approaches typically use the highest-score item in the corpus for user interest prediction, leading to the ignorance of the user's generalized preference contained within other items, thereby remaining constrained by the data sparsity issue. To address this issue, this paper presents a novel Plug-in Diffusion Model for Recommendation (PDRec) framework, which employs the diffusion model as a flexible plugin to jointly take full advantage of the diffusion-generated user preferences on all items. Specifically, PDRec first infers the users' dynamic preferences on all items via a time-interval diffusion model and proposes a Historical Behavior Reweighting (HBR) mechanism to identify the high-quality behaviors and suppress noisy behaviors. In addition to the observed items, PDRec proposes a Diffusion-based Positive Augmentation (DPA) strategy to leverage the top-ranked unobserved items as the potential positive samples, bringing in informative and diverse soft signals to alleviate data sparsity. To alleviate the false negative sampling issue, PDRec employs Noise-free Negative Sampling (NNS) to select stable negative samples for ensuring effective model optimization. Extensive experiments and analyses on four datasets have verified the superiority of the proposed PDRec over the state-of-the-art baselines and showcased the universality of PDRec as a flexible plugin for commonly-used sequential encoders in different recommendation scenarios. The code is available in https://github.com/hulkima/PDRec.



Paperid:983
Authors:Yijun Ma, Chaozhuo Li, Xiao Zhou
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing University of Posts and Telecommunications, Gaoling School of Artificial Intelligence, Renmin University of China
Abstract:
Graph neural networks (GNNs) are commonly employed in collaborative friend recommendation systems. Nevertheless, recent studies reveal a notable performance gap, particularly for users with limited connections, commonly known as tail users, in contrast to their counterparts with abundant connections (head users). Uniformly treating head and tail users poses two challenges for tail user preference learning: (C1) Label Sparsity, as tail users typically possess limited labels; and (C2) Neighborhood Sparsity, where tail users exhibit sparse observable friendships, leading to distinct preference distributions and performance degradation compared to head users. In response to these challenges, we introduce Tail-STEAK, an innovative framework that combines self-training with enhanced knowledge distillation for tail user representation learning. To address (C1), we present Tail-STEAK-base, a two-stage self-training framework. In the first stage, only head users and their accurate connections are utilized for training, while pseudo links are generated for tail users in the second stage. To tackle (C2), we propose two data augmentation-based self-knowledge distillation pretext tasks. These tasks are seamlessly integrated into different stages of Tail-STEAK-base, culminating in the comprehensive Tail-STEAK framework. Extensive experiments, conducted on state-of-the-art GNN-based friend recommendation models, substantiate the efficacy of Tail-STEAK in significantly improving tail user performance. Our code and data are publicly available at https://github.com/antman9914/Tail-STEAK.



Paperid:984
Authors:Yanhu Mo, Xiao Wang, Shaohua Fan, Chuan Shi
Beijing University of Posts and Telecommunications, Beihang University, Tsinghua University Key Laboratory of Big Data Artificial Intelligence in Transportation, Ministry of Education (Beijing Jiaotong University), Beijing University of Posts and Telecommunications
Abstract:
Graph contrastive learning (GCL), learning the node representation by contrasting two augmented graphs in a self-supervised way, has attracted considerable attention. GCL is usually believed to learn the invariant representation. However, does this understanding always hold in practice? In this paper, we first study GCL from the perspective of causality. By analyzing GCL with the structural causal model (SCM), we discover that traditional GCL may not well learn the invariant representations due to the non-causal information contained in the graph. How can we fix it and encourage the current GCL to learn better invariant representations? The SCM offers two requirements and motivates us to propose a novel GCL method. Particularly, we introduce the spectral graph augmentation to simulate the intervention upon non-causal factors. Then we design the invariance objective and independence objective to better capture the causal factors. Specifically, (i) the invariance objective encourages the encoder to capture the invariant information contained in causal variables, and (ii) the independence objective aims to reduce the influence of confounders on the causal variables. Experimental results demonstrate the effectiveness of our approach on node classification tasks.



Paperid:985
Authors:Jiaxin Pan, Mojtaba Nayyeri, Yinan Li, Steffen Staab
University of Stuttgart, Stuttgart, Germany, University of Stuttgart, Stuttgart, Germany, University of Stuttgart, Stuttgart, Germany, University of Stuttgart, Stuttgart, Germany University of Southampton, Southampton, United Kingdom
Abstract:
Temporal knowledge graphs represent temporal facts (s, p, o, τ) relating a subject s and an object o via a relation label p at time τ, where τ could be a time point or a time interval. Temporal knowledge graphs may exhibit static temporal patterns at distinct points in time and dynamic temporal patterns between different timestamps. In order to learn a rich set of static and dynamic temporal patterns and apply them for inference, several embedding approaches have been suggested in the literature. However, as most of them resort to single underlying embedding spaces, their capability to model all kinds of temporal patterns was severely limited by having to adhere to the geometric property of their one embedding space. We lift this limitation by an embedding approach that maps temporal facts into a product space of several heterogeneous geometric subspaces with distinct geometric properties, i.e., Complex, Dual, and Split-complex spaces. In addition, we propose a temporal-geometric attention mechanism to integrate information from different geometric subspaces conveniently according to the captured relational and temporal information. Experimental results on standard temporal benchmark datasets favorably evaluate our approach against state-of-the-art models.



Paperid:986
Authors:Furong Peng, Jiachen Luo, Xuan Lu, Sheng Wang, Feijiang Li
Institute of Big Data Science and Industry, Shanxi University School of Computer and Information Technology, Shanxi University, Institute of Big Data Science and Industry, Shanxi University School of Computer and Information Technology, Shanxi University, College of Physics and Electronic Engineering, Shanxi University, School of Automation, Zhengzhou University of Aeronautics, Institute of Big Data Science and Industry, Shanxi University School of Computer and Information Technology, Shanxi University
Abstract:
Most deep learning-based time series clustering models concentrate on data representation in a process separate from clustering. As a result, the clustering loss cannot guide feature extraction. Moreover, most methods solely analyze data from the temporal domain, disregarding the potential within the frequency domain. To address these challenges, we introduce a novel end-to-end Cross-Domain Contrastive learning model for time series Clustering (CDCC). Firstly, it integrates the clustering process and feature extraction using contrastive constraints at both the cluster level and the instance level. Secondly, the data is encoded simultaneously in both the temporal and frequency domains, leveraging contrastive learning to enhance within-domain representation. Thirdly, cross-domain constraints are proposed to align the latent representations and category distribution across domains. With the above strategies, CDCC not only operates end to end but also effectively integrates the temporal and frequency domains. Extensive experiments and visualization analysis are conducted on 40 time series datasets from UCR, demonstrating the superior performance of the proposed model.
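
A minimal sketch of the dual-domain view construction (assumptions mine; CDCC's encoders and augmentations are not shown) is given below: each series is represented in the temporal domain and, via its FFT magnitude spectrum, in the frequency domain, and the two views of the same series would form a positive pair for the cross-domain contrastive constraints.

import numpy as np

def two_domain_views(series):
    """series: (length,) univariate time series -> (temporal_view, frequency_view)."""
    temporal = (series - series.mean()) / (series.std() + 1e-8)   # standardized temporal view
    frequency = np.abs(np.fft.rfft(temporal))                     # magnitude spectrum as frequency view
    return temporal, frequency

t_view, f_view = two_domain_views(np.sin(np.linspace(0, 8 * np.pi, 128)))
# each view would be fed to its own encoder and aligned contrastively across domains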



Paperid:987
Authors:Chenyang Qiu, Guoshun Nan, Tianyu Xiong, Wendi Deng, Di Wang, Zhiyang Teng, Lijuan Sun, Qimei Cui, Xiaofeng Tao
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Nanyang Technological University, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Graph convolution networks (GCNs) are extensively utilized in various graph tasks to mine knowledge from spatial data. Our study marks the pioneering attempt to quantitatively investigate the GCN robustness over omnipresent heterophilic graphs for node classification. We uncover that the predominant vulnerability is caused by the structural out-of-distribution (OOD) issue. This finding motivates us to present a novel method that aims to harden GCNs by automatically learning Latent Homophilic Structures over heterophilic graphs. We term such a methodology as LHS. To elaborate, our initial step involves learning a latent structure by employing a novel self-expressive technique based on multi-node interactions. Subsequently, the structure is refined using a pairwise constrained dual-view contrastive learning approach. We iteratively perform the above procedure, enabling a GCN model to aggregate information in a homophilic way on heterophilic graphs. Armed with such an adaptable structure, we can properly mitigate the structural OOD threats over heterophilic graphs. Experiments on various benchmarks show the effectiveness of the proposed LHS approach for robust GCNs.



Paperid:988
Authors:Guojing Ren, Xiao Ding, Xiao-Ke Xu, Hai-Feng Zhang
Institutes of Physical Science and Information Technology, Anhui University, Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Mathematical Science, Anhui University, Computational Communication Research Center and School of Journalism and Communication, Beijing Normal University, Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Mathematical Science, Anhui University
Abstract:
Link prediction is a fundamental task in network analysis, with the objective of predicting missing or potential links. While existing studies have mainly concentrated on single networks, it is worth noting that numerous real-world networks exhibit interconnectedness. For example, individuals often register on various social media platforms to access diverse services, such as chatting, tweeting, blogging, and rating movies. These platforms share a subset of users and are termed multilayer networks. The interlayer links in such networks hold valuable information that provides more comprehensive insights into the network structure. To effectively exploit this complementary information and enhance link prediction in the target network, we propose a novel cross-network embedding method. This method aims to represent different networks in a shared latent space, preserving proximity within single networks as well as consistency across multilayer networks. Specifically, nodes can aggregate messages from aligned nodes in other layers. Extensive experiments conducted on real-world datasets demonstrate the superior performance of our proposed method for link prediction in multilayer networks.



Paperid:989
Authors:Jihyeon Seong, Jungmin Kim, Jaesik Choi
KAIST, KAIST, KAIST INEEJI
Abstract:
In Time Series Classification (TSC), temporal pooling methods that consider sequential information have been proposed. However, we found that each temporal pooling has a distinct mechanism and can perform better or worse depending on the time series data. We term this fixed pooling mechanism a single perspective of temporal poolings. In this paper, we propose a novel temporal pooling method with diverse perspective learning: Selection over Multiple Temporal Poolings (SoM-TP). SoM-TP dynamically selects the optimal temporal pooling among multiple methods for each data instance through attention. The dynamic pooling selection is motivated by the ensemble concept of Multiple Choice Learning (MCL), which selects the best among multiple outputs. The pooling selection by SoM-TP's attention enables a non-iterative pooling ensemble within a single classifier. Additionally, we define a perspective loss and a Diverse Perspective Learning Network (DPLN). The loss works as a regularizer to reflect all the pooling perspectives from DPLN. Our perspective analysis using Layer-wise Relevance Propagation (LRP) reveals the limitation of a single perspective and ultimately demonstrates the diverse perspective learning of SoM-TP. We also show that SoM-TP outperforms CNN models based on other temporal poolings and state-of-the-art models in TSC with extensive UCR/UEA repositories.
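
The selection idea can be sketched as follows (an illustrative simplification, not the official SoM-TP code; the candidate poolings and the attention input are my assumptions): several temporal poolings are computed for every series, and an attention vector derived from the series softly selects among them.

import torch
import torch.nn.functional as F

def pooling_selection(features, attn_layer):
    """features: (batch, time, dim); attn_layer: nn.Linear(dim, 3) scoring the three poolings."""
    max_pool = features.max(dim=1).values
    avg_pool = features.mean(dim=1)
    last_pool = features[:, -1, :]
    candidates = torch.stack([max_pool, avg_pool, last_pool], dim=1)   # (batch, 3, dim)
    attn = F.softmax(attn_layer(features.mean(dim=1)), dim=-1)         # (batch, 3) selection weights
    return (attn.unsqueeze(-1) * candidates).sum(dim=1)                # per-sample selected pooling

pooled = pooling_selection(torch.randn(8, 50, 32), torch.nn.Linear(32, 3))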



Paperid:990
Authors:Bin Shang, Yinliang Zhao, Jun Liu, Di Wang
Xi’an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, School of Computer Science and Technology, Xidian University
Abstract:
Recently, an enormous amount of research has emerged on multimodal knowledge graph completion (MKGC), which seeks to extract knowledge from multimodal data and predict the most plausible missing facts to complete a given multimodal knowledge graph (MKG). However, existing MKGC approaches largely ignore that visual information may introduce noise and lead to uncertainty when added to traditional KG embeddings, because the contribution of each associated image to an entity differs across link scenarios. Moreover, treating each triple independently when learning entity embeddings leads to the loss of local structural information and whole-graph information. To address these challenges, we propose a novel link-aware fusion and aggregation based multimodal knowledge graph completion model named LAFA, which is composed of a link-aware fusion module and a link-aware aggregation module. The link-aware fusion module alleviates the noise of irrelevant visual information by calculating the importance between an entity and its associated images in different link scenarios, and fuses the visual and structural embeddings according to the importance through our proposed modality embedding fusion mechanism. The link-aware aggregation module assigns neighbor structural information to a given central entity by calculating the importance between the entity and its neighbors, and aggregating the fused embeddings through linear combination according to the importance. Extensive experiments on standard datasets validate that LAFA can obtain state-of-the-art performance.



Paperid:991
Authors:Bin Shang, Yinliang Zhao, Jun Liu, Di Wang
Xi’an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, School of Computer Science and Technology, Xidian University
Abstract:
Knowledge graph completion (KGC) aims to study the embedding representation to solve the incompleteness of knowledge graphs (KGs). Recently, graph convolutional networks (GCNs) and graph attention networks (GATs) have been widely used in KGC tasks by capturing neighbor information of entities. However, both GCN- and GAT-based KGC models have their limitations, and the best remedy is to analyze the neighbors of each entity beforehand (pre-validating), while this process is prohibitively expensive. Furthermore, the representation quality of the embeddings can affect the aggregation of neighbor information (message passing). To address the above limitations, we propose a novel knowledge graph completion model with mixed geometry message and trainable convolutional attention network, named MGTCA. Concretely, the mixed geometry message function generates rich neighbor messages by jointly integrating spatial information from the hyperbolic, hyperspherical, and Euclidean spaces. To complete the autonomous switching of graph neural networks (GNNs) and eliminate the necessity of pre-validating the local structure of KGs, a trainable convolutional attention network is proposed, comprising three types of GNNs in one trainable formulation. Furthermore, a mixed geometry scoring function is proposed, which calculates the scores of triples through a novel prediction function and a similarity function based on different geometric spaces. Extensive experiments on three standard datasets confirm the effectiveness of our innovations, and the performance of MGTCA is significantly improved compared to the state-of-the-art approaches.



Paperid:992
Authors:Shuyao Shang, Zhengyang Shan, Guangxing Liu, LunQian Wang, XingHua Wang, Zekai Zhang, Jinglin Zhang
Shandong University, Shandong University, Shandong University, Linyi University, Linyi University, Qilu University of Technology, Shandong University
Abstract:
Adapting the Diffusion Probabilistic Model (DPM) for direct image super-resolution is wasteful, given that a simple Convolutional Neural Network (CNN) can recover the main low-frequency content. Therefore, we present ResDiff, a novel Diffusion Probabilistic Model based on a residual structure for Single Image Super-Resolution (SISR). ResDiff utilizes a combination of a CNN, which restores primary low-frequency components, and a DPM, which predicts the residual between the ground-truth image and the CNN-predicted image. In contrast to the common diffusion-based methods that directly use the LR space to guide the noise towards the HR space, ResDiff utilizes the CNN’s initial prediction to direct the noise towards the residual space between the HR space and the CNN-predicted space, which not only accelerates the generation process but also acquires superior sample quality. Additionally, a frequency-domain-based loss function for the CNN is introduced to facilitate its restoration, and a frequency-domain guided diffusion is designed for the DPM to predict high-frequency details. The extensive experiments on multiple benchmark datasets demonstrate that ResDiff outperforms previous diffusion-based methods in terms of shorter model convergence time, superior generation quality, and more diverse samples.



Paperid:993
Authors:Yehjin Shin, Jeongwhan Choi, Hyowon Wi, Noseong Park
Yonsei University, Yonsei University, Yonsei University, Yonsei University
Abstract:
Sequential recommendation (SR) models based on Transformers have achieved remarkable successes. The self-attention mechanism of Transformers for computer vision and natural language processing suffers from the oversmoothing problem, i.e., hidden representations of tokens becoming similar to one another. In the SR domain, we, for the first time, show that the same problem occurs. We present pioneering investigations that reveal the low-pass filtering nature of self-attention in SR, which causes oversmoothing. To this end, we propose a novel method called Beyond Self-Attention for Sequential Recommendation (BSARec), which leverages the Fourier transform to i) inject an inductive bias by considering fine-grained sequential patterns and ii) integrate low- and high-frequency information to mitigate oversmoothing. Our discovery shows significant advancements in the SR domain and is expected to bridge the gap for existing Transformer-based SR models. We test our proposed approach through extensive experiments on 6 benchmark datasets. The experimental results demonstrate that our model outperforms 7 baseline methods in terms of recommendation performance. Our code is available at https://github.com/yehjin-shin/BSARec.
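
The frequency-domain intuition behind mitigating oversmoothing can be sketched as below (the cutoff and gain are hypothetical choices of mine, and BSARec's actual filter is learned rather than fixed): the hidden sequence is split into low- and high-frequency parts with the FFT, and the high-frequency residual is re-amplified to counteract the low-pass behaviour of self-attention.

import torch

def frequency_rebalance(hidden, cutoff=4, high_gain=1.5):
    """hidden: (batch, seq_len, dim); keep low frequencies and amplify the high-frequency residual."""
    spec = torch.fft.rfft(hidden, dim=1)
    low_spec = spec.clone()
    low_spec[:, cutoff:, :] = 0                                   # zero out high-frequency bins
    low = torch.fft.irfft(low_spec, n=hidden.size(1), dim=1)      # low-pass reconstruction
    high = hidden - low                                           # high-frequency residual
    return low + high_gain * high

out = frequency_rebalance(torch.randn(4, 20, 64))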



Paperid:994
Authors:Zixing Song, Ziqiao Meng, Irwin King
The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong
Abstract:
Many significant problems involving crystal property prediction from 3D structures have limited labeled data due to expensive and time-consuming physical simulations or lab experiments. To overcome this challenge, we propose a pretrain-finetune framework for the crystal property prediction task named CrysDiff based on diffusion models. In the pre-training phase, CrysDiff learns the latent marginal distribution of crystal structures via the reconstruction task. Subsequently, CrysDiff can be fine-tuned under the guidance of the new sparse labeled data, fitting the conditional distribution of the target property given the crystal structures. To better model the crystal geometry, CrysDiff notably captures the full symmetric properties of the crystals, including the invariance of reflection, rotation, and periodic translation. Extensive experiments demonstrate that CrysDiff can significantly improve the performance of the downstream crystal property prediction task on multiple target properties, outperforming all the SOTA pre-training models for crystals with good margins on the popular JARVIS-DFT dataset.



Paperid:995
Authors:Liangcai Su, Junwei Pan, Ximei Wang, Xi Xiao, Shijie Quan, Xihua Chen, Jie Jiang
Shenzhen International Graduate School, Tsinghua University, Tencent Inc., Tencent Inc., Shenzhen International Graduate School, Tsinghua University, Tencent Inc., Tencent Inc., Tencent Inc.
Abstract:
Multitask learning (MTL) has gained significant popularity in recommender systems as it enables simultaneous optimization of multiple objectives. A key challenge in MTL is negative transfer, but existing studies explored negative transfer on all samples, overlooking the inherent complexities within them. We split the samples according to the relative amount of positive feedback among tasks. Surprisingly, negative transfer still occurs in existing MTL methods on samples that receive comparable feedback across tasks. Existing work commonly employs a shared-embedding paradigm, limiting the ability of modeling diverse user preferences on different tasks. In this paper, we introduce a novel Shared and Task-specific EMbeddings (STEM) paradigm that aims to incorporate both shared and task-specific embeddings to effectively capture task-specific user preferences. Under this paradigm, we propose a simple model STEM-Net, which is equipped with an All Forward Task-specific Backward gating network to facilitate the learning of task-specific embeddings and direct knowledge transfer across tasks. Remarkably, STEM-Net demonstrates exceptional performance on comparable samples, achieving positive transfer. Comprehensive evaluation on three public MTL recommendation datasets demonstrates that STEM-Net outperforms state-of-the-art models by a substantial margin. Our code is released at https://github.com/LiangcaiSu/STEM.
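
A simplified sketch of the shared-plus-task-specific embedding paradigm follows (my reading of the idea, not the exact STEM-Net architecture; the gate and the stop-gradient placement are assumptions): every task forwards over the shared table and all task tables, but gradients from other tasks' tables are blocked, mirroring the "all forward, task-specific backward" gating.

import torch

def stem_style_embedding(shared, task_tables, task_id, item_ids, gate_weights):
    """shared / task_tables[k]: (n_items, dim) embedding tables; gate_weights: (len(task_tables)+1,) softmaxed."""
    parts = [shared[item_ids]]
    for k, table in enumerate(task_tables):
        emb = table[item_ids]
        if k != task_id:
            emb = emb.detach()                     # other tasks' tables: forward only, no backward
        parts.append(emb)
    stacked = torch.stack(parts, dim=0)            # (num_tables, batch, dim)
    return (gate_weights.view(-1, 1, 1) * stacked).sum(dim=0)

shared = torch.randn(100, 16)
task_tables = [torch.randn(100, 16, requires_grad=True) for _ in range(2)]
gates = torch.softmax(torch.randn(3), dim=0)
emb_for_task0 = stem_style_embedding(shared, task_tables, 0, torch.tensor([1, 5, 7]), gates)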



Paperid:996
Authors:Zhixiang Su, Di Wang, Chunyan Miao, Lizhen Cui
School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore WeBank-NTU Joint Research Institute on Fintech, NTU, Singapore, Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore WeBank-NTU Joint Research Institute on Fintech, NTU, Singapore, School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore WeBank-NTU Joint Research Institute on Fintech, NTU, Singapore SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University (SDU), China, SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University (SDU), China School of Software, SDU, China
Abstract:
Aiming to accurately predict missing edges representing relations between entities, which are pervasive in real-world Knowledge Graphs (KGs), relation prediction plays a critical role in enhancing the comprehensiveness and utility of KGs. Recent research focuses on path-based methods due to their inductive and explainable properties. However, these methods face a great challenge when many reasoning paths do not form Closed Paths (CPs) in the KG. To address this challenge, we propose the Anchoring Path Sentence Transformer (APST) by introducing Anchoring Paths (APs) to alleviate the reliance on CPs. Specifically, we develop a search-based description retrieval method to enrich entity descriptions and an assessment mechanism to evaluate the rationality of APs. APST takes both APs and CPs as the inputs of a unified Sentence Transformer architecture, enabling comprehensive predictions and high-quality explanations. We evaluate APST on three public datasets and achieve state-of-the-art (SOTA) performance in 30 of 36 transductive, inductive, and few-shot experimental settings.



Paperid:997
Authors:Colin Sullivan, Mo Tiwari, Sebastian Thrun
Stanford University, Stanford University, Stanford University
Abstract:
Decision trees remain one of the most popular machine learning models today, largely due to their out-of-the-box performance and interpretability. In this work, we present a Bayesian approach to decision tree induction via maximum a posteriori inference of a posterior distribution over trees. We first demonstrate a connection between maximum a posteriori inference of decision trees and AND/OR search. Using this connection, we propose an AND/OR search algorithm, dubbed MAPTree, which is able to recover the maximum a posteriori tree. Lastly, we demonstrate the empirical performance of the maximum a posteriori tree both on synthetic data and in real-world settings. On 16 real-world datasets, MAPTree either outperforms baselines or demonstrates comparable performance but with much smaller trees. On a synthetic dataset, MAPTree also demonstrates greater robustness to noise and better generalization than existing approaches. Finally, MAPTree recovers the maximum a posteriori tree faster than existing sampling approaches and, in contrast with those algorithms, is able to provide a certificate of optimality. The code for our experiments is available at https://github.com/ThrunGroup/maptree.



Paperid:998
Authors:Jie Sun, Zhaoying Ding, Xiaoshuang Chen, Qi Chen, Yincheng Wang, Kaiqiao Zhan, Ben Wang
Kuaishou Technology, Kuaishou Technology, Kuaishou Technology, Kuaishou Technology, Kuaishou Technology, Kuaishou Technology, Kuaishou Technology
Abstract:
The watch time is a significant indicator of user satisfaction in video recommender systems. However, the prediction of watch time as a target variable is often hindered by its highly imbalanced distribution, with a scarcity of observations for larger target values and overpopulated samples for small values. State-of-the-art watch time prediction models discretize the continuous watch time into a set of buckets in order to account for the distribution of watch time. However, how these discrete buckets should be created from the continuous watch time distribution remains largely uninvestigated, and existing discretization approaches suffer from either a large learning error or a large restoration error. To address this challenge, we propose a Classification-Restoration framework with Error-Adaptive-Discretization (CREAD) to accurately predict the watch time. The proposed framework contains a discretization module, a classification module, and a restoration module. It predicts the watch time through multiple classification problems. The discretization process is a key contribution of the CREAD framework. We theoretically analyze the impacts of the discretization on the learning error and the restoration error, and then propose the error-adaptive discretization (EAD) technique to better balance the two errors, which achieves better performance over traditional discretization approaches. We conduct detailed offline evaluations on a public dataset and an industrial dataset, both showing performance gains through the proposed approach. Moreover, we have fully launched our framework on an online video platform, which resulted in a significant increase in users' video watch time by 0.29% through A/B testing. These results highlight the effectiveness of the CREAD framework in watch time prediction in video recommender systems.
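The numpy sketch below illustrates one simple version of the classification-restoration pipeline implied above: predict the probability that watch time exceeds each bucket boundary, then restore an expectation from those probabilities. The boundaries and the piecewise-constant restoration rule are illustrative assumptions, not CREAD's error-adaptive discretization.

```python
import numpy as np

# illustrative bucket boundaries (seconds); CREAD learns error-adaptive ones
boundaries = np.array([0.0, 5.0, 15.0, 40.0, 120.0])

def restore_watch_time(p_exceed):
    """p_exceed[k] = predicted P(watch_time > boundaries[k+1]).
    Uses E[X] = integral of P(X > t) dt, approximated piecewise-constantly."""
    widths = np.diff(boundaries)            # bucket widths t_{k+1} - t_k
    return float(np.sum(widths * p_exceed))

print(restore_watch_time(np.array([0.9, 0.6, 0.2, 0.05])))  # ~19.5 seconds
```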



Paperid:999
Authors:Ke Sun, Pei Liu, Pengfei Li, Zhifang Liao
Central South University, Central South University, Tsinghua University, Central South University
Abstract:
Traffic prediction is the core issue of Intelligent Transportation Systems. Recently, researchers have tended to use complex structures, such as transformer-based structures, for tasks such as traffic prediction. Notably, traffic data is simpler to process compared to text and images, which raises questions about the necessity of these structures. Additionally, when handling traffic data, researchers tend to manually design the model structure based on the data features, which makes the structure of traffic prediction models redundant and their generalizability limited. To address the above, we introduce ModWaveMLP, a multilayer perceptron (MLP)-based model designed according to mode decomposition and wavelet noise-reduction information learning concepts. The model is based on a simple MLP structure, which achieves the separation and prediction of different traffic modes and does not depend on additionally introduced features such as the topology of the traffic network. By performing experiments on the real-world datasets METR-LA and PEMS-BAY, our model achieves SOTA performance, outperforming GNN- and transformer-based models as well as models that introduce additional feature data, while showing better generalizability; we further demonstrate the effectiveness of the various parts of the model through ablation experiments. This offers new insights to subsequent researchers involved in traffic model design. The code is available at: https://github.com/Kqingzheng/ModWaveMLP.



Paperid:1000
Authors:Li Sun, Zhenhao Huang, Zixi Wang, Feiyang Wang, Hao Peng, Philip S. Yu
North China Electric Power University, North China Electric Power University, North China Electric Power University, Beijing University of Posts and Telecommunications, Beihang University, UIC
Abstract:
Graphs are typical non-Euclidean data with complex structures. In recent years, Riemannian graph representation learning has emerged as an exciting alternative to Euclidean approaches. However, Riemannian methods are still in an early stage: most of them adopt a single curvature (radius) regardless of structural complexity, suffer from numerical instability due to the exponential/logarithmic map, and lack the ability to capture motif regularity. In light of the issues above, we propose the problem of Motif-aware Riemannian Graph Representation Learning, seeking a numerically stable encoder to capture motif regularity in a diverse-curvature manifold without labels. To this end, we present a novel Motif-aware Riemannian model with Generative-Contrastive learning (MotifRGC), which conducts a min-max game in a Riemannian manifold in a self-supervised manner. First, we propose a new type of Riemannian GCN (D-GCN), in which we construct a diverse-curvature manifold by a product layer with a diversified factor and replace the exponential/logarithmic map with a stable kernel layer. Second, we introduce a motif-aware Riemannian generative-contrastive learning scheme to capture motif regularity in the constructed manifold and learn motif-aware node representations without external labels. Empirical results show the superiority of MotifRGC.



Paperid:1001
Authors:Yifei Sun, Qi Zhu, Yang Yang, Chunping Wang, Tianyu Fan, Jiajun Zhu, Lei Chen
Zhejiang University, University of Illinois Urbana-Champaign, Zhejiang University, Finvolution, Zhejiang University, Zhejiang University, Finvolution
Abstract:
Recently, the paradigm of pretraining and fine-tuning graph neural networks has been intensively studied and applied in a wide range of graph mining tasks. Its success is generally attributed to the structural consistency between pre-training and downstream datasets, which, however, does not hold in many real-world scenarios. Existing works have shown that the structural divergence between pre-training and downstream graphs significantly limits the transferability when using the vanilla fine-tuning strategy. This divergence leads to model overfitting on pre-training graphs and causes difficulties in capturing the structural properties of the downstream graphs. In this paper, we identify the fundamental cause of structural divergence as the discrepancy of generative patterns between the pre-training and downstream graphs. Furthermore, we propose G-Tuning to preserve the generative patterns of downstream graphs. Given a downstream graph G, the core idea is to tune the pre-trained GNN so that it can reconstruct the generative patterns of G, the graphon W. However, the exact reconstruction of a graphon is known to be computationally expensive. To overcome this challenge, we provide a theoretical analysis that establishes the existence of a set of alternative graphons called graphon bases for any given graphon. By utilizing a linear combination of these graphon bases, we can efficiently approximate W. This theoretical finding forms the basis of our model, as it enables effective learning of the graphon bases and their associated coefficients. Compared with existing algorithms, G-Tuning demonstrates consistent performance improvement in 7 in-domain and 7 out-of-domain transfer learning experiments.



Paperid:1002
Authors:Nils Philipp Walter, Jonas Fischer, Jilles Vreeken
CISPA Helmholtz Center for Information Security, Harvard University, CISPA Helmholtz Center for Information Security
Abstract:
Discovering patterns in data that best describe the differences between classes allows one to hypothesize and reason about class-specific mechanisms. In molecular biology, for example, these bear the promise of advancing the understanding of cellular processes differing between tissues or diseases, which could lead to novel treatments. To be useful in practice, methods that tackle the problem of finding such differential patterns have to be readily interpretable by domain experts and scalable to extremely high-dimensional data. In this work, we propose a novel, inherently interpretable binary neural network architecture, Diffnaps, that extracts differential patterns from data. Diffnaps is scalable to hundreds of thousands of features and robust to noise, thus overcoming the limitations of current state-of-the-art methods in large-scale applications such as in biology. We show on synthetic and real-world data, including three biological applications, that unlike its competitors, Diffnaps consistently yields accurate, succinct, and interpretable class descriptions.



Paperid:1003
Authors:Hai Wan, Pingjia Liang, Jianfeng Du, Weilin Luo, Rongzhen Ye, Bo Peng
School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University, Guangdong University of Foreign Studies Bigmath Technology, Shenzhen, School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University
Abstract:
It is important to automatically discover the underlying tree-structured formulae from large amounts of data. In this paper, we examine learning linear temporal logic on finite traces (LTLf) formulae, which are tree-structured syntactically and characterize temporal properties semantically. The core challenge is to bridge the gap between the concise tree-structured syntax and the complex LTLf semantics. Besides, the learning quality is endangered by the explosion of the search space and by wrong search bias induced by imperfect data. We tackle these challenges by proposing an LTLf encoding method to parameterize a neural network so that the neural computation is able to simulate the inference of LTLf formulae. We first identify faithful LTLf encoding, a subclass of LTLf encoding, which has a one-to-one correspondence to LTLf formulae. Faithful encoding guarantees that the learned parameter assignment of the neural network can be directly interpreted as an LTLf formula. With such an encoding method, we then propose an end-to-end approach, TLTLf, to learn LTLf formulae through neural networks parameterized by our LTLf encoding method. Experimental results demonstrate that our approach achieves state-of-the-art performance with up to 7% improvement in accuracy, highlighting the benefits of introducing the faithful LTLf encoding.



Paperid:1004
Authors:Zhijing Wan, Zhixiang Wang, Yuran Wang, Zheng Wang, Hongyuan Zhu, Shin'ichi Satoh
National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University Hubei Key Laboratory of Multimedia and Network Communication Engineering, The University of Tokyo National Institute of Informatics, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University Hubei Key Laboratory of Multimedia and Network Communication Engineering, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University Hubei Key Laboratory of Multimedia and Network Communication Engineering, Institute for Infocomm Research (I2R) & Centre for Frontier AI Research (CFAR), A*STAR, National Institute of Informatics The University of Tokyo
Abstract:
Coreset selection seeks to choose a subset of crucial training samples for efficient learning. It has gained traction in deep learning, particularly with the surge in training dataset sizes. Sample selection hinges on two main aspects: a sample's representation in enhancing performance and the role of sample diversity in averting overfitting. Existing methods typically measure both the representation and diversity of data based on similarity metrics, such as the L2 norm. They have capably tackled representation via distribution matching guided by the similarities of features, gradients, or other information between data. However, effectively diverse sample selection under such metrics remains sub-optimal. This is because the similarity metrics usually simply aggregate dimension similarities without acknowledging disparities among the dimensions that significantly contribute to the final similarity. As a result, they fall short of adequately capturing diversity. To address this, we propose a feature-based diversity constraint, compelling the chosen subset to exhibit maximum diversity. Our key idea is the introduction of a novel Contributing Dimension Structure (CDS) metric. Different from similarity metrics that measure the overall similarity of high-dimensional features, our CDS metric considers not only the reduction of redundancy in feature dimensions, but also the difference between dimensions that contribute significantly to the final similarity. We reveal that existing methods tend to favor samples with similar CDS, leading to a reduced variety of CDS types within the coreset and subsequently hindering model performance. In response, we enhance the performance of five classical selection methods by integrating the CDS constraint. Our experiments on three datasets demonstrate the general effectiveness of the proposed method in boosting existing methods.



Paperid:1005
Authors:Binwu Wang, Pengkun Wang, Yudong Zhang, Xu Wang, Zhengyang Zhou, Lei Bai, Yang Wang
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China, Shanghai AI Laboratory, University of Science and Technology of China
Abstract:
With the progress of urban transportation systems, a significant amount of high-quality traffic data is continuously collected in a streaming manner, which has propelled the prosperity of the field of spatial-temporal graph prediction. In this paper, rather than solely focusing on designing powerful models for static graphs, we shift our focus to spatial-temporal graph prediction in the dynamic scenario, which involves a continuously expanding and evolving underlying graph. To address the inherent challenges, a decoupled learning framework (DLF) is proposed in this paper, which consists of a spatial-temporal graph learning network (DSTG) with a specialized decoupling training strategy. Incorporating inductive biases of time-series structures, DSTG can interpret time dependencies into latent trend and seasonal terms. To enable prompt adaptation to the evolving distribution of the dynamic graph, our decoupling training strategy is devised to iteratively update these two types of patterns. Specifically, for learning seasonal patterns, we conduct thorough training of the model using a long time series (e.g., three months of data). To enhance the learning ability of the model, we also introduce a masked auto-encoding mechanism. During this period, we frequently update trend patterns to incorporate new information from dynamic graphs. Considering both effectiveness and efficiency, we develop a subnet sampling strategy to select a few representative nodes for fine-tuning the weights of the model. These sampled nodes cover unseen patterns and previously learned patterns. Experiments on dynamic spatial-temporal graph datasets further demonstrate the competitive performance, superior efficiency, and strong scalability of the proposed framework.
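As a toy illustration of the decoupling inductive bias mentioned above (DSTG learns the decomposition; this is only a hand-crafted stand-in), a traffic series can be split into a slow-moving trend term and a seasonal remainder:

```python
import numpy as np

def decouple(series, period=288):
    """Split a 1-D series into a trend (moving average over one period) and the
    seasonal/residual remainder. period=288 assumes 5-minute samples per day."""
    kernel = np.ones(period) / period
    trend = np.convolve(series, kernel, mode="same")
    seasonal = series - trend
    return trend, seasonal

t = np.arange(288 * 7)   # one week of 5-minute steps
series = 50 + 10 * np.sin(2 * np.pi * t / 288) + np.random.randn(t.size)
trend, seasonal = decouple(series)
```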



Paperid:1006
Authors:Kai Wang, Haoyu Liu, Zhipeng Hu, Xiaochuan Feng, Minghao Zhao, Shiwei Zhao, Runze Wu, Xudong Shen, Tangjie Lv, Changjie Fan
Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc., Fuxi AI Lab, NetEase Inc.
Abstract:
Matchmaking is a core task in esports and online games, as it contributes to player engagement and further influences the game's lifecycle. Previous methods focus on creating fair games at all times. They divide players into different tiers based on skill levels and only select players from the same tier for each game. Though this strategy can ensure fair matchmaking, it is not always good for player engagement. In this paper, we propose a novel Engagement-oriented Matchmaking (EnMatch) framework to ensure fair games and simultaneously enhance player engagement. Two main issues need to be addressed. First, it is unclear how to measure the impact of different team compositions and confrontations on player engagement during the game, considering the variety of player characteristics. Second, such a detailed consideration of every single player during matchmaking results in an NP-hard combinatorial optimization problem with non-linear objectives. In light of these challenges, we turn to real-world data analysis to reveal engagement-related factors. The resulting insights guide the development of engagement modeling, enabling the estimation of quantified engagement before a match is completed. To handle the combinatorial optimization problem, we cast it into a reinforcement learning framework, in which a neural combinatorial optimization model is built and solved. The performance of EnMatch is finally demonstrated through comparison with other state-of-the-art methods on several real-world datasets and online deployments in two games.



Paperid:1007
Authors:Ke Wang, Yanmin Zhu, Tianzi Zang, Chunyang Wang, Mengyuan Jing
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China Hangzhou Innovation Institute, Beihang University, Hangzhou, China, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, Nanjing University of Aeronautics and Astronautics, Nanjing, China, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Abstract:
Designed to establish potential relations and distill high-order representations, graph-based recommendation systems continue to show promising results by jointly modeling ratings and reviews. However, existing studies capture only simple review relations, failing to (1) completely explore hidden connections between users (or items), (2) filter out redundant information derived from reviews, and (3) model the behavioral association between rating and review interactions. To address these challenges, we propose a review-enhanced hierarchical contrastive learning framework, named ReHCL. First, ReHCL constructs topic and semantic graphs to fully mine review relations from different views. Moreover, a cross-view graph contrastive learning is used to enhance node representations and extract useful review knowledge. Meanwhile, we design a neighbor-based positive sampling to capture the graph-structured similarity between topic and semantic views, further performing efficient contrast and reducing redundant noise. Next, we propose a cross-modal contrastive learning to match the rating and review representations, by exploring the association between ratings and reviews. Lastly, these two contrastive learning modes form a hierarchical contrastive learning task, which is applied to enhance the final recommendation task. Extensive experiments verify the superiority of ReHCL compared with state-of-the-art methods.



Paperid:1008
Authors:Luyao Wang, Pengnian Qi, Xigang Bao, Chunlai Zhou, Biao Qin
Renmin University of China, Renmin University of China, Renmin University of China, Renmin University of China, Renmin University of China
Abstract:
Multi-modal entity alignment (MMEA) aims to identify equivalent entities between two multi-modal knowledge graphs for integration. Unfortunately, prior works have attempted to improve the interaction and fusion of multi-modal information while overlooking the influence of modal-specific noise and the usage of labeled and unlabeled data in semi-supervised settings. In this work, we introduce a Pseudo-label Calibration Multi-modal Entity Alignment (PCMEA) method in a semi-supervised setting. Specifically, in order to generate holistic entity representations, we first devise various embedding modules and attention mechanisms to extract visual, structural, relational, and attribute features. Different from prior direct fusion methods, we next propose to exploit mutual information maximization to filter the modal-specific noise and to augment modal-invariant commonality. Then, we combine pseudo-label calibration with momentum-based contrastive learning to make full use of the labeled and unlabeled data, which improves the quality of pseudo-labels and pulls aligned entities closer. Finally, extensive experiments on two MMEA datasets demonstrate the effectiveness of our PCMEA, which yields state-of-the-art performance.



Paperid:1009
Authors:Wenbo Wang, Bingquan Liu, Lili Shan, Chengjie Sun, Ben Chen, Jian Guan
Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Alibaba Group, Harbin Engineering University
Abstract:
Existing cold-start recommendation methods often adopt item-level alignment strategies to align the content feature and the collaborative feature of warm items for model training; however, cold items in the test stage have no historical interactions with users from which to obtain the collaborative feature. These existing models ignore the aforementioned condition of cold items in the training stage, which limits their performance. In this paper, we propose a preference-aware dual contrastive learning based recommendation model (PAD-CLRec), where the user preference is explored to take into account the condition of cold items for feature alignment. Here, the user preference is obtained by aggregating the collaborative features of the warm items in the user's purchase records. Then, a group-level alignment between the user preference and the item's content feature can be realized via a proposed preference-aware contrastive function for enhancing cold-item recommendation. In addition, a joint objective function is introduced to achieve a better trade-off between the recommendation performance of warm items and cold items from both item-level and group-level perspectives, yielding better overall recommendation performance. Extensive experiments are conducted to demonstrate the effectiveness of the proposed method, and the results show the superiority of our method compared with the state-of-the-art approaches.



Paperid:1010
Authors:Yu Wang, Zexue He, Zhankui He, Hao Xu, Julian McAuley
University of California, San Diego, University of California, San Diego, UC San Diego, University of California, San Diego, UCSD
Abstract:
Understanding and accurately explaining compatibility relationships between fashion items is a challenging problem in the burgeoning domain of AI-driven outfit recommendations. Present models, while making strides in this area, still occasionally fall short, offering explanations that can be elementary and repetitive. This work aims to address these shortcomings by introducing the Pair Fashion Explanation (PFE) dataset, a unique resource that has been curated to illuminate these compatibility relationships. Furthermore, we propose an innovative two-stage pipeline model that is fine-tuned on this dataset, allowing it to generate explanations that convey the compatibility relationships between items. Our experiments showcase the model's potential in crafting descriptions that are knowledgeable, aligned with ground-truth matching correlations, and understandable and informative, as assessed by both automatic metrics and human evaluation. Our code and data are released at https://github.com/wangyu-ustc/PairFashionExplanation.



Paperid:1011
Authors:Yu Wang, Ronghang Zhu, Pengsheng Ji, Sheng Li
LinkedIn Corporation, School of Computing, University of Georgia, Department of Statistics, University of Georgia, School of Data Science, University of Virginia
Abstract:
Domain adaptation has become an attractive learning paradigm, as it can leverage source domains with rich labels to deal with classification tasks in an unlabeled target domain. A few recent studies develop domain adaptation approaches for graph-structured data. In the case of the node classification task, current domain adaptation methods only focus on the closed-set setting, where source and target domains share the same label space. A more practical assumption is that the target domain may contain new classes that are not included in the source domain. Therefore, in this paper, we introduce a novel and challenging problem for graphs, i.e., open-set domain adaptive node classification, and propose a new approach to solve it. Specifically, we develop an algorithm for efficient knowledge transfer from a labeled source graph to an unlabeled target graph under a separate domain alignment (SDA) strategy, in order to learn discriminative feature representations for the target graph. Our goal is to not only correctly classify target nodes into the known classes, but also classify unseen types of nodes into an unknown class. Experimental results on real-world datasets show that our method outperforms existing methods on graph domain adaptation.



Paperid:1012
Authors:Yiwei Wei, Shaozu Yuan, Hengyang Zhou, Longbiao Wang, Zhiling Yan, Ruosong Yang, Meng Chen
Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University China University of Petroleum(Beijing) at Karamay, JD AI Research, China University of Petroleum(Beijing) at Karamay, Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University Huiyan Technology (Tianjin) Co., Ltd, JD AI Research, JD AI Research, Yep AI
Abstract:
Multimodal sarcasm detection, which aims to detect the ironic sentiment within multimodal social data, has gained substantial popularity in both the natural language processing and computer vision communities. Recently, graph-based studies that draw sentimental relations to detect multimodal sarcasm have made notable advancements. However, they have neglected exploiting graph-based global semantic congruity from existing instances to facilitate the prediction, which ultimately hinders the model's performance. In this paper, we introduce a new inference paradigm that leverages global graph-based semantic awareness to handle this task. Firstly, we construct fine-grained multimodal graphs for each instance and integrate them into a semantic space to draw graph-based relations. During inference, we leverage global semantic congruity to retrieve the k-nearest neighbor instances in the semantic space as references for voting on the final prediction. To enhance the semantic correlation of representations in the semantic space, we also introduce label-aware graph contrastive learning to further improve performance. Experimental results demonstrate that our model achieves state-of-the-art (SOTA) performance in multimodal sarcasm detection. The code will be available at https://github.com/upccpu/G2SAM.
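The retrieval-and-vote inference step described above can be pictured with the minimal sketch below (the embeddings, memory bank, and plain majority voting are placeholders; the paper's label-aware contrastive training is not shown):

```python
import numpy as np

def knn_vote(query_emb, bank_embs, bank_labels, k=5):
    """Retrieve the k nearest labeled instances in the semantic space by cosine
    similarity and vote on the sarcasm label (1 = sarcastic, 0 = not)."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q
    topk = np.argsort(-sims)[:k]
    return int(bank_labels[topk].mean() > 0.5)   # majority vote over retrieved neighbors

bank = np.random.randn(100, 32)
labels = np.random.randint(0, 2, size=100)
print(knn_vote(np.random.randn(32), bank, labels, k=5))
```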



Paperid:1013
Authors:Yuecen Wei, Haonan Yuan, Xingcheng Fu, Qingyun Sun, Hao Peng, Xianxian Li, Chunming Hu
Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China School of Software, Beihang University, Beijing, China Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China, Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China, Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, China, Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China School of Software, Beihang University, Beijing, China
Abstract:
Hierarchy is an important and commonly observed topological property in real-world graphs that indicates the relationships between supervisors and subordinates or the organizational behavior of human groups. As hierarchy is introduced as a new inductive bias into Graph Neural Networks (GNNs) in various tasks, it implies latent topological relations that attackers can exploit to improve their inference attack performance, leading to serious privacy leakage issues. In addition, existing privacy-preserving frameworks suffer from reduced protection ability in hierarchical propagation due to the lack of adaptive upper-bound estimation of the hierarchical perturbation boundary. It is therefore urgent to effectively leverage the hierarchical property of data while satisfying privacy guarantees. To solve this problem, we propose the Poincaré Differential Privacy framework, named PoinDP, to protect hierarchy-aware graph embeddings based on hyperbolic geometry. Specifically, PoinDP first learns the hierarchy weights for each entity based on the Poincaré model in hyperbolic space. Then, a Personalized Hierarchy-aware Sensitivity is designed to measure the sensitivity of the hierarchical structure and adaptively allocate the privacy protection strength. Besides, a Hyperbolic Gaussian Mechanism (HGM) is proposed to extend the Gaussian mechanism in Euclidean space to hyperbolic space to realize random perturbations that satisfy differential privacy under the hyperbolic space metric. Extensive experimental results on five real-world datasets demonstrate the proposed PoinDP's advantages of effective privacy protection while maintaining good performance on the node classification task.



Paperid:1014
Authors:Dayan Wu, Qinghang Su, Bo Li, Weiping Wang
Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering, CAS, China
Abstract:
Deep incremental hashing has become a subject of considerable interest due to its capability to learn hash codes in an incremental manner, eliminating the need to generate codes for classes that have already been learned. However, accommodating more classes requires longer hash codes, and regenerating database codes becomes inevitable when code expansion is required. In this paper, we present a unified deep hash framework that can simultaneously learn new classes and increase hash code capacity. Specifically, we design a triple-channel asymmetric framework to optimize a new CNN model with a target code length and a code projection matrix. This enables us to directly generate hash codes for new images, and to efficiently generate expanded hash codes for original database images from the old ones with the learned projection matrix. Meanwhile, we propose a pairwise-label-based incremental similarity-preserving loss to optimize the new CNN model, which can incrementally preserve new similarities while maintaining the old ones. Additionally, we design a double-end quantization loss to reduce the quantization error from new and original query images. As a result, our method efficiently embeds both new and original similarities into the expanded hash codes, while keeping the original database codes unchanged. We conduct extensive experiments on three widely-used image retrieval benchmarks, demonstrating that our method can significantly reduce the time required to expand existing database codes, while maintaining state-of-the-art retrieval performance.
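The code-expansion step can be sketched as below: existing database codes are mapped to longer codes with a projection matrix, so old images never pass through the new CNN again. The random projection here is only a stand-in for the matrix the framework actually learns.

```python
import numpy as np

old_len, new_len, n_db = 32, 64, 1000
old_codes = np.sign(np.random.randn(n_db, old_len))   # existing +/-1 database codes
P = np.random.randn(old_len, new_len)                 # stand-in for the learned projection matrix

expanded_codes = np.sign(old_codes @ P)               # longer codes for the old database images
print(expanded_codes.shape)                           # (1000, 64)
```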



Paperid:1015
Authors:Likang Wu, Zhaopeng Qiu, Zhi Zheng, Hengshu Zhu, Enhong Chen
University of Science and Technology of China Career Science Lab, BOSS Zhipin, Career Science Lab, BOSS Zhipin, University of Science and Technology of China Career Science Lab, BOSS Zhipin, Career Science Lab, BOSS Zhipin, University of Science and Technology of China
Abstract:
Large Language Models (LLMs) have revolutionized natural language processing tasks, demonstrating their exceptional capabilities in various domains. However, their potential for graph semantic mining in job recommendations remains largely unexplored. This paper focuses on unveiling the capability of large language models to understand behavior graphs and on leveraging this understanding to enhance recommendations in online recruitment, including promoting out-of-distribution (OOD) applications. We present a novel framework that harnesses the rich contextual information and semantic representations provided by large language models to analyze behavior graphs and uncover underlying patterns and relationships. Specifically, we propose a meta-path prompt constructor that aids the LLM recommender in grasping the semantics of behavior graphs for the first time, and design a corresponding path augmentation module to alleviate the prompt bias introduced by path-based sequence input. By facilitating this capability, our framework enables personalized and accurate job recommendations for individual users. We evaluate the effectiveness of our approach on comprehensive real-world datasets and demonstrate its ability to improve the relevance and quality of recommended results. This research not only sheds light on the untapped potential of large language models but also provides valuable insights for developing advanced recommendation systems in the recruitment market. The findings contribute to the growing field of natural language processing and offer practical implications for enhancing job search experiences.



Paperid:1016
Authors:Hongjie Xia, Huijie Ao, Long Li, Yu Liu, Sen Liu, Guangnan Ye, Hongfeng Chai
School of Computer Science, Fudan University, Shanghai, China Institute of FinTech, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China Institute of FinTech, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China Institute of FinTech, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China Institute of FinTech, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China Institute of FinTech, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China Institute of FinTech, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China Institute of FinTech, Fudan University, Shanghai, China
Abstract:
Quantitative stock selection is one of the most challenging FinTech tasks due to non-stationary dynamics and complex market dependencies. Existing studies rely on channel-mixing methods, exacerbating the issue of distribution shift in financial time series. Additionally, the complex model structures they build make it difficult to handle very long sequences. Furthermore, most of them are based on predefined stock relationships, making it difficult to capture dynamic and highly volatile stock markets. To address the above issues, in this paper, we propose Channel-Independent based Spatio-Temporal Hypergraph Pre-trained Attention Networks (CI-STHPAN), a two-stage framework for stock selection, involving Transformer and HGAT based stock time series self-supervised pre-training and stock-ranking based downstream task fine-tuning. We calculate the similarity of stock time series of different channels in dynamic intervals based on Dynamic Time Warping (DTW), and further construct a channel-independent stock dynamic hypergraph based on the similarity. Experiments with NASDAQ and NYSE market data over five years show that our framework outperforms SOTA approaches in terms of investment return ratio (IRR) and Sharpe ratio (SR). Additionally, we find that even without introducing graph information, self-supervised learning based on the vanilla Transformer Encoder also surpasses SOTA results. Notable improvements are gained on the NYSE market, mainly attributed to the improvement of the fine-tuning approach on Information Coefficient (IC) and Information Ratio based IC (ICIR), indicating that the fine-tuning method enhances the accuracy and stability of the model's predictions.
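The channel-wise similarity computation underlying the hypergraph construction can be sketched with a plain Dynamic Time Warping distance, as below; the dynamic intervals, thresholding, and hyperedge assembly are left out and follow the paper.

```python
import numpy as np

def dtw(a, b):
    """Classic O(nm) Dynamic Time Warping distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# two stocks' normalized closing prices over one interval (illustrative)
print(dtw(np.sin(np.linspace(0, 3, 60)), np.sin(np.linspace(0.2, 3.2, 60))))
```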



Paperid:1017
Authors:Weiwei Xiao, Yongyong Chen, Qiben Shan, Yaowei Wang, Jingyong Su
Harbin Institute of Technology, Shenzhen PengCheng Laboratory, Harbin Institute of Technology, Shenzhen, PengCheng Laboratory, PengCheng Laboratory, Harbin Institute of Technology, Shenzhen PengCheng Laboratory
Abstract:
Training neural networks with good generalization requires large computational costs in many deep learning methods due to large-scale datasets and over-parameterized models. Despite the emergence of a number of coreset selection methods to reduce these computational costs, the problem of coreset distribution bias, i.e., the skewed distribution between the coreset and the entire dataset, has not been well studied. In this paper, we find that the closer the feature distribution of the coreset is to that of the entire dataset, the better the generalization performance of the coreset, particularly under extreme pruning. This motivates us to propose a simple yet effective method for coreset selection to alleviate the distribution bias between the coreset and the entire dataset, called feature distribution matching (FDMat). Unlike gradient-based methods, which select samples with larger gradient values or approximate gradient values of the entire dataset, FDMat aims to select the coreset that is closest to the feature distribution of the entire dataset. Specifically, FDMat formulates coreset selection as an optimal transport problem from the coreset to the entire dataset in feature embedding spaces. Moreover, our method shows strong robustness due to the removal of samples far from the distribution, especially when the entire dataset contains noisy and class-imbalanced samples. Extensive experiments on multiple benchmarks show that FDMat improves coreset selection over existing coreset methods. The code is available at https://github.com/successhaha/FDMat.



Paperid:1018
Authors:Bo Xiong, Mojtaba Nayyeri, Linhao Luo, Zihao Wang, Shirui Pan, Steffen Staab
University of Stuttgart, University of Stuttgart, Monash University, University of Stuttgart, Griffith University, University of Stuttgart University of Southampton
Abstract:
Reasoning with knowledge graphs (KGs) has primarily focused on triple-shaped facts. Recent advancements have explored enhancing the semantics of these facts by incorporating more potent representations, such as hyper-relational facts. However, these approaches are limited to atomic facts, which describe a single piece of information. This paper extends beyond atomic facts and delves into nested facts, represented by quoted triples where subjects and objects are triples themselves (e.g., ((BarackObama, holds_position, President), succeed_by, (DonaldTrump, holds_position, President))). These nested facts enable the expression of complex semantics like situations over time and logical patterns over entities and relations. In response, we introduce NestE, a novel KG embedding approach that captures the semantics of both atomic and nested factual knowledge. NestE represents each atomic fact as a 1×3 matrix, and each nested relation is modeled as a 3×3 matrix that rotates the 1×3 atomic fact matrix through matrix multiplication. Each element of the matrix is represented as a complex number in the generalized 4D hypercomplex space, including (spherical) quaternions, hyperbolic quaternions, and split-quaternions. Through thorough analysis, we demonstrate the embedding's efficacy in capturing diverse logical patterns over nested facts, surpassing the confines of first-order logic-like expressions. Our experimental results showcase NestE's significant performance gains over current baselines in triple prediction and conditional link prediction. The code and pre-trained models are openly available at https://github.com/xiongbo010/NestE.



Paperid:1019
Authors:Fan Xu, Nan Wang, Hao Wu, Xuezhi Wen, Xibin Zhao, Hai Wan
University of Science and Technology of China, Beijing Jiaotong University, University of Science and Technology of China, Beijing Jiaotong University, Tsinghua University, Tsinghua University
Abstract:
Graph-based fraud detection (GFD) can be regarded as a challenging semi-supervised node binary classification task. In recent years, Graph Neural Networks (GNNs) have been widely applied to GFD, characterizing the anomalous possibility of a node by aggregating neighbor information. However, fraud graphs are inherently heterophilic, thus most GNNs perform poorly due to their assumption of homophily. In addition, due to the existence of heterophily and the class imbalance problem, existing models do not fully utilize the precious node label information. To address the above issues, this paper proposes a semi-supervised GNN-based fraud detector, SEC-GFD. This detector includes a hybrid filtering module and a local environmental constraint module; the two modules are utilized to solve the heterophily and label utilization problems, respectively. The first module starts from the perspective of the spectral domain and solves the heterophily problem to a certain extent. Specifically, it divides the spectrum into various mixed-frequency bands based on the correlation between spectrum energy distribution and heterophily. Then, in order to make full use of the node label information, a local environmental constraint module is adaptively designed. Comprehensive experimental results on four real-world fraud detection datasets demonstrate that SEC-GFD outperforms other competitive graph-based fraud detectors. We release our code at https://github.com/Sunxkissed/SEC-GFD.



Paperid:1020
Authors:Pengwei Yan, Kaisong Song, Zhuoren Jiang, Yangyang Kang, Tianqianjin Lin, Changlong Sun, Xiaozhong Liu
Department of Information Resources Management, Zhejiang University, Hangzhou, 310058, China Alibaba Group, Hangzhou, 311121, China, Alibaba Group, Hangzhou, 311121, China Northeastern University, Shenyang, 110819, China, Department of Information Resources Management, Zhejiang University, Hangzhou, 310058, China, Alibaba Group, Hangzhou, 311121, China, Department of Information Resources Management, Zhejiang University, Hangzhou, 310058, China Alibaba Group, Hangzhou, 311121, China, Alibaba Group, Hangzhou, 311121, China, Computer Science Department, Worcester Polytechnic Institute, Worcester, 01609-2280, MA, USA
Abstract:
While self-supervised graph pre-training techniques have shown promising results in various domains, their application still faces the challenges of limited topology learning, dependency on human knowledge, and insufficient multi-level interactions. To address these issues, we propose a novel solution, Dual-level Graph self-supervised Pretraining with Motif discovery (DGPM), which introduces a unique dual-level pre-training structure that orchestrates node-level and subgraph-level pretext tasks. Unlike prior approaches, DGPM autonomously uncovers significant graph motifs through an edge pooling module, aligning learned motif similarities with graph kernel-based similarities. A cross-matching task enables sophisticated node-motif interactions and novel representation learning. Extensive experiments on 15 datasets validate DGPM's effectiveness and generalizability, outperforming state-of-the-art methods in unsupervised representation learning and transfer learning settings. The autonomously discovered motifs demonstrate DGPM's potential to enhance robustness and interpretability.



Paperid:1021
Authors:Yuguang Yan, Yuanlin Chen, Shibo Wang, Hanrui Wu, Ruichu Cai
School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, College of Information Science and Technology, Jinan University, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China Guangdong Provincial Key Laboratory of Public Finance and Taxation with Big Data Application, Guangzhou, China
Abstract:
Hypergraphs capture high-order information in structured data and have received much attention in machine learning and data mining. Existing approaches mainly learn representations for hypervertices by transforming a hypergraph into a standard graph, or learn representations for hypervertices and hyperedges in separate spaces. In this paper, we propose a hypergraph expansion method that transforms a hypergraph into a standard graph while preserving high-order information. Different from previous hypergraph expansion approaches like clique expansion and star expansion, we transform both hypervertices and hyperedges in the hypergraph into vertices in the expanded graph, and construct connections between hypervertices or hyperedges, so that richer relationships can be used in graph learning. Based on the expanded graph, we propose a learning model to embed hypervertices and hyperedges in a joint representation space. Compared with the method of learning separate spaces for hypervertices and hyperedges, our method is able to capture common knowledge involved in hypervertices and hyperedges, and also improves data efficiency and computational efficiency. To better leverage structure information, we minimize the graph reconstruction loss to preserve the structure information in the model. We perform experiments on both hypervertex classification and hyperedge classification tasks to demonstrate the effectiveness of our proposed method.
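A minimal sketch of the expansion idea, assuming only the incidence structure (vertex-to-hyperedge links): both hypervertices and hyperedges become vertices of the expanded graph, connected through the incidence matrix. Any additional hypervertex-hypervertex or hyperedge-hyperedge connections the method constructs are omitted here.

```python
import numpy as np

# incidence matrix: H[v, e] = 1 if hypervertex v belongs to hyperedge e
H = np.array([[1, 0],
              [1, 1],
              [0, 1]])
n_v, n_e = H.shape

# block adjacency over (hypervertices + hyperedges): vertex v is linked to every
# hyperedge it belongs to, and vice versa
A = np.zeros((n_v + n_e, n_v + n_e))
A[:n_v, n_v:] = H
A[n_v:, :n_v] = H.T
print(A)
```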



Paperid:1022
Authors:Cheng Yang, Jixi Liu, Yunhe Yan, Chuan Shi
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Despite the remarkable success of graph neural networks (GNNs) in modeling graph-structured data, like other machine learning models, GNNs are also susceptible to making biased predictions based on sensitive attributes, such as race and gender. For fairness considerations, recent state-of-the-art (SOTA) methods propose to filter out sensitive information from inputs or representations, e.g., by edge dropping or feature masking. However, we argue that such filtering-based strategies may also filter out some non-sensitive feature information, leading to a sub-optimal trade-off between predictive performance and fairness. To address this issue, we unveil an innovative neutralization-based paradigm, where additional Fairness-facilitating Features (F3) are incorporated into node features or representations before message passing. The F3 are expected to statistically neutralize the sensitive bias in node representations and provide additional non-sensitive information. We also provide theoretical explanations for our rationale, concluding that F3 can be realized by emphasizing the features of each node's heterogeneous neighbors (neighbors with different sensitive attributes). We name our method FairSIN, and present three implementation variants from both data-centric and model-centric perspectives. Experimental results on five benchmark datasets with three different GNN backbones show that FairSIN significantly improves fairness metrics while maintaining high prediction accuracies. Codes and appendix can be found at https://github.com/BUPT-GAMMA/FariSIN.
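A bare-bones sketch of the neutralization step, under the simplifying assumption that F3 is just the unweighted mean of heterogeneous neighbors' raw features added before message passing (the paper's learned, data-centric, and model-centric variants are not reproduced here):

```python
import numpy as np

def add_f3(X, adj, sensitive, weight=1.0):
    """X: [n, d] node features; adj: [n, n] 0/1 adjacency; sensitive: [n] attribute values."""
    X_new = X.copy()
    for v in range(X.shape[0]):
        hetero = [u for u in np.flatnonzero(adj[v]) if sensitive[u] != sensitive[v]]
        if hetero:
            # fairness-facilitating features: emphasize heterogeneous neighbors
            X_new[v] += weight * X[hetero].mean(axis=0)
    return X_new
```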



Paperid:1023
Authors:Mengyuan Yang, Mengying Zhu, Yan Wang, Linxun Chen, Yilei Zhao, Xiuyuan Wang, Bing Han, Xiaolin Zheng, Jianwei Yin
Zhejiang University, China, Zhejiang University, China, School of Computing, Macqaurie University, Australia, MYbank, Ant Group, China, Zhejiang University, China, Zhejiang University, China, MYbank, Ant Group, China, Zhejiang University, China, Zhejiang University, China
Abstract:
Large language model-based explainable recommendation (LLM-based ER) systems can provide remarkable human-like explanations and have received wide attention from researchers. However, the original LLM-based ER systems face three low-quality problems in their generated explanations, i.e., lack of personalization, inconsistency, and questionable explanation data. To address these problems, we propose a novel LLM-based ER model, denoted LLM2ER, to serve as a backbone, and devise two innovative explainable quality reward models for fine-tuning this backbone in a reinforcement learning paradigm, ultimately yielding a fine-tuned model denoted LLM2ER-EQR, which can provide high-quality explanations. LLM2ER-EQR can generate personalized, informative, and consistent high-quality explanations learned from questionable-quality explanation datasets. Extensive experiments conducted on three real-world datasets demonstrate that our model can generate fluent, diverse, informative, and highly personalized explanations.



Paperid:1024
Authors:Yachao Yang, Yanfeng Sun, Shaofan Wang, Jipeng Guo, Junbin Gao, Fujiao Ju, Baocai Yin
Beijing University of Technology, Beijing University of Technology, Beijing University of Technology, Beijing University of Chemical Technology, University of Sydney, Australia, Beijing University of Technology, Beijing University of Technology
Abstract:
Graph Neural Networks (GNNs) have shown great performance in learning representations for graph-structured data. However, recent studies have found that the interference between topology and attributes can lead to distorted node representations. Most GNNs are designed based on homophily assumptions, and thus they cannot be applied to graphs with heterophily. This research critically analyzes the propagation principles of various GNNs and the corresponding challenges from an optimization perspective. A novel GNN called Graph Neural Networks with Soft Association between Topology and Attribute (GNN-SATA) is proposed. Different embeddings are utilized to gain insights into attributes and structures while establishing their interconnections through soft association. Further, as integral components of the soft association, a Graph Pruning Module (GPM) and a Graph Augmentation Module (GAM) are developed. These modules dynamically remove or add edges to the adjacency relationships to make the model better fit graphs with homophily or heterophily. Experimental results on homophilic and heterophilic graph datasets convincingly demonstrate that the proposed GNN-SATA effectively captures more accurate adjacency relationships and outperforms state-of-the-art approaches. Especially on the heterophilic graph dataset Squirrel, GNN-SATA achieves a 2.81% improvement in accuracy, utilizing merely 27.19% of the original number of adjacency relationships. Our code is released at https://github.com/wwwfadecom/GNN-SATA.



Paperid:1025
Authors:Zhen Yang, Zhou Shao, Yuxiao Dong, Jie Tang
Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University
Abstract:
Negative sampling stands as a pivotal technique in dense retrieval, essential for training effective retrieval models and significantly impacting retrieval performance. While existing negative sampling methods have made commendable progress by leveraging hard negatives, a comprehensive guiding principle for constructing negative candidates and designing negative sampling distributions is still lacking. To bridge this gap, we embark on a theoretical analysis of negative sampling in dense retrieval. This exploration culminates in the unveiling of the quasi-triangular principle, a novel framework that elucidates the triangular-like interplay between the query, the positive document, and the negative document. Fueled by this guiding principle, we introduce TriSampler, a straightforward yet highly effective negative sampling method. The key point of TriSampler lies in its ability to selectively sample more informative negatives within a prescribed constrained region. Experimental evaluations show that TriSampler consistently attains superior retrieval performance across a diverse set of representative retrieval models.



Paperid:1026
Authors:Zhiguang Yang, Liufang Sang, Haoran Wang, Wenlong Chen, Lu Wang, Jie He, Changping Peng, Zhangang Lin, Chun Gan, Jingping Shao
JD.com, JD.com, JD.com, JD.com, JD.com, JD.com, JD.com, JD.com, JD.com, JD.com
Abstract:
Creativity is the heart and soul of advertising services. Effective creatives can create a win-win scenario: advertisers reach target users and achieve marketing objectives more effectively, users more quickly find products of interest, and platforms generate more advertising revenue. With the advent of AI-Generated Content, advertisers can now produce vast amounts of creative content at minimal cost. The current challenge lies in how advertising systems can select the most pertinent creative in real time for each user personally. Existing methods typically perform serial ranking of ads or creatives, limiting the creative module in terms of both effectiveness and efficiency. In this paper, we propose for the first time a novel architecture for online parallel estimation of ads and creatives ranking, as well as the corresponding offline joint optimization model. The online architecture enables sophisticated personalized creative modeling while reducing overall latency. The offline joint model for CTR estimation allows mutual awareness and collaborative optimization between ads and creatives. Additionally, we optimize the offline evaluation metrics for the implicit feedback sorting task involved in ad creative ranking. We conduct extensive experiments to compare our approach with two state-of-the-art approaches. The results demonstrate the effectiveness of our approach in both offline evaluations and on real-world online advertising platforms in terms of response time, CTR, and CPM.



Paperid:1027
Authors:Zhirui Yang, Yulan Hu, Sheng Ouyang, Jingyu Liu, Shuqiang Wang, Xibo Ma, Wenhan Wang, Hanjing Su, Yong Liu
Renmin University of China, Renmin University of China School of Artificial Intelligence, University of Chinese Academy of Sciences, Renmin University of China School of Artificial Intelligence, University of Chinese Academy of Sciences, Renmin University of China, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences (CASIA), Tencent Inc, Tencent Inc, Renmin University of China
Abstract:
Among existing spectral GNNs, polynomial-based methods occupy the mainstream in designing a filter through the Laplacian matrix. However, polynomial combinations factored by the Laplacian matrix naturally have limitations in message passing (e.g., over-smoothing). Furthermore, most existing spectral GNNs are based on polynomial bases, which struggle to capture the high-frequency parts of the graph spectral signal. Additionally, we find that even increasing the polynomial order does not change this situation, which means polynomial-based models have a natural deficiency when facing high-frequency signals. To tackle these problems, we propose WaveNet, which aims to effectively capture the high-frequency part of the graph spectral signal from the perspective of wavelet bases by reconstructing the message propagation matrix. We utilize Multi-Resolution Analysis (MRA) to model this question, and our proposed method can reconstruct arbitrary filters theoretically. We also conduct node classification experiments on real-world graph benchmarks and achieve superior performance on most datasets. Our code is available at https://github.com/Bufordyang/WaveNet



Paperid:1028
Authors:Xiaoyu You, Jianwei Xu, Mi Zhang, Zechen Gao, Min Yang
Fudan University, Fudan University, Fudan University, Fudan University, Fudan University
Abstract:
As societies become increasingly aware of data privacy, regulations require that private information about users be removed from both databases and ML models, which is more colloquially called `the right to be forgotten`. Such privacy problems of recommendation systems, which hold large amounts of private data, are drawing increasing attention. Recent research suggests dividing the preference data into multiple shards, training submodels with these shards, and forgetting users' personal preference data by retraining the submodels of marked shards. Despite the improvement in computational efficiency compared with retraining from scratch, the overall recommendation performance deteriorates after dividing the shards because the collaborative information contained in the training data is broken. In this paper, we propose a forgetting framework for recommendation models that neither separates the training data nor jeopardizes the recommendation performance, named Recommendation Reverse Learning (RRL). Given the trained recommendation model and marked preference data, we devise a Reverse BPR Objective (RBPR Objective) to fine-tune the recommendation model and force it to forget the marked data. Nevertheless, as the recommendation model encodes complex collaborative information among users, we propose to utilize the Fisher Information Matrix (FIM) to estimate the influence of reverse learning on other users' collaborative information and to guide the updates of representations. We conduct experiments on two representative recommendation models and three public benchmark datasets to verify the efficiency of RRL. To verify the forgetting completeness, we use RRL to make a recommendation model poisoned by shilling attacks forget its malicious users.
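For intuition, a hedged PyTorch sketch of a reverse BPR-style objective on marked data is given below; the exact RBPR formulation and the FIM-guided representation updates are in the paper, so the sign-flipped margin here is only an assumption about the general shape of the loss.

```python
import torch

def reverse_bpr_loss(pos_scores, neg_scores):
    """Standard BPR maximizes log sigmoid(pos - neg); reversing the margin asks the
    model to stop preferring the marked (to-be-forgotten) interactions."""
    return -torch.log(torch.sigmoid(neg_scores - pos_scores) + 1e-10).mean()

pos = torch.tensor([2.0, 1.5])   # scores of marked interactions to forget
neg = torch.tensor([0.3, 0.1])   # scores of sampled unobserved items
print(reverse_bpr_loss(pos, neg))
```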



Paperid:1029
Authors:Gengrui Zhang, Yao Wang, Xiaoshuang Chen, Hongyi Qian, Kaiqiao Zhan, Ben Wang
Kuaishou Technology, Kuaishou Technology, Kuaishou Technology, Beihang University, Kuaishou Technology, Kuaishou Technology
Abstract:
In recent years, there has been growing interest in utilizing reinforcement learning (RL) to optimize long-term rewards in recommender systems. Since industrial recommender systems are typically designed as multi-stage systems, RL methods with a single agent face challenges when optimizing multiple stages simultaneously: different stages have different observation spaces and thus cannot be modeled by a single agent. To address this issue, we propose a novel UNidirectional-EXecution-based multi-agent Reinforcement Learning (UNEX-RL) framework to reinforce long-term rewards in multi-stage recommender systems. We show that unidirectional execution is a key feature of multi-stage recommender systems, bringing new challenges to the application of multi-agent reinforcement learning (MARL), namely observation dependency and the cascading effect. To tackle these challenges, we provide a cascading information chain (CIC) method to separate independent observations from action-dependent observations and use CIC to train UNEX-RL effectively. We also discuss practical variance reduction techniques for UNEX-RL. Finally, we show the effectiveness of UNEX-RL on both public datasets and an online recommender system with over 100 million users. Specifically, UNEX-RL achieves a 0.558% increase in users' usage time compared with single-agent RL algorithms in online A/B experiments, highlighting its effectiveness in industrial recommender systems.



Paperid:1030
Authors:Hansong Zhang, Shikun Li, Pengju Wang, Dan Zeng, Shiming Ge
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100092, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100092, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100092, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China, Department of Communication Engineering, Shanghai University, Shanghai 200040, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100092, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
Abstract:
Training state-of-the-art (SOTA) deep models often requires extensive data, resulting in substantial training and storage costs. To address these challenges, dataset condensation has been developed to learn a small synthetic set that preserves essential information from the original large-scale dataset. Optimization-oriented methods are currently the primary approach in dataset condensation for achieving SOTA results. However, their bi-level optimization process hinders practical application to realistic and larger datasets. To enhance condensation efficiency, previous works proposed Distribution-Matching (DM) as an alternative, which significantly reduces the condensation cost. Nonetheless, current DM-based methods still fall short of SOTA optimization-oriented methods. In this paper, we argue that existing DM-based methods overlook the higher-order alignment of the distributions, which may lead to sub-optimal matching results. Inspired by this, we present a novel DM-based method named M3D for dataset condensation by Minimizing the Maximum Mean Discrepancy between feature representations of the synthetic and real images. By embedding their distributions in a reproducing kernel Hilbert space, we align all orders of moments of the distributions of real and synthetic images, resulting in a more generalized condensed set. Notably, our method even surpasses the SOTA optimization-oriented method IDC on the high-resolution ImageNet dataset. Extensive analysis is conducted to verify the effectiveness of the proposed method. Source codes are available at https://github.com/Hansong-Zhang/M3D.
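As a concrete reference point, the squared maximum mean discrepancy between two batches of features can be estimated with an RBF kernel as in the toy PyTorch sketch below; the kernel choice and feature shapes are our assumptions for illustration, not the M3D implementation.

    import torch

    def rbf_mmd2(x, y, sigma=1.0):
        # Biased estimate of MMD^2 between samples x and y under an RBF kernel.
        # Matching distributions in the induced RKHS aligns all orders of moments.
        def k(a, b):
            return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
        return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

    real_feats = torch.randn(128, 512)                    # features of real images
    syn_feats = torch.randn(16, 512, requires_grad=True)  # learnable synthetic features
    loss = rbf_mmd2(real_feats, syn_feats)
    loss.backward()                                       # gradients flow into the synthetic set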



Paperid:1031
Authors:Jinghui Zhang, Zhengjia Xu, Dingyang Lv, Zhan Shi, Dian Shen, Jiahui Jin, Fang Dong
Southeast University, Southeast University, Southeast University, Southeast University, Southeast University, Southeast University, Southeast University
Abstract:
Fraud detection on multi-relation graphs aims to identify fraudsters in graphs. Graph Neural Network (GNN) models leverage graph structures to pass messages from neighbors to target nodes, thereby enriching the representations of those target nodes. However, feature and structural inconsistency in the graph, owing to fraudsters' camouflage behaviors, diminishes the suspiciousness of fraud nodes and hinders the effectiveness of GNN-based models. In this work, we propose DiG-In-GNN, Discriminative Feature Guided GNN against Inconsistency, to dig into graphs for fraudsters. Specifically, we use multi-scale contrastive learning from the perspective of the neighborhood subgraph in which the target node is located to generate guidance nodes that cope with the feature inconsistency. Then, guided by the guidance nodes, we conduct fine-grained neighbor selection through reinforcement learning for each neighbor node to precisely filter nodes that can enhance message passing and thereby alleviate structural inconsistency. Finally, the two modules are integrated to obtain discriminable representations of the nodes. Experiments on three fraud detection datasets demonstrate the superiority of the proposed method DiG-In-GNN, which obtains up to 20.73% improvement over previous state-of-the-art methods. Our code can be found at https://github.com/GraphBerry/DiG-In-GNN.



Paperid:1032
Authors:Lingzi Zhang, Xin Zhou, Zhiwei Zeng, Zhiqi Shen
School of Computer Science and Engineering, Nanyang Technological University, Singapore Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University, Singapore, Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University, Singapore, School of Computer Science and Engineering, Nanyang Technological University, Singapore, School of Computer Science and Engineering, Nanyang Technological University, Singapore Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University, Singapore
Abstract:
Recent advances in sequential recommendation models have demonstrated the efficacy of integrating pre-trained text embeddings with item ID embeddings to achieve superior performance. However, our study takes a unique perspective by focusing exclusively on the untapped potential of text embeddings, obviating the need for ID embeddings. We begin by applying a pre-processing strategy known as whitening, which effectively transforms the anisotropic semantic space of pre-trained text embeddings into an isotropic Gaussian distribution. Comprehensive experiments reveal that applying whitening to pre-trained text embeddings in sequential recommendation models significantly enhances performance. Yet, a full whitening operation might break the underlying manifold of items with similar text semantics. To retain the original semantics while benefiting from the isotropy of the whitened text features, we propose a Dual-view Whitening method for Sequential Recommendation (DWSRec), which leverages both fully whitened and relaxed whitened item representations as dual views for effective recommendations. We further examine the advantages of our approach through both empirical and theoretical analyses. Experiments on three public benchmark datasets show that DWSRec outperforms state-of-the-art methods for sequential recommendation.
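For intuition, a basic ZCA-style whitening of pre-trained embeddings can be written as below; the shrink knob that softens the rescaling is only one plausible way to read a "relaxed" whitening and is our assumption, not the DWSRec definition.

    import numpy as np

    def zca_whiten(X, eps=1e-5, shrink=0.0):
        # shrink=0 gives a full whitening; 0 < shrink < 1 softens the rescaling,
        # keeping some of the original (anisotropic) covariance structure.
        Xc = X - X.mean(axis=0, keepdims=True)
        cov = Xc.T @ Xc / len(X)
        evals, evecs = np.linalg.eigh(cov)
        scale = (evals + eps) ** (-(1.0 - shrink) / 2.0)
        return Xc @ evecs @ np.diag(scale) @ evecs.T

    emb = np.random.randn(1000, 768)            # anisotropic pre-trained text embeddings
    full_view = zca_whiten(emb)                 # isotropic view
    relaxed_view = zca_whiten(emb, shrink=0.5)  # semantics-preserving view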



Paperid:1033
Authors:Linhao Zhang, Li Jin, Guangluan Xu, Xiaoyu Li, Cai Xu, Kaiwen Wei, Nayu Liu, Haonan Liu
Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Aerospace Information Research Institute Chinese Academy of Sciences; Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, School of Computer Science and Technology, Xidian University, Aerospace Information Research Institute,Chinese Academy of Sciences; Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, School of Computer Science and Technology, Tiangong University, Aerospace Information Research Institute,Chinese Academy of Sciences; Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences
Abstract:
Understanding the emotional polarity of multimodal content with metaphorical characteristics, such as memes, poses a significant challenge in Multimodal Emotion Recognition (MER). Previous MER research has overlooked the phenomenon of metaphorical alignment in multimedia content, which involves non-literal associations between concepts to convey implicit emotional tones. Metaphor-agnostic MER methods may be misled by isolated unimodal emotions, which are distinct from the real emotions blended in multimodal metaphors. Moreover, contextual semantics can further affect the emotions associated with similar metaphors, leading to the challenge of maintaining contextual compatibility. To address the issue of metaphorical alignment in MER, we propose to leverage a conditional generative approach for capturing metaphorical analogies. Our approach formulates schematic prompts and corresponding references based on theoretical foundations, which allows the model to better grasp metaphorical nuances. To maintain contextual sensitivity, we incorporate a disentangled contrastive matching mechanism, which undergoes curricular adjustment to regulate its intensity during the learning process. Automatic and human evaluation experiments on two benchmarks demonstrate that our model provides considerable and stable improvements in recognizing multimodal emotion with metaphor attributes.



Paperid:1034
Authors:Qin Zhang, Xiaowei Li, Jiexin Lu, Liping Qiu, Shirui Pan, Xiaojun Chen, Junyang Chen
Shenzhen University, Shenzhen University, Shenzhen University, Shenzhen University, Griffith University, Shenzhen University, Shenzhen University
Abstract:
Open-set graph learning is a practical task that aims to classify known-class nodes and to identify unknown-class samples as unknowns. Conventional node classification methods usually perform unsatisfactorily in open-set scenarios due to the complex data they encounter, such as out-of-distribution (OOD) data and in-distribution (IND) noise. OOD data are samples that do not belong to any known class: they are outliers if they occur in training (OOD noise), and open-set samples if they occur in testing. IND noise refers to training samples that are assigned incorrect labels. Both IND noise and OOD noise are prevalent and usually cause ambiguity problems, including the intra-class variety problem and the inter-class confusion problem. Exploring robust open-set learning methods is therefore necessary yet difficult, and it becomes even more difficult for non-IID graph data. To this end, we propose a unified framework named ROG_PL to achieve robust open-set learning on complex noisy graph data by introducing prototype learning. Specifically, ROG_PL consists of two modules, i.e., denoising via label propagation and open-set prototype learning via regions. The first module corrects noisy labels through similarity-based label propagation and removes low-confidence samples, to solve the intra-class variety problem caused by noise. The second module learns open-set prototypes for each known class via non-overlapping regions and retains both interior and border prototypes to remedy the inter-class confusion problem. The two modules are iteratively updated under the constraints of a classification loss and a prototype diversity loss. To the best of our knowledge, ROG_PL is the first robust open-set node classification method for graph data with complex noise. Experimental evaluations on several benchmark graph datasets demonstrate its good performance.
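The denoising module is described only at a high level; the following numpy sketch of generic similarity-based label propagation (our own illustration, not ROG_PL) shows the kind of operation involved: smooth one-hot labels over a row-normalized affinity matrix, relabel nodes by the propagated distribution, and drop low-confidence nodes.

    import numpy as np

    def propagate_and_clean(W, Y, alpha=0.9, iters=50, tau=0.6):
        P = W / W.sum(axis=1, keepdims=True)        # row-normalized affinities
        F = Y.astype(float).copy()
        for _ in range(iters):
            F = alpha * (P @ F) + (1 - alpha) * Y   # smooth while anchoring to given labels
        F = F / F.sum(axis=1, keepdims=True)
        corrected = F.argmax(axis=1)                # propagated (possibly corrected) label
        keep = F.max(axis=1) >= tau                 # confidence filter for removal of low-confidence nodes
        return corrected, keep

    W = np.random.rand(6, 6); W = (W + W.T) / 2     # toy symmetric affinity matrix
    Y = np.eye(3)[[0, 0, 1, 1, 2, 2]]               # one-hot (possibly noisy) labels
    labels, keep = propagate_and_clean(W, Y)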



Paperid:1035
Authors:Shengzhe Zhang, Liyi Chen, Chao Wang, Shuangli Li, Hui Xiong
University of Science and Technology of China, University of Science and Technology of China, HKUST Fok Ying Tung Research Institute, University of Science and Technology of China, Hong Kong University of Science and Technology
Abstract:
Sequential recommendation is a crucial task for understanding users' evolving interests and predicting their future behaviors. While existing approaches based on sequence or graph modeling of users' interaction sequences have shown promising performance, effectively exploiting temporal information and dealing with the uncertainty noise in evolving user behaviors remain quite challenging. To this end, in this paper we propose a Temporal Graph Contrastive Learning method for Sequential Recommendation (TGCL4SR), which leverages not only local interaction sequences but also global temporal graphs to comprehend item correlations and analyze user behaviors from a temporal perspective. Specifically, we first devise a Temporal Item Transition Graph (TITG) that fully leverages global interactions to understand item correlations, and augment this graph by dual transformations based on neighbor sampling and time disturbance. Accordingly, we design a Temporal item Transition graph Convolutional network (TiTConv) to capture temporal item transition patterns in TITG. Then, a novel Temporal Graph Contrastive Learning (TGCL) mechanism is designed to enhance the uniformity of representations between graphs augmented from identical sequences. For local interaction sequences, we design a temporal sequence encoder that incorporates time interval embeddings into the Transformer architecture. At the training stage, we take maximum mean discrepancy and TGCL losses as auxiliary objectives. Extensive experiments on several real-world datasets show the effectiveness of TGCL4SR against state-of-the-art baselines for sequential recommendation.



Paperid:1036
Authors:Xinni Zhang, Yankai Chen, Chenhao Ma, Yixiang Fang, Irwin King
The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, Shenzhen, The Chinese University of Hong Kong, Shenzhen, The Chinese University of Hong Kong
Abstract:
Personalized recommender systems have found widespread application in effective information filtering. Conventional models engage in knowledge mining within a static setting to reconstruct singular historical data. Nonetheless, real-world environments are in a constant state of flux, rendering acquired model knowledge inadequate for accommodating emergent trends and thus leading to notable declines in recommendation performance. Given the typically prohibitive cost of exhaustive model retraining, incremental learning for recommender systems with ever-growing data has emerged as an active research topic. In this paper, we propose an effective model-agnostic framework, namely INFluential Exemplar Replay (INFER). INFER facilitates recommender models in retaining earlier assimilated knowledge, e.g., users' enduring preferences, while concurrently accommodating evolving trends manifested in users' new interaction behaviors. We commence with a vanilla implementation that centers on identifying the most representative data samples for effective consolidation of early knowledge. Subsequently, we propose an advanced solution, namely INFERONCE, to optimize the computational overhead associated with the vanilla implementation. Extensive experiments on four prototypical backbone models, two classic recommendation tasks, and four widely used benchmarks consistently demonstrate the effectiveness of our method as well as its compatibility with extensions to several incremental recommender models.



Paperid:1037
Authors:Yichi Zhang, Zhihao Duan, Ming Lu, Dandan Ding, Fengqing Zhu, Zhan Ma
Hangzhou Normal University, Hangzhou, Zhejiang, China Purdue University, West Lafayette, Indiana, U.S, Purdue University, West Lafayette, Indiana, U.S, Nanjing University, Nanjing, Jiangsu, China, Hangzhou Normal University, Hangzhou, Zhejiang, China, Purdue University, West Lafayette, Indiana, U.S, Nanjing University, Nanjing, Jiangsu, China
Abstract:
While convolution and self-attention are extensively used in learned image compression (LIC) for transform coding, this paper proposes an alternative called Contextual Clustering based LIC (CLIC), which primarily relies on clustering operations and local attention for correlation characterization and compact representation of an image. CLIC expands the receptive field to the entire image for intra-cluster feature aggregation. Afterward, features are reordered to their original spatial positions and passed through local attention units for inter-cluster embedding. Additionally, we introduce Guided Post-Quantization Filtering (GuidedPQF) into CLIC, effectively mitigating the propagation and accumulation of quantization errors at the initial decoding stage. Extensive experiments demonstrate the superior performance of CLIC over state-of-the-art works: when optimized for MSE, it outperforms VVC by about 10% BD-Rate on three widely used benchmark datasets; when optimized for MS-SSIM, it saves more than 50% BD-Rate over VVC. CLIC offers a new way to generate compact representations for image compression and provides a novel direction along the line of LIC development.



Paperid:1038
Authors:Yiqian Zhang, Yinfu Feng, Wen-Ji Zhou, Yunan Ye, Min Tan, Rong Xiao, Haihong Tang, Jiajun Ding, Jun Yu
Hangzhou Dianzi University, Alibaba International Digital Commerce Group, Alibaba International Digital Commerce Group, Alibaba International Digital Commerce Group, Hangzhou Dianzi University, Alibaba International Digital Commerce Group, Alibaba International Digital Commerce Group, Hangzhou Dianzi University, Hangzhou Dianzi University
Abstract:
Building click-through rate (CTR) and conversion rate (CVR) prediction models for cross-border e-commerce search requires modeling the correlations among multiple domains. Existing multi-domain methods suffer severely from poor scalability and low efficiency as the number of domains increases. To this end, we propose a Domain-Aware Multi-view mOdel (DAMO), which is domain-number-invariant, to effectively leverage cross-domain relations from a multi-view perspective. Specifically, instead of working in the original feature space defined by different domains, DAMO maps everything to a new low-rank multi-view space. To achieve this, DAMO first extracts multi-domain features in an explicit feature-interactive manner. These features are passed to a multi-view extractor to obtain view-invariant and view-specific features. A multi-view predictor then takes these two sets of features and outputs view-based predictions. To enforce view-awareness in the predictor, we further propose a lightweight view-attention estimator to dynamically learn the optimal view-specific weights w.r.t. a view-guided loss. Extensive experiments on public and industrial datasets show that, compared with state-of-the-art models, DAMO achieves better performance with lower storage and computational costs. In addition, deploying DAMO to a large-scale cross-border e-commerce platform leads to 1.21%, 1.76%, and 1.66% improvements over the existing CGC-based model in online A/B testing in terms of CTR, CVR, and Gross Merchandise Value, respectively.



Paperid:1039
Authors:Zhaofan Zhang, Yanan Xiao, Lu Jiang, Dingqi Yang, Minghao Yin, Pengyang Wang
University of Macau, Northeast Normal University, Dalian Maritime University, University of Macau, Northeast Normal University, University of Macau
Abstract:
In the realm of human mobility, the decision-making process for selecting the next location to visit is intricately influenced by a trade-off between spatial and temporal constraints, which reflect individual needs and preferences. This trade-off, however, varies across individuals, making the modeling of these spatial-temporal dynamics a formidable challenge. To address this problem, we introduce the "Spatial-temporal Induced Hierarchical Reinforcement Learning" (STI-HRL) framework for capturing the interplay between spatial and temporal factors in human mobility decision-making. Specifically, STI-HRL employs a two-tiered decision-making process: the low level focuses on disentangling spatial and temporal preferences using dedicated agents, while the high level integrates these considerations to finalize the decision. To complement the hierarchical decision setting, we construct a hypergraph to organize historical data, encapsulating the multi-aspect semantics of human mobility. We propose a cross-channel hypergraph embedding module to learn representations that serve as states to facilitate the decision-making cycle. Extensive experiments on two real-world datasets validate the superiority of STI-HRL over state-of-the-art methods in predicting users' next visits across various performance metrics.



Paperid:1040
Authors:Yongsen Zheng, Ziliang Chen, Jinghui Qin, Liang Lin
Sun Yat-sen University, Jinan University, Guangdong University of Technology, Sun Yat-sen University
Abstract:
The filter bubble is a notorious issue in Recommender Systems (RSs): users are exposed to a limited and narrow range of information or content that reinforces their existing dominant preferences and beliefs, resulting in a lack of exposure to diverse and varied content. Many existing works have examined filter bubbles in static or relatively static recommendation settings. However, filter bubbles are continuously intensified over time due to the feedback loop between the user and the system in real-world online recommendation. To address these issues, we propose a novel paradigm, Multi-Facet Preference Learning for Pricking Filter Bubbles in Conversational Recommender System (FacetCRS), which aims to burst filter bubbles in the conversational recommender system (CRS) through timely user-item interactions via natural language conversations. By considering diverse user preferences and intentions, FacetCRS automatically models user preferences across multiple facets, including entity-, word-, context-, and review-facets, to capture diverse and dynamic user preferences and prick filter bubbles in the CRS. It is an end-to-end CRS framework that adaptively learns representations of various levels of preference facets and diverse types of external knowledge. Extensive experiments on two publicly available benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance in mitigating filter bubbles and enhancing recommendation quality in CRS.



Paperid:1041
Authors:Fan Zhou, Chen Pan, Lintao Ma, Yu Liu, Siqiao Xue, James Zhang, Jun Zhou, Hongyuan Mei, Weitao Lin, Zi Zhuang, Wenxin Ning, Yunhua Hu
Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, TTIC, Chicago USA, Ant Group, Ant Group, Ant Group, Ant Group
Abstract:
Time series forecasts at different temporal granularities are widely used in real-world applications, e.g., sales prediction in days and weeks for making different inventory plans. However, these tasks are usually solved separately without ensuring coherence, which is crucial for aligning downstream decisions. Previous works mainly focus on ensuring coherence with straightforward methods, e.g., aggregation from fine-granularity forecasts to coarse ones, or allocation from coarse granularity to fine. These methods merely use the temporal hierarchical structure to maintain coherence without improving forecasting accuracy. In this paper, we propose a novel granularity message-passing mechanism (GMP) that leverages temporal hierarchy information to improve forecasting performance, together with an adaptive reconciliation (AR) strategy that maintains coherence without performance loss. Furthermore, we introduce an optimization module to achieve task-based targets while adhering to more real-world constraints. Experiments on real-world datasets demonstrate that our framework (GMP-AR) achieves superior performance on temporal hierarchical forecasting tasks compared to state-of-the-art methods. In addition, our framework has been successfully applied to a real-world payment traffic management task in Alipay by integration with the task-based optimization module.
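As a minimal illustration of coherence (not the GMP-AR mechanism itself): daily and weekly forecasts are coherent when the daily values sum to the weekly one, and the simplest proportional reconciliation rescales the fine-grained forecasts to match the coarse one.

    import numpy as np

    daily = np.array([12., 10., 9., 11., 14., 20., 18.])   # fine-granularity forecasts
    weekly = 100.0                                          # coarse-granularity forecast
    reconciled_daily = daily * (weekly / daily.sum())       # now sums exactly to the weekly value
    assert abs(reconciled_daily.sum() - weekly) < 1e-9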



Paperid:1042
Authors:Qiang Zhou, Xinjiang Lu, Jingjing Gu, Zhe Zheng, Bo Jin, Jingbo Zhou
Nanjing University of Aeronautics and Astronautics, Nanjing, China, Baidu Research, Beijing, China, Nanjing University of Aeronautics and Astronautics, Nanjing, China, Nanjing University of Aeronautics and Astronautics, Nanjing, China, Dalian University of Technology, Dalian, China, Baidu Research, Beijing, China
Abstract:
Origin-destination (OD) crowd flow, if inferred accurately at a fine-grained level, has the potential to enhance the efficacy of various urban applications. In practice, however, effectively mining OD crowd flow requires spatially interpolating it because of inevitable missing values. This problem is further complicated by the inherent scarcity and noisy nature of OD crowd flow data. In this paper, we propose an uncertainty-aware, interpolative, and explainable framework, namely UApex, for reliable and trustworthy OD crowd flow interpolation. Specifically, we first design a Variational Multi-modal Recurrent Graph Auto-Encoder (VMR-GAE) for uncertainty-aware OD crowd flow interpolation. A key idea is to formulate the problem as semi-supervised learning on directed graphs. Next, to mitigate data scarcity, we incorporate a distribution alignment mechanism that introduces supplementary modalities into variational inference. Then, a dedicated decoder with a Poisson prior is proposed for OD crowd flow interpolation. Moreover, to make VMR-GAE more trustworthy, we develop an efficient and uncertainty-aware explainer that provides explanations from the spatiotemporal topology perspective via the Shapley value. Extensive experiments on two real-world datasets validate that VMR-GAE outperforms state-of-the-art baselines. An exploratory empirical study also shows that the proposed explainer can generate meaningful spatiotemporal explanations.



Paperid:1043
Authors:Wei Zhou, Hong Huang, Ruize Shi, Kehan Yin, Hai Jin
National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology Huazhong University of Science and Technology, Wuhan, 430074, China, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology Huazhong University of Science and Technology, Wuhan, 430074, China, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology Huazhong University of Science and Technology, Wuhan, 430074, China, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology Huazhong University of Science and Technology, Wuhan, 430074, China, National Engineering Research Center for Big Data Technology and System Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology Huazhong University of Science and Technology, Wuhan, 430074, China
Abstract:
Heterogeneous Graph Neural Networks (HGNNs) play a vital role in advancing graph representation learning by addressing the complexities arising from diverse data types and interconnected relationships in real-world scenarios. However, traditional HGNNs face challenges when applied to large-scale graphs due to the necessity of training or inferring on the entire graph. As the size of the heterogeneous graph increases, the time and memory overhead required by these models escalates rapidly, even reaching unacceptable levels. To address this issue, we present a novel framework named SubInfer, which conducts training and inference on subgraphs instead of entire graphs, thereby handling large-scale heterogeneous graphs efficiently. The proposed framework comprises three main steps: 1) partitioning the heterogeneous graph from multiple perspectives to preserve various semantic information, 2) completing the subgraphs to improve the convergence speed of subgraph training and the performance of subgraph inference, and 3) training and inferring the HGNN model on distributed clusters to further reduce the time overhead. The framework is applicable to the vast majority of HGNN models. Experiments on five benchmark datasets demonstrate that SubInfer effectively optimizes the training and inference phases, delivering comparable performance to traditional HGNN models while significantly reducing time and memory overhead.



Paperid:1044
Authors:Yasunori Akagi, Naoki Marumo, Takeshi Kurashima
NTT Human Informatics Laboratories, NTT Corporation, NTT, NTT Corporation
Abstract:
Time-inconsistency is a characteristic of human behavior in which people plan for long-term benefits but take actions that deviate from the plan due to conflicts with short-term benefits. Such time-inconsistent behavior is believed to be caused by present bias, a tendency to overestimate immediate rewards and underestimate future rewards. Investigating the relationship between present bias and time-inconsistency is essential in behavioral economics. In this paper, we propose a model for analyzing the behavior of agents with present bias in tasks that require making progress toward a goal over a specific period. Unlike previous models, the state sequence of the agent can be described analytically in our model. Based on this property, we analyze three crucial problems related to agents under present bias: task abandonment, optimal goal setting, and optimal reward scheduling. Extensive analysis reveals how present bias affects the conditions under which task abandonment occurs and the optimal intervention strategies. Our findings are meaningful for preventing task abandonment and intervening through incentives in the real world.
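The abstract does not state the specific functional form used; for orientation only, the standard quasi-hyperbolic (beta-delta) formalization of present bias is sketched below, and the paper's own model may differ.

    % Quasi-hyperbolic (beta-delta) discounting, a standard model of present bias
    % (background illustration only).
    \[
      U_t \;=\; r_t \;+\; \beta \sum_{k \ge 1} \delta^{k}\, r_{t+k},
      \qquad 0 < \beta < 1,\; 0 < \delta \le 1 .
    \]
    % With beta < 1 the agent overweights the immediate reward r_t, so a plan made
    % at time t-1 (when r_t was still a future reward) may be abandoned at time t,
    % which is exactly the time-inconsistency discussed above.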



Paperid:1045
Authors:Ioannis Anagnostides, Ioannis Panageas, Gabriele Farina, Tuomas Sandholm
Carnegie Mellon University, University of California Irvine, Massachusetts Institute of Technology, Carnegie Mellon University Strategy Robot, Inc. Optimized Markets, Inc. Strategic Machine, Inc.
Abstract:
Policy gradient methods enjoy strong practical performance in numerous reinforcement learning tasks. Their theoretical understanding in multi-agent settings, however, remains limited, especially beyond two-player competitive and potential Markov games. In this paper, we develop a new framework to characterize optimistic policy gradient methods in multi-player Markov games with a single controller. Specifically, under the further assumption that the game exhibits an equilibrium collapse, in that the marginals of coarse correlated equilibria (CCE) induce Nash equilibria (NE), we show convergence to stationary epsilon-NE in O(1/epsilon^2) iterations, where O suppresses polynomial factors in the natural parameters of the game. Such an equilibrium collapse is well known to occur in two-player zero-sum Markov games, but it also occurs in a class of multi-player Markov games with separable interactions, as established by recent work. As a result, we bypass known complexity barriers for computing stationary NE when either of our assumptions fails. Our approach relies on a natural generalization of the classical Minty property that we introduce, which we anticipate will have further applications beyond Markov games.



Paperid:1046
Authors:Elliot Anshelevich, Aris Filos-Ratsikas, Christopher Jerrett, Alexandros A. Voudouris
Rensselaer Polytechnic Institute, USA, University of Edinburgh, United Kingdom, Rensselaer Polytechnic Institute, USA, University of Essex, United Kingdom
Abstract:
We consider a social choice setting in which agents and alternatives are represented by points in a metric space, and the cost of an agent for an alternative is the distance between the corresponding points. The goal is to choose a single alternative to (approximately) minimize the social cost (the total cost of all agents) or the maximum cost of any agent, when only limited information about the preferences of the agents is given. Previous work has shown that the best possible distortion one can hope to achieve is 3 when only the ordinal preferences of the agents are given, even when the distances between alternatives in the metric space are known. We improve upon this bound of 3 by designing deterministic mechanisms that exploit a bit of cardinal information. We show that it is possible to achieve distortion 1+sqrt(2) by using the ordinal preferences of the agents, the distances between alternatives, and a threshold approval set per agent that contains all alternatives whose cost is within an appropriately chosen factor of the agent's cost for her most-preferred alternative. We show that this bound is the best possible for any deterministic mechanism in general metric spaces, and we also provide improved bounds for the fundamental case of a line metric.
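For reference, the distortion of a deterministic mechanism M is the standard worst-case ratio below; the definition is well established in the metric social choice literature, and only its transcription here is ours.

    % Distortion with respect to the social cost SC(x, d) = sum_i d(i, x):
    \[
      \mathrm{dist}(M) \;=\; \sup_{\succ}\; \sup_{d \text{ consistent with } \succ}\;
      \frac{\mathrm{SC}\bigl(M(\succ), d\bigr)}{\min_{y} \mathrm{SC}(y, d)},
      \qquad \mathrm{SC}(x, d) = \sum_{i} d(i, x),
    \]
    % where d ranges over all metrics consistent with the reported information
    % (ordinal preferences and, for the mechanisms above, the known distances
    % between alternatives and the threshold approval sets).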



Paperid:1047
Authors:Haris Aziz, Xinhang Lu, Mashbat Suzuki, Jeremy Vollen, Toby Walsh
UNSW Sydney, UNSW Sydney, UNSW Sydney, UNSW Sydney, UNSW Sydney
Abstract:
In pursuit of participatory budgeting (PB) outcomes with broader fairness guarantees, we initiate the study of lotteries over discrete PB outcomes. As the projects have heterogeneous costs, the amount spent may not be equal ex ante and ex post. To address this, we develop a technique to bound the amount by which the ex-post spend differs from the ex-ante spend; this property is termed budget balanced up to one project (BB1). With respect to fairness, we take a best-of-both-worlds perspective, seeking outcomes that are both ex-ante and ex-post fair. Towards this goal, we initiate a study of ex-ante fairness properties in PB, including Individual Fair Share (IFS), Unanimous Fair Share (UFS) and their stronger variants, as well as Group Fair Share (GFS). We show several incompatibility results between these ex-ante fairness notions and existing ex-post concepts based on justified representation. One of our main contributions is a randomized algorithm that simultaneously satisfies ex-ante Strong UFS, ex-post full justified representation (FJR), and ex-post BB1 for PB with binary utilities.



Paperid:1048
Authors:Haris Aziz, Isaiah Iliffe, Bo Li, Angus Ritossa, Ankang Sun, Mashbat Suzuki
UNSW Sydney, UNSW Sydney, Hong Kong Polytechnic University, UNSW Sydney, Hong Kong Polytechnic University, UNSW Sydney
Abstract:
Envy-freeness is one of the most important fairness concerns when allocating items. We study envy-free house allocation when agents have uncertain preferences over items and consider several well-studied preference uncertainty models. The central problem that we focus on is computing an allocation that has the highest probability of being envy-free. We show that each model leads to a distinct set of algorithmic and complexity results, including detailed results on (in-)approximability. En route, we consider two related problems of checking whether there exists an allocation that is possibly or necessarily envy-free. We give a complete picture of the computational complexity of these two problems for all the uncertainty models we consider.



Paperid:1049
Authors:Ian Ball, James Bono, Justin Grana, Nicole Immorlica, Brendan Lucier, Aleksandrs Slivkins
MIT, Microsoft, Edge and Node, Microsoft Research, Microsoft Research, Microsoft Research
Abstract:
We develop a model of content filtering as a game between the filter and the content consumer, where the latter incurs information costs for examining the content. Motivating examples include censoring misinformation, spam and phishing filtering, and recommender systems acting on a stream of content. When the attacker is exogenous, we show that improving the filter's quality is weakly Pareto improving, but has no impact on equilibrium payoffs until the filter becomes sufficiently accurate. Further, if the filter does not internalize the consumer's information costs, its lack of commitment power may render it useless and lead to inefficient outcomes. When the attacker is also strategic, improvements in filter quality may decrease equilibrium payoffs.



Paperid:1050
Authors:Siddharth Barman, Umang Bhaskar, Yeshwant Pandit, Soumyajit Pyne
Indian Institute of Science, TIFR Mumbai, TIFR Mumbai, TIFR Mumbai
Abstract:
Equitability (EQ) in fair division requires that items be allocated such that all agents value the bundle they receive equally. With indivisible items, an equitable allocation may not exist, and hence we instead consider a meaningful analog, EQx, which requires equitability up to any item. EQx allocations exist for monotone, additive valuations. However, if (1) the agents' valuations are not additive or (2) the set of indivisible items includes both goods and chores (positively and negatively valued items), then prior to this work it was not known whether EQx allocations exist. We study both the existence and the efficient computation of EQx allocations. (1) For monotone valuations (not necessarily additive), we show that EQx allocations always exist. Also, for the large class of weakly well-layered valuations, EQx allocations can be found in polynomial time. Further, we prove that approximately EQx allocations can be computed efficiently under general monotone valuations. (2) For non-monotone valuations, we show that an EQx allocation may not exist, even for two agents with additive valuations. Under some special cases, however, we show existence and efficient computability of EQx allocations. This includes the case of two agents with additive valuations where each item is either a good or a chore, and there are no mixed items.
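For orientation, the goods-only version of EQx can be stated as follows (the paper also treats chores and mixed items, where the condition is adapted; exact phrasings vary slightly across the literature, so this transcription is ours).

    % Equitability up to any item (EQx), goods-only case with monotone valuations:
    % an allocation (A_1, ..., A_n) is EQx if for all agents i, j and every item g in A_j,
    \[
      v_i(A_i) \;\ge\; v_j\bigl(A_j \setminus \{g\}\bigr).
    \]
    % Exact equitability would instead require v_i(A_i) = v_j(A_j) for all i, j.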



Paperid:1051
Authors:Omer Ben-Porat, Yishay Mansour, Michal Moshkovitz, Boaz Taitler
Technion---Israel Institute of Technology, Tel Aviv University Google Research, Bosch Center for Artificial Intelligence, Technion---Israel Institute of Technology
Abstract:
Principal-agent problems arise when one party acts on behalf of another, leading to conflicts of interest. The economic literature has extensively studied principal-agent problems, and recent work has extended this to more complex scenarios such as Markov Decision Processes (MDPs). In this paper, we further explore this line of research by investigating how reward shaping under budget constraints can improve the principal's utility. We study a two-player Stackelberg game where the principal and the agent have different reward functions, and the agent chooses an MDP policy for both players. The principal offers an additional reward to the agent, and the agent picks their policy selfishly to maximize their reward, which is the sum of the original and the offered reward. Our results establish the NP-hardness of the problem and offer polynomial approximation algorithms for two classes of instances: stochastic trees and deterministic decision processes with a finite horizon.



Paperid:1052
Authors:Vittorio Bilò, Cosimo Vinci
Università del Salento, Università del Salento
Abstract:
We address the problem of improving the worst-case efficiency of pure Nash equilibria (a.k.a. the price of anarchy) in affine congestion games through a novel use of signalling. We assume that, for each player in the game, a most preferred strategy is publicly signalled. This can be done either in a distributed fashion by the players themselves, or as the outcome of some centralized algorithm. We apply this signalling scheme to two well-studied scenarios: games with partially altruistic players and games with resource taxation. We show a significant improvement in the price of anarchy of these games whenever the aggregate signalled strategy profile is a good approximation of the game's social optimum.



Paperid:1053
Authors:Niclas Boehmer, Markus Brill, Alfonso Cevallos, Jonas Gehrlein, Luis Sánchez-Fernández, Ulrike Schmidt-Kraepelin
Harvard University, University of Warwick, Web3 Foundation, Web3 Foundation, Universidad Carlos III de Madrid, TU Eindhoven
Abstract:
We provide the first large-scale data collection of real-world approval-based committee elections. These elections have been conducted on the Polkadot blockchain as part of its Nominated Proof-of-Stake mechanism and contain around one thousand candidates and tens of thousands of (weighted) voters each. We conduct an in-depth study of application-relevant questions, including a quantitative and qualitative analysis of the outcomes returned by different voting rules. Besides considering proportionality measures that are standard in the multiwinner voting literature, we pay particular attention to less-studied measures of overrepresentation, as these are closely related to the security of the Polkadot network. We also analyze how different design decisions, such as the committee size, affect the examined measures.



Paperid:1054
Authors:Markus Brill, Jannik Peters
University of Warwick, TU Berlin
Abstract:
When selecting committees based on preferences of voters, a variety of different criteria can be considered. Two natural objectives are maximizing the utilitarian welfare (the sum of voters' utilities) and coverage (the number of represented voters) of the selected committee. Previous work has studied the impact on utilitarian welfare and coverage when requiring the committee to satisfy minimal requirements such as justified representation or weak proportionality. In this paper, we consider the impact of imposing much more demanding proportionality axioms. We identify a class of voting rules that achieve strong guarantees on utilitarian welfare and coverage when combined with appropriate completions. This class is defined via a weakening of priceability and contains prominent rules such as the Method of Equal Shares. We show that committees selected by these rules (i) can be completed to achieve optimal coverage and (ii) can be completed to achieve an asymptotically optimal approximation to the utilitarian welfare if they additionally satisfy EJR+. Answering an open question of Elkind et al. (2022), we use the Greedy Justified Candidate Rule to obtain the best possible utilitarian guarantee subject to proportionality. We also consider completion methods suggested in the participatory budgeting literature and other objectives besides welfare and coverage.



Paperid:1055
Authors:Martin Bullinger, René Romen
Department of Computer Science, University of Oxford, School of Computation, Information and Technology, Technical University of Munich
Abstract:
Coalition formation is concerned with the question of how to partition a set of agents into disjoint coalitions according to their preferences. Deviating from most of the previous work, we consider an online variant of the problem, where agents arrive in sequence and whenever an agent arrives, they have to be assigned to a coalition immediately and irrevocably. The scarce existing literature on online coalition formation has focused on the objective of maximizing social welfare, a demanding requirement, even in the offline setting. Instead, we seek to achieve stable coalition structures in an online setting, and focus on stability concepts based on deviations by single agents. We present a comprehensive picture in additively separable hedonic games, leading to dichotomies, where positive results are obtained by deterministic algorithms and negative results even hold for randomized algorithms.



Paperid:1056
Authors:Martin Bullinger, Chris Dong, Patrick Lederer, Clara Mehler
University of Oxford, Technical University of Munich, Technical University of Munich, Technical University of Munich
Abstract:
In approval-based committee (ABC) voting, the goal is to choose a subset of the candidates of predefined size based on the voters' approval preferences over the candidates. While this problem has attracted significant attention in recent years, the incentives for voters to participate in an election under a given ABC voting rule have been neglected so far. This paper is thus the first to explicitly study this property, typically called participation, for ABC voting rules. In particular, we show that all ABC scoring rules even satisfy group participation, whereas most sequential rules severely fail participation. We furthermore explore several escape routes from the impossibility for sequential ABC voting rules: we prove for many sequential rules that (i) they satisfy participation on laminar profiles, (ii) voters who approve none of the elected candidates cannot benefit by abstaining, and (iii) it is NP-hard for a voter to decide whether she benefits from abstaining.



Paperid:1057
Authors:Jakob Burkhardt, Ioannis Caragiannis, Karl Fehrs, Matteo Russo, Chris Schwiegelshohn, Sudarshan Shyam
Aarhus University, Aarhus University, Aarhus University, Sapienza University of Rome, Aarhus University, Aarhus University
Abstract:
Motivated by recent work in computational social choice, we extend the metric distortion framework to clustering problems. Given a set of n agents located in an underlying metric space, our goal is to partition them into k clusters, optimizing some social cost objective. The metric space is defined by a distance function d between the agent locations. Information about d is available only implicitly via n rankings, through which each agent ranks all other agents in terms of their distance from her. Still, even though no cardinal information (i.e., the exact distance values) is available, we would like to evaluate clustering algorithms in terms of social cost objectives that are defined using d. This is done using the notion of distortion, which measures how far from optimality a clustering can be, taking into account all underlying metrics that are consistent with the available ordinal information. Unfortunately, the most important clustering objectives (e.g., those used in the well-known k-median and k-center problems) do not admit algorithms with finite distortion. To sidestep this disappointing fact, we follow two alternative approaches. We first explore whether resource augmentation can be beneficial: we consider algorithms that use more than k clusters but compare their social cost to that of the optimal k-clusterings. We show that using exponentially (in terms of k) many clusters, we can get low (constant or logarithmic) distortion for the k-center and k-median objectives. Interestingly, such an exponential blowup is shown to be necessary. More importantly, we explore whether limited cardinal information can be used to obtain better results. Somewhat surprisingly, for k-median and k-center, we show that a number of queries that is polynomial in k and only logarithmic in n (i.e., sublinear in the number of agents for the most relevant scenarios in practice) is enough to get constant distortion.



Paperid:1058
Authors:Darshan Chakrabarti, Gabriele Farina, Christian Kroer
Columbia University, MIT, Columbia University
Abstract:
We study online learning and equilibrium computation in games with polyhedral decision sets, a property shared by normal-form games (NFGs) and extensive-form games (EFGs), when the learning agent is restricted to using a best-response oracle. We show how to achieve constant regret in zero-sum games and O(T^0.25) regret in general-sum games while using only O(log t) best-response queries at a given iteration t, thus improving over the best prior result, which required O(T) queries per iteration. Moreover, our framework yields the first last-iterate convergence guarantees for self-play with best-response oracles in zero-sum games. This convergence occurs at a linear rate, though with a condition-number dependence. We go on to show an O(T^(-0.5)) best-iterate convergence rate without such a dependence. Our results build on linear-rate convergence results for variants of the Frank-Wolfe (FW) algorithm for strongly convex and smooth minimization problems over polyhedral domains. These FW results depend on a condition number of the polytope, known as the facial distance. In order to enable application to settings such as EFGs, we show two broad new results: 1) the facial distance for polytopes in standard form is at least γ/k, where γ is the minimum value of a nonzero coordinate of a vertex of the polytope and k ≤ n is the number of tight inequality constraints in the optimal face, and 2) the facial distance for polytopes of the form Ax = b, Cx ≤ d, x ≥ 0, where x ∈ R^n, C ≥ 0 is a nonzero integral matrix, and d ≥ 0, is at least 1/(c√n), where c is the infinity norm of C. This yields the first such results for several problems, such as sequence-form polytopes, flow polytopes, and matching polytopes.



Paperid:1059
Authors:Nikhil Chandak, Shashwat Goel, Dominik Peters
IIIT Hyderabad, IIIT Hyderabad, CNRS, LAMSADE, Université Paris Dauphine - PSL
Abstract:
We study the problem of fair sequential decision making given voter preferences. In each round, a decision rule must choose a decision from a set of alternatives, where each voter reports which of these alternatives they approve. Instead of going with the most popular choice in each round, we aim for proportional representation, using axioms inspired by the multiwinner voting literature. The axioms require that if a group of α% of the voters agrees in every round (i.e., approves a common alternative), then those voters must approve at least α% of the decisions. A stronger version of the axioms requires that every group of α% of the voters that agrees in a β fraction of rounds must approve β⋅α% of the decisions. We show that three attractive voting rules satisfy axioms of this style. One of them (Sequential Phragmén) makes its decisions online, and the other two satisfy strengthened versions of the axioms but make decisions semi-online (Method of Equal Shares) or fully offline (Proportional Approval Voting). We present empirical results for these rules based on synthetic data and U.S. political elections. We also run experiments using the Moral Machine dataset about ethical dilemmas: we train preference models on user responses from different countries and let the models cast votes. We find that aggregating these votes using our rules leads to a more equal utility distribution across demographics than making decisions using a single global preference model.



Paperid:1060
Authors:Juhi Chaudhary, Hendrik Molter, Meirav Zehavi
Ben-Gurion University of the Negev, Ben-Gurion University of the Negev, Ben-Gurion University of the Negev
Abstract:
Given a mapping from a set of players to the leaves of a complete binary tree (called a seeding), a knockout tournament is conducted as follows: every round, every two players with a common parent compete against each other, and the winner is promoted to the common parent; then, the leaves are deleted. When only one player remains, it is declared the winner. This is a popular competition format in sports, elections, and decision-making. Over the past decade, it has been studied intensively from both theoretical and practical points of view. Most frequently, the objective is to seed the tournament in a way that "assists" (or even guarantees) a particular player in winning the competition. We introduce a new objective, which is very sensible from the perspective of the directors of the competition: maximize the profit or popularity of the tournament. Specifically, we associate a "score" with every possible match and aim to seed the tournament so as to maximize the sum of the scores of the matches that take place. We focus on the case where a total order on the players' strengths is assumed, and provide a wide spectrum of results on the computational complexity of the problem.
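As a toy illustration of the new objective (our own example, not the paper's algorithm): with a known total order on strengths, so the stronger player always wins, one can brute-force all seedings of four players and pick the one maximizing the total score of the matches actually played.

    from itertools import permutations

    # Hypothetical per-match scores (profit/popularity); players are 0..3,
    # where a lower index means a stronger player.
    score = {frozenset(p): s for p, s in [
        ((0, 1), 10), ((0, 2), 4), ((0, 3), 1),
        ((1, 2), 6), ((1, 3), 2), ((2, 3), 5)]}

    def tournament_value(seeding):
        total, rnd = 0, list(seeding)
        while len(rnd) > 1:
            nxt = []
            for a, b in zip(rnd[::2], rnd[1::2]):   # siblings in the bracket
                total += score[frozenset((a, b))]
                nxt.append(min(a, b))               # stronger player advances
            rnd = nxt
        return total

    best = max(permutations(range(4)), key=tournament_value)
    print(best, tournament_value(best))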



Paperid:1061
Authors:Chandra Chekuri, Pooja Kulkarni, Rucha Kulkarni, Ruta Mehta
UIUC, UIUC, UIUC, UIUC
Abstract:
We study the fair distribution of a collection of m indivisible goods among a group of n agents, using the widely recognized fairness principles of Maximin Share (MMS) and Any Price Share (APS). These principles have undergone thorough investigation within the context of additive valuations. We explore these notions for valuations that extend beyond additivity. First, we study approximate MMS under separable piecewise-linear concave (SPLC) valuations, an important class generalizing additive valuations, where the best known factor was 1/3-MMS. We show that a 1/2-MMS allocation exists and can be computed in polynomial time, significantly improving the state of the art. We note that SPLC valuations introduce an elevated level of intricacy in contrast to additive ones; for instance, the MMS value of an agent can be as high as her value for the entire set of items. We use a relax-and-round paradigm that goes through competitive equilibrium and LP relaxation. Our result extends to give (symmetric) 1/2-APS, a stronger guarantee than MMS. APS is a stronger notion that generalizes MMS by allowing agents with arbitrary entitlements. We study the approximation of APS under submodular valuation functions. We design and analyze a simple greedy algorithm using concave extensions of submodular functions. We prove that the algorithm gives a 1/3-APS allocation, which matches the best-known factor. Concave extensions are hard to compute in polynomial time and are therefore generally not used in approximation algorithms. Our approach shows a way to utilize them within the analysis (while bypassing their computation) and hence might be of independent interest.



Paperid:1062
Authors:Zhaohua Chen, Chang Wang, Qian Wang, Yuqi Pan, Zhuming Shi, Zheng Cai, Yukun Ren, Zhihua Zhu, Xiaotie Deng
CFCS, School of Computer Science, Peking University, Northwestern University, CFCS, School of Computer Science, Peking University, School of Electronics Engineering and Computer Science, Peking University, Stony Brook University, Tencent Technology (Shenzhen) Co., Ltd., Tencent Technology (Shenzhen) Co., Ltd., Tencent Technology (Shenzhen) Co., Ltd., CFCS, School of Computer Science, Peking University CMAR, Institute for Artificial Intelligence, Peking University
Abstract:
In today's online advertising markets, a crucial requirement for an advertiser is to control her total expenditure within a time horizon under some budget. Among various budget control methods, throttling has emerged as a popular choice, managing an advertiser's total expenditure by selecting only a subset of auctions to participate in. This paper provides a theoretical panorama of a single advertiser's dynamic budget throttling process in repeated second-price auctions. We first establish a lower bound on the regret and an upper bound on the asymptotic competitive ratio for any throttling algorithm when the advertiser's values are stochastic and adversarial, respectively. On the algorithmic side, we propose the OGD-CB algorithm, which guarantees a near-optimal expected regret with stochastic values. When values are adversarial, we prove that this algorithm also attains the upper bound on the asymptotic competitive ratio. We further compare throttling with pacing, another widely adopted budget control method, in repeated second-price auctions. In the stochastic case, we demonstrate that pacing is generally superior to throttling for the advertiser, supporting the well-known result that pacing is asymptotically optimal in this scenario. In the adversarial case, however, we show that throttling is also an asymptotically optimal dynamic bidding strategy. Our results bridge gaps in the theoretical study of throttling in repeated auctions and comprehensively reveal the ability of this popular budget-smoothing strategy.



Paperid:1063
Authors:Vincent Conitzer
Carnegie Mellon University
Abstract:
Usually, to apply game-theoretic methods, we must specify utilities precisely, and we run the risk that the solutions we compute are not robust to errors in this specification. Ordinal games provide an attractive alternative: they require specifying only which outcomes are preferred to which other ones. Unfortunately, they provide little guidance for how to play unless there are pure Nash equilibria; evaluating mixed strategies appears to fundamentally require cardinal utilities. In this paper, we observe that we can in fact make good use of mixed strategies in ordinal games if we consider settings that allow for folk theorems. These allow us to find equilibria that are robust, in the sense that they remain equilibria no matter which cardinal utilities are the correct ones -- as long as they are consistent with the specified ordinal preferences. We analyze this concept and study the computational complexity of finding such equilibria in a range of settings.



Paperid:1064
Authors:Kai Cui, Gökçe Dayanıklı, Mathieu Laurière, Matthieu Geist, Olivier Pietquin, Heinz Koeppl
Technische Universität Darmstadt, University of Illinois at Urbana-Champaign, NYU Shanghai, Google DeepMind, Cohere, Technische Universität Darmstadt
Abstract:
Recent techniques based on Mean Field Games (MFGs) allow the scalable analysis of multiplayer games with many similar, rational agents. However, standard MFGs remain limited to homogeneous players that weakly influence each other, and cannot model major players that strongly influence other players, severely limiting the class of problems that can be handled. We propose a novel discrete-time version of major-minor MFGs (M3FGs), along with a learning algorithm based on fictitious play and partitioning the probability simplex. Importantly, M3FGs generalize MFGs with common noise and can handle not only random exogenous environment states but also major players. A key challenge is that the mean field is stochastic and not deterministic as in standard MFGs. Our theoretical investigation verifies both the M3FG model and its algorithmic solution, showing firstly the well-posedness of the M3FG model starting from a finite game of interest, and secondly convergence and approximation guarantees of the fictitious play algorithm. Then, we empirically verify the obtained theoretical results, ablating some of the theoretical assumptions made, and show successful equilibrium learning in three example problems. Overall, we establish a learning framework for a novel and broad class of tractable games.



Paperid:1065
Authors:Michael Curry, Vinzenz Thoma, Darshan Chakrabarti, Stephen McAleer, Christian Kroer, Tuomas Sandholm, Niao He, Sven Seuken
Harvard University University of Zurich ETH AI Center, ETH Zurich ETH AI Center, Columbia University, Carnegie Mellon University, Computer Science Department, Columbia University, Carnegie Mellon University, Computer Science Department Optimized Markets, Strategy Robot, Strategic Machine, ETH Zurich, University of Zurich ETH AI Center
Abstract:
Dynamic mechanism design is a challenging extension to ordinary mechanism design in which the mechanism designer must make a sequence of decisions over time in the face of possibly untruthful reports of participating agents. Optimizing dynamic mechanisms for welfare is relatively well understood. However, there has been less work on optimizing for other goals (e.g., revenue), and without restrictive assumptions on valuations, it is remarkably challenging to characterize good mechanisms. Instead, we turn to automated mechanism design to find mechanisms with good performance in specific problem instances. We extend the class of affine maximizer mechanisms to MDPs where agents may untruthfully report their rewards. This extension results in a challenging bilevel optimization problem in which the upper problem involves choosing optimal mechanism parameters, and the lower problem involves solving the resulting MDP. Our approach can find truthful dynamic mechanisms that achieve strong performance on goals other than welfare, and can be applied to essentially any problem setting (without restrictions on valuations) for which RL can learn optimal policies.



Paperid:1066
Authors:Greg d'Eon, Sophie Greenwood, Kevin Leyton-Brown, James R. Wright
University of British Columbia, University of British Columbia Cornell University, University of British Columbia, University of Alberta
Abstract:
Researchers building behavioral models, such as behavioral game theorists, use experimental data to evaluate predictive models of human behavior. However, there is little agreement about which loss function should be used in evaluations, with error rate, negative log-likelihood, cross-entropy, Brier score, and squared L2 error all being common choices. We attempt to offer a principled answer to the question of which loss functions should be used for this task, formalizing axioms that we argue loss functions should satisfy. We construct a family of loss functions, which we dub ``diagonal bounded Bregman divergences'', that satisfy all of these axioms. These rule out many loss functions used in practice, but notably include squared L2 error; we thus recommend its use for evaluating behavioral models.
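For context, a Bregman divergence generated by a strictly convex function \(\varphi\) has the standard form below; taking \(\varphi(x)=\lVert x\rVert_2^2\) recovers the squared L2 error that the authors ultimately recommend (the general definition is textbook material, not specific to this paper):

\[
D_\varphi(p, q) \;=\; \varphi(p) - \varphi(q) - \langle \nabla\varphi(q),\, p - q\rangle,
\qquad
\varphi(x)=\lVert x\rVert_2^2 \ \Rightarrow\ D_\varphi(p,q)=\lVert p - q\rVert_2^2 .
\]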



Paperid:1067
Authors:Théo Delemazure, Jérôme Lang, Grzegorz Pierczyński
CNRS, Université Paris Dauphine, PSL, CNRS, Université Paris Dauphine, PSL, University of Warsaw
Abstract:
We give a quantitative analysis of the independence of irrelevant alternatives (IIA) axiom. IIA says that the society's preference between x and y should depend only on individual preferences between x and y: we show that, in several contexts, if the individuals express their preferences about additional (``irrelevant'') alternatives, this information helps to estimate better which of x and y has higher social welfare. Our contribution is threefold: (1) we provide a new tool to measure the impact of IIA on social welfare (pairwise distortion), based on the well-established notion of voting distortion, (2) we study the average impact of IIA in both general and metric settings, with experiments on synthetic and real data and (3) we study the worst-case impact of IIA in the 1D-Euclidean metric space.



Paperid:1068
Authors:Argyrios Deligkas, Eduard Eiben, Viktoriia Korchemna, Šimon Schierreich
Royal Holloway University of London, Royal Holloway, University of London, TU Wien, Czech Technical University in Prague
Abstract:
We study the computational complexity of fairly allocating a set of indivisible items under externalities. In this recently proposed setting, in addition to the utility the agent gets from their bundle, they also receive utility from items allocated to other agents. We focus on the extended definitions of envy-freeness up to one item (EF1) and of envy-freeness up to any item (EFX), and we provide the landscape of their complexity for several different scenarios. We prove that it is NP-complete to decide whether there exists an EFX allocation, even when there are only three agents, or even when there are only six different values for the items. We complement these negative results by showing that when both the number of agents and the number of different values for items are bounded by a parameter the problem becomes fixed-parameter tractable. Furthermore, we prove that two-valued and binary-valued instances are equivalent and that EFX and EF1 allocations coincide for this class of instances. Finally, motivated by real-life scenarios, we focus on a class of structured valuation functions, which we term agent/item-correlated. We prove their equivalence to the "standard" setting without externalities. Therefore, all previous results for EF1 and EFX apply immediately for these valuations.



Paperid:1069
Authors:Xiaotie Deng, Hangxin Gan, Ningyuan Li, Weian Li, Qi Qi
Center on Frontiers of Computing Studies, School of Computer Science, Peking University, Beijing, China., School of Mathematical Science, Nankai University, Tianjin, China., Center on Frontiers of Computing Studies, School of Computer Science, Peking University, Beijing, China., Center on Frontiers of Computing Studies, School of Computer Science, Peking University, Beijing, China., Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China.
Abstract:
We investigate a two-stage competitive model involving multiple contests. In this model, each contest designer chooses two participants from a pool of candidate contestants and determines the biases. Contestants strategically distribute their efforts across various contests within their budget. We first show the existence of a pure strategy Nash equilibrium (PNE) for the contestants, and propose a fully polynomial-time approximation scheme to compute an approximate PNE. In the scenario where designers simultaneously decide the participants and biases, the subgame perfect equilibrium (SPE) may not exist. Nonetheless, when designers' decisions are made in two substages, the existence of SPE is established. In the scenario where designers can hold multiple contests, we show that the SPE always exists under mild conditions and can be computed efficiently.



Paperid:1070
Authors:Chris Dong, Patrick Lederer
Technical University of Munich, Technical University of Munich
Abstract:
In approval-based committee (ABC) elections, the goal is to select a fixed-size subset of the candidates, a so-called committee, based on the voters' approval ballots over the candidates. One of the most popular classes of ABC voting rules is the class of ABC scoring rules, for which voters give points to each committee and the committees with maximal total points are chosen. While the set of ABC scoring rules has recently been characterized in a model where the output is a ranking of all committees, no full characterization of these rules exists in the standard model where a set of winning committees is returned. We address this issue by characterizing two important subclasses of ABC scoring rules in the standard ABC election model, thereby both extending the result for ABC ranking rules to the standard setting and refining it to subclasses. In more detail, by relying on a consistency axiom for variable electorates, we characterize (i) the prominent class of Thiele rules and (ii) a new class of ABC voting rules called ballot size weighted approval voting. Based on these theorems, we also infer characterizations of three well-known ABC voting rules, namely multi-winner approval voting, proportional approval voting, and satisfaction approval voting.
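As a concrete instance of the Thiele family characterized here, the sketch below scores committees under proportional approval voting (PAV), where a voter contributes the harmonic number of her approved committee members; this follows the textbook definition and is not code from the paper.

from itertools import combinations

def pav_score(approval_ballots, committee):
    """PAV (a Thiele rule): each voter adds 1 + 1/2 + ... + 1/r,
    where r is the number of committee members she approves."""
    score = 0.0
    for ballot in approval_ballots:
        r = len(ballot & committee)
        score += sum(1.0 / i for i in range(1, r + 1))
    return score

def pav_winners(approval_ballots, candidates, k):
    """Return all size-k committees with maximal PAV score (brute force)."""
    best, winners = float("-inf"), []
    for committee in combinations(candidates, k):
        s = pav_score(approval_ballots, frozenset(committee))
        if s > best:
            best, winners = s, [set(committee)]
        elif s == best:
            winners.append(set(committee))
    return winners

ballots = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
print(pav_winners(ballots, ["a", "b", "c"], k=2))  # [{'a', 'b'}]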



Paperid:1071
Authors:Seyed A. Esmaeili, Darshan Chakrabarti, Hayley Grape, Brian Brubach
Simons Laufer Mathematical Sciences Institute and University of Chicago Data Science Institute, Columbia University, Wellesley College, Wellesley College
Abstract:
In representative democracy, a redistricting map is chosen to partition an electorate into districts, each of which elects a representative. A valid redistricting map must satisfy a collection of constraints such as being compact, contiguous, and of almost-equal population. However, these constraints are loose enough to enable an enormous ensemble of valid redistricting maps. This enables a partisan legislature to gerrymander by choosing a map which unfairly favors it. In this paper, we introduce an interpretable and tractable distance measure over redistricting maps which does not use election results and study its implications over the ensemble of redistricting maps. Specifically, we define a central map which may be considered "most typical" and give a rigorous justification for it by showing that it mirrors the Kemeny ranking in a scenario where we have a committee voting over a collection of redistricting maps to be drawn. We include running time and sample complexity analysis for our algorithms, including some negative results which hold using any algorithm. We further study outlier detection based on this distance measure and show that our framework can detect some gerrymandered maps. More precisely, we show some maps that are widely considered to be gerrymandered that lie very far away from our central maps in comparison to a large ensemble of valid redistricting maps. Since our distance measure does not rely on election results, this gives a significant advantage in gerrymandering detection which is lacking in all previous methods.



Paperid:1072
Authors:Michal Feldman, Simon Mauras, Tomasz Ponitka
Tel Aviv University, Microsoft ILDC, Tel Aviv University, Tel Aviv University
Abstract:
A major problem in fair division is how to allocate a set of indivisible resources among agents fairly and efficiently. The goal of this work is to characterize the tradeoffs between two well-studied measures of fairness and efficiency --- envy-freeness up to any item (EFX) for fairness, and Nash welfare for efficiency --- by determining, for given constants α and β, whether there exists an α-EFX allocation that guarantees a β-fraction of the maximum Nash welfare (β-MNW). For additive valuations, we show that for any α ∈ [0,1], there exists a partial allocation that is α-EFX and 1/(α+1)-MNW. This tradeoff turns out to be tight (for every α) as demonstrated by an impossibility result that we give. We also show that for α ∈ [0, φ-1 ≃ 0.618] these partial allocations can be turned into complete allocations where all items are assigned. Furthermore, for any α ∈ [0, 1/2], we show that the tight tradeoff of α-EFX and 1/(α+1)-MNW with complete allocations holds for the more general setting of subadditive valuations. Our results improve upon the current state of the art, for both additive and subadditive valuations, and match the best-known approximations of EFX under complete allocations, regardless of Nash welfare guarantees. Notably, our constructions for additive valuations also provide EF1 and constant approximations for maximin share guarantees.



Paperid:1073
Authors:Bailey Flanigan, Jennifer Liang, Ariel D. Procaccia, Sven Wang
Carnegie Mellon University, Harvard University, Harvard University, Massachusetts Institute of Technology
Abstract:
Among the recent work on designing algorithms for selecting citizens' assembly participants, one key property of these algorithms has not yet been studied: their manipulability. Strategic manipulation is a concern because these algorithms must satisfy representation constraints according to volunteers' self-reported features; misreporting these features could thereby increase a volunteer's chance of being selected, decrease someone else's chance, and/or increase the expected number of seats given to their group. Strikingly, we show that Leximin — an algorithm that is widely used for its fairness — is highly manipulable in this way. We then introduce a new class of selection algorithms that use Lp norms as objective functions. We show that the manipulability of the Lp-based algorithm decreases in O(1/n^(1-1/p)) as the number of volunteers n grows, approaching the optimal rate of O(1/n) as p approaches infinity. These theoretical results are confirmed via experiments in eight real-world datasets.
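A rough sketch of the Lp idea (our own simplification, not the authors' exact objective): given a distribution over feasible panels, compute each volunteer's selection probability and an Lp-norm penalty for deviating from the equal-probability benchmark k/n.

def selection_probabilities(panels, panel_probs, volunteers):
    """panels: list of sets of volunteers; panel_probs: matching probabilities."""
    return {v: sum(p for panel, p in zip(panels, panel_probs) if v in panel)
            for v in volunteers}

def lp_deviation(panels, panel_probs, volunteers, k, p):
    """Lp norm of deviations from the equal-selection benchmark k/n
    (an illustrative objective; the paper optimizes a related Lp criterion)."""
    n = len(volunteers)
    probs = selection_probabilities(panels, panel_probs, volunteers)
    return sum(abs(probs[v] - k / n) ** p for v in volunteers) ** (1.0 / p)

volunteers = ["v1", "v2", "v3", "v4"]
panels = [{"v1", "v2"}, {"v3", "v4"}, {"v1", "v3"}]
print(lp_deviation(panels, [0.4, 0.4, 0.2], volunteers, k=2, p=3))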



Paperid:1074
Authors:Rupert Freeman, Ulrike Schmidt-Kraepelin
University of Virginia, TU Eindhoven
Abstract:
We study the budget aggregation problem in which a set of strategic voters must split a finite divisible resource (such as money or time) among a set of competing projects. Our goal is twofold: we seek mechanisms that are truthful and that provide fairness guarantees to the projects. For the first objective, we focus on the class of moving phantom mechanisms, which are, to this day, essentially the only known truthful mechanisms in this setting. For project fairness, we consider the mean division as a fair baseline, and bound the maximum difference between the funding received by any project and this baseline. We propose a novel and simple moving phantom mechanism that provides optimal project fairness guarantees. As a corollary of our results, we show that our new mechanism minimizes the L1 distance to the mean for three projects and gives the first non-trivial bounds on this quantity for more than three projects.



Paperid:1075
Authors:Gianluigi Greco, Francesco Scarcello
DEMACS, University of Calabria, Italy, DIMES, University of Calabria, Italy
Abstract:
Fair allocation of indivisible goods presents intriguing challenges from both a social choice perspective and an algorithmic standpoint. Due to the indivisibility of goods, it is common for one agent to envy the bundle of goods assigned to another agent and, indeed, envy-free solutions do not exist in general. In line with the classical game-theoretic concept of Nucleolus in coalitional games, we propose that a fair allocation should minimize the agents’ dissatisfaction profile in a lexicographic manner, where the dissatisfaction of an agent is defined as her maximum envy towards other agents. Therefore, we seek allocations that minimize the maximum envy. In cases where multiple solutions have an equal maximum value, we minimize the second-worst value, and so on. Additionally, as is customary in fair division problems, we also consider an efficiency requirement: among the allocations with the best agents’ dissatisfaction profile, we prioritize those that maximize the sum of agents’ utilities, known as maximum social welfare. Such allocations, referred to as maxileximin allocations, always exist. In this study, we analyze the computational properties of maxileximin allocations in the context of fair allocation problems with constraints. Specifically, we focus on the Connected Fair Division problem, where goods correspond to the nodes of a graph, and a bundle of goods is allowed if the subgraph formed by those goods is connected. We demonstrate that the problem is FΔᵖ₂-complete, even for instances with simple graphical structures such as path and star graphs. However, we identify islands of tractability for instances with more intricate graphs, such as those having bounded treewidth, provided that the number of agents is bounded by a fixed number and utility functions use small values.



Paperid:1076
Authors:Svenja M. Griesbach, Martin Hoefer, Max Klimm, Tim Koglin
TU Berlin, Goethe University Frankfurt, TU Berlin, Goethe University Frankfurt
Abstract:
We study a novel approach to information design in the standard traffic model of network congestion games. It captures the natural condition that the demand is unknown to the users of the network. A principal (e.g., a mobility service) commits to a signaling strategy, observes the realized demand and sends a (public) signal to agents (i.e., users of the network). Based on the induced belief about the demand, the users then form an equilibrium. We consider the algorithmic goal of the principal: Compute a signaling scheme that minimizes the expected total cost of the induced equilibrium. We concentrate on single-commodity networks and affine cost functions, for which we obtain the following results. First, we devise a fully polynomial-time approximation scheme (FPTAS) for the case that the demand can only take two values. It relies on several structural properties of the cost of the induced equilibrium as a function of the updated belief about the distribution of demands. We show that this function is piecewise linear for any number of demands, and monotonic for two demands. Second, we give a complete characterization of the graph structures for which it is optimal to fully reveal the information about the realized demand. This signaling scheme turns out to be optimal for all cost functions and probability distributions over demands if and only if the graph is series-parallel. Third, we propose an algorithm that computes the optimal signaling scheme for any number of demands whose time complexity is polynomial in the number of supports that occur in a Wardrop equilibrium for some demand. Finally, we conduct a computational study that tests this algorithm on real-world instances.



Paperid:1077
Authors:Yue Guan, Mohammad Afshari, Panagiotis Tsiotras
Georgia Institute of Technology, Georgia Institute of Technology, Georgia Institute of Technology
Abstract:
This work studies the behaviors of two large-population teams competing in a discrete environment. The team-level interactions are modeled as a zero-sum game while the agent dynamics within each team are formulated as a collaborative mean-field team problem. Drawing inspiration from the mean-field literature, we first approximate the large-population team game with its infinite-population limit. Subsequently, we construct a fictitious centralized system and transform the infinite-population game to an equivalent zero-sum game between two coordinators. Via a novel reachability analysis, we study the optimality of coordination strategies, which induce decentralized strategies under the original information structure. The optimality of the resulting strategies is established in the original finite-population game, and the theoretical guarantees are verified by numerical examples.



Paperid:1078
Authors:Mingyu Guo
The University of Adelaide
Abstract:
We study worst-case VCG redistribution mechanism design for the public project problem. The mechanism design task comes down to designing a payment function that maximizes the worst-case allocative efficiency ratio. We use a multilayer perceptron (MLP) with ReLU activation to model the payment function and use mixed integer programming (MIP) to solve for the worst-case type profiles that maximally violate the mechanism design constraints. We collect these worst-case type profiles and use them as training samples to train toward better worst-case mechanisms. In practice, we require a tiny neural network structure for the above approach to scale. The Lottery Ticket Hypothesis states that a large network is likely to contain a "winning ticket" -- a much smaller subnetwork that "won the initialization lottery", which makes its training particularly effective. Motivated by this hypothesis, we train a large network and prune it into a tiny subnetwork. We run MIP-based worst-case training on the drawn subnetwork and evaluate the resulting mechanism's worst-case performance. If the subnetwork does not achieve good worst-case performance, then we record the type profiles that cause the current draw to be bad. To draw again, we restore the large network to its initial weights and prune using recorded type profiles from earlier draws, therefore avoiding drawing the same ticket twice. We expect to eventually encounter a tiny subnetwork that leads to effective training for our worst-case mechanism design task. Lastly, a by-product of multiple ticket draws is an ensemble of mechanisms with different worst cases, which improves the worst-case performance further. Using our approach, we find previously unknown optimal mechanisms for up to 5 agents. Our results confirm the tightness of existing theoretical upper bounds. For up to 20 agents, we derive significantly improved worst-case mechanisms, surpassing a long list of existing manual results.
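The multi-draw pruning loop described above can be sketched roughly as follows; the worst-case evaluation is a placeholder for the MIP-based search, and all function names are illustrative assumptions rather than the authors' code.

import numpy as np

def draw_tickets(init_weights, keep_fraction, evaluate_worst_case, max_draws=10):
    """Lottery-ticket style search for a tiny subnetwork that trains well on
    the worst-case mechanism design task (illustrative sketch only).

    init_weights        -- dict: layer name -> initial weight matrix
    evaluate_worst_case -- callback: pruned weights -> (score, bad_profiles);
                           stands in for MIP-based worst-case training.
    """
    recorded_profiles = []
    for _ in range(max_draws):
        # Always restart from the *initial* weights, as in the Lottery Ticket
        # Hypothesis, and prune using type profiles recorded in earlier draws.
        weights = {name: w.copy() for name, w in init_weights.items()}
        masks = prune(weights, keep_fraction, recorded_profiles)
        pruned = {name: w * masks[name] for name, w in weights.items()}
        score, bad_profiles = evaluate_worst_case(pruned)
        if score >= 0.0:          # illustrative "good enough" criterion
            return pruned
        recorded_profiles.extend(bad_profiles)  # avoid drawing this ticket again
    return None

def prune(weights, keep_fraction, recorded_profiles):
    """Magnitude-pruning placeholder; in the paper, pruning is informed by the
    recorded worst-case type profiles, which this sketch ignores."""
    masks = {}
    for name, w in weights.items():
        k = max(1, int(keep_fraction * w.size))
        threshold = np.sort(np.abs(w), axis=None)[-k]
        masks[name] = (np.abs(w) >= threshold).astype(w.dtype)
    return masks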



Paperid:1079
Authors:Sushmita Gupta, Ramanujan Sridharan, Peter Strulo
The Institute of Mathematical Sciences, HBNI, Chennai, University of Warwick, University of Warwick
Abstract:
Single-elimination (SE) tournaments are a popular format used in competitive environments and decision making. Algorithms for SE tournament manipulation have been an active topic of research in recent years. In this paper, we initiate the algorithmic study of a novel variant of SE tournament manipulation that aims to model the fact that certain matchups are highly desired in a sporting context, incentivizing an organizer to manipulate the bracket to make such matchups take place. We obtain both hardness and tractability results. We show that while the problem of computing a bracket enforcing a given set of matches in an SE tournament is NP-hard, there are natural restrictions that lead to polynomial-time solvability. In particular, we show polynomial-time solvability if there is a linear ordering on the ability of players with only a constant number of exceptions where a player with lower ability beats a player with higher ability.



Paperid:1080
Authors:Mohammad Hajiaghayi, Mohammad Mahdavi, Keivan Rezaei, Suho Shin
University of Maryland, University of Maryland, University of Maryland, University of Maryland
Abstract:
We present a study of a repeated delegated choice problem, the first to consider an online learning variant of Kleinberg and Kleinberg (EC'18). In this model, a principal interacts repeatedly with an agent who possesses an exogenous set of solutions to search for efficient ones. Each solution can yield varying utility for both the principal and the agent, and the agent may propose a solution to maximize its own utility in a selfish manner. To mitigate this behavior, the principal announces an eligible set which screens out a certain set of solutions. The principal, however, does not have any information on the distribution of solutions nor the number of solutions in advance. Therefore, the principal dynamically announces various eligible sets to efficiently learn the distribution. The principal's objective is to minimize cumulative regret compared to the optimal eligible set in hindsight. We explore two dimensions of the problem setup, whether the agent behaves myopically or strategizes across the rounds, and whether the solutions yield deterministic or stochastic utility. We obtain sublinear regret upper bounds in various regimes, and derive corresponding lower bounds which imply the tightness of the results. Overall, we bridge a well-known problem in economics with the evolving area of online learning, and present a comprehensive study of this problem.



Paperid:1081
Authors:Haoqiang Huang, Zihe Wang, Zhide Wei, Jie Zhang
Hong Kong University of Science and Technology, Renmin University of China, Peking University, University of Bath
Abstract:
In this paper, we delve into the problem of using monetary incentives to encourage players to shift from an initial Nash equilibrium to a more favorable one within a game. Our main focus revolves around computing the minimum reward required to facilitate this equilibrium transition. The game involves a single row player who possesses m strategies and k column players, each endowed with n strategies. Our findings reveal that determining whether the minimum reward is zero is NP-complete, and computing the minimum reward becomes APX-hard. Nonetheless, we bring some positive news, as this problem can be efficiently handled if either k or n is a fixed constant. Furthermore, we have devised an approximation algorithm with an additive error that runs in polynomial time. Lastly, we explore a specific case wherein the utility functions exhibit single-peaked characteristics, and we successfully demonstrate that the optimal reward can be computed in polynomial time.



Paperid:1082
Authors:Ayumi Igarashi, Naoyuki Kamiyama, Warut Suksompong, Sheung Man Yuen
University of Tokyo, Kyushu University, National University of Singapore, National University of Singapore
Abstract:
In the allocation of indivisible goods, a prominent fairness notion is envy-freeness up to one good (EF1). We initiate the study of reachability problems in fair division by investigating the problem of whether one EF1 allocation can be reached from another EF1 allocation via a sequence of exchanges such that every intermediate allocation is also EF1. We show that two EF1 allocations may not be reachable from each other even in the case of two agents, and deciding their reachability is PSPACE-complete in general. On the other hand, we prove that reachability is guaranteed for two agents with identical or binary utilities as well as for any number of agents with identical binary utilities. We also examine the complexity of deciding whether there is an EF1 exchange sequence that is optimal in the number of exchanges required.
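For reference, the EF1 condition checked throughout these reachability questions can be verified directly for additive valuations; the helper below is a standard check, not code from the paper.

def is_ef1(valuations, allocation):
    """Standard EF1 check for additive valuations over goods.

    valuations[i][g] -- agent i's (non-negative) value for item g
    allocation[i]    -- list of items held by agent i
    """
    def value(i, bundle):
        return sum(valuations[i][g] for g in bundle)

    n = len(allocation)
    for i in range(n):
        for j in range(n):
            if i == j or not allocation[j]:
                continue
            # i must not envy j once i's most-valued item is removed from j's bundle.
            best = max(valuations[i][g] for g in allocation[j])
            if value(i, allocation[i]) < value(i, allocation[j]) - best:
                return False
    return True

vals = [{"x": 3, "y": 1, "z": 1}, {"x": 1, "y": 2, "z": 2}]
print(is_ef1(vals, [["x"], ["y", "z"]]))  # True: each agent is EF1 toward the other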



Paperid:1083
Authors:Ayumi Igarashi, Martin Lackner, Oliviero Nardi, Arianna Novaro
University of Tokyo, DBAI, TU Wien, DBAI, TU Wien, CES, Université Paris 1 Panthéon-Sorbonne
Abstract:
The problem of fairly allocating a set of indivisible items is a well-known challenge in the field of (computational) social choice. In this scenario, there is a fundamental incompatibility between notions of fairness (such as envy-freeness and proportionality) and economic efficiency (such as Pareto-optimality). However, in the real world, items are not always allocated once and for all, but often repeatedly. For example, the items may be recurring chores to distribute in a household. Motivated by this, we initiate the study of the repeated fair division of indivisible goods and chores, and propose a formal model for this scenario. In this paper, we show that, if the number of repetitions is a multiple of the number of agents, there always exists a sequence of allocations that is proportional and Pareto-optimal. On the other hand, irrespective of the number of repetitions, an envy-free and Pareto-optimal sequence of allocations may not exist. For the case of two agents, we show that if the number of repetitions is even, it is always possible to find a sequence of allocations that is overall envy-free and Pareto-optimal. We then prove even stronger fairness guarantees, showing that every allocation in such a sequence satisfies some relaxation of envy-freeness. Finally, when the number of repetitions can be chosen freely, we show that envy-free and Pareto-optimal allocations are achievable for any number of agents.



Paperid:1084
Authors:Aviram Imber, Jonas Israel, Markus Brill, Hadas Shachnai, Benny Kimelfeld
Technion – Israel Institute of Technology, Technische Universität Berlin, University of Warwick Technische Universität Berlin, Technion – Israel Institute of Technology, Technion – Israel Institute of Technology
Abstract:
We consider spatial voting where candidates are located in the Euclidean d-dimensional space, and each voter ranks candidates based on their distance from the voter's ideal point. We explore the case where information about the location of voters' ideal points is incomplete: for each dimension, we are given an interval of possible values. We study the computational complexity of finding the possible and necessary winners for positional scoring rules. Our results show that we retain tractable cases of the classic model where voters have partial-order preferences. Moreover, we show that there are positional scoring rules under which the possible-winner problem is intractable for partial orders, but tractable in the one-dimensional spatial setting. We also consider approval voting in this setting. We show that for up to two dimensions, the necessary-winner problem is tractable, while the possible-winner problem is hard for any number of dimensions.



Paperid:1085
Authors:Pallavi Jain, Rohit Vaish
Indian Institute of Technology Jodhpur, Indian Institute of Technology Delhi
Abstract:
The maximum Nash social welfare (NSW), which maximizes the geometric mean of agents' utilities, is a fundamental solution concept with remarkable fairness and efficiency guarantees. The computational aspects of NSW have been extensively studied for *one-sided* preferences where a set of agents have preferences over a set of resources. Our work deviates from this trend and studies NSW maximization for *two-sided* preferences, wherein a set of workers and firms, each having a cardinal valuation function, are matched with each other. We provide a systematic study of the computational complexity of maximizing NSW for many-to-one matchings under two-sided preferences. Our main negative result is that maximizing NSW is NP-hard even in a highly restricted setting where each firm has capacity 2, all valuations are in the range {0,1,2}, and each agent positively values at most three other agents. In search of positive results, we develop approximation algorithms as well as parameterized algorithms in terms of natural parameters such as the number of workers, the number of firms, and the firms' capacities. We also provide algorithms for restricted domains such as symmetric binary valuations and bounded degree instances.



Paperid:1086
Authors:Jihyeok Jung, Chan-Oi Song, Deok-Joo Lee, Kiho Yoon
Seoul National University, Korea University, Seoul National University, Korea University
Abstract:
This study introduces an optimal mechanism in a dynamic stochastic knapsack environment. The model features a single seller who has a fixed quantity of a perfectly divisible item. Impatient buyers with a piecewise linear utility function arrive randomly and report their two-dimensional private information: marginal value and demanded quantity. We derive a revenue-maximizing dynamic mechanism in a finite discrete-time framework that satisfies incentive compatibility, individual rationality, and feasibility conditions. This is achieved by characterizing buyers' utility and utilizing the Bellman equation. Moreover, we establish the essential penalty scheme for incentive compatibility, as well as the allocation and payment policies. Lastly, we propose algorithms to approximate the optimal policy, based on the Monte Carlo simulation-based regression method and reinforcement learning.



Paperid:1087
Authors:Yusuf Kalayci, David Kempe, Vikram Kher
University of Southern California, University of Southern California, Yale University
Abstract:
We introduce a novel definition for a small set R of k points being "representative" of a larger set in a metric space. Given a set V (e.g., documents or voters) to represent, and a set C of possible representatives, our criterion requires that for any subset S comprising a theta fraction of V, the average distance of S to their best theta*k points in R should not be more than a factor gamma compared to their average distance to the best theta*k points among all of C. This definition is a strengthening of proportional fairness and core fairness, but, unlike those notions, requires that large cohesive clusters be represented proportionally to their size. Since there are instances for which, unless gamma is polynomially large, no solutions exist, we study this notion in a resource augmentation framework, implicitly stating the constraints for a set R of size k as though its size were only k/alpha, for alpha > 1. Furthermore, motivated by the application to elections, we mostly focus on the "ordinal" model, where the algorithm does not learn the actual distances; instead, it learns only, for each point v in V and each candidate pair c, c', which of c, c' is closer to v. Our main result is that the Expanding Approvals Rule (EAR) of Aziz and Lee is (alpha, gamma)-representative with gamma <= 1 + 6.71 * (alpha)/(alpha-1). Our results lead to three notable byproducts. First, we show that the EAR achieves constant proportional fairness in the ordinal model, giving the first positive result on metric proportional fairness with ordinal information. Second, we show that for the core fairness objective, the EAR achieves the same asymptotic tradeoff between resource augmentation and approximation as the recent results of Li et al., which used full knowledge of the metric. Finally, our results imply a very simple single-winner voting rule with metric distortion at most 44.



Paperid:1088
Authors:Yasushi Kawase, Kazuhisa Makino, Hanna Sumita, Akihisa Tamura, Makoto Yokoo
University of Tokyo, Kyoto University, Tokyo Institute of Technology, Keio University, Kyushu University
Abstract:
We study the fair division of indivisible items with subsidies among n agents, where the absolute marginal valuation of each item is at most one. Under monotone valuations (where each item is a good), it is known that a maximum subsidy of 2(n-1) and a total subsidy of 2(n-1)² are sufficient to guarantee the existence of an envy-freeable allocation. In this paper, we improve upon these bounds, even in a wider model. Namely, we show that, given an EF1 allocation, we can compute in polynomial time an envy-free allocation with a subsidy of at most n-1 per agent and a total subsidy of at most n(n-1)/2. Moreover, we present further improved bounds for monotone valuations.
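The subsidy bounds above concern envy-freeable allocations; as background, for an envy-freeable allocation with additive valuations the minimal subsidies can be read off the envy graph (a known characterization from earlier work on fair division with subsidies, sketched here by brute force for small instances; this is not the paper's algorithm).

from itertools import permutations

def min_subsidies(valuations, allocation):
    """Minimal envy-eliminating subsidies for an envy-freeable allocation.

    valuations[i][g] -- agent i's value for item g (additive)
    allocation[i]    -- bundle (list of items) assigned to agent i
    Uses the envy-graph characterization: agent i's subsidy equals the maximum
    total weight of a path starting at i, where arc (i, j) has weight
    v_i(A_j) - v_i(A_i).  The allocation must be envy-freeable (no
    positive-weight cycle).  Brute force over simple paths; small n only.
    """
    n = len(allocation)
    value = lambda i, bundle: sum(valuations[i][g] for g in bundle)
    w = [[value(i, allocation[j]) - value(i, allocation[i]) for j in range(n)]
         for i in range(n)]

    def max_path_weight(start):
        best = 0.0
        others = [j for j in range(n) if j != start]
        for r in range(1, n):
            for path in permutations(others, r):
                nodes = (start,) + path
                weight = sum(w[nodes[t]][nodes[t + 1]] for t in range(len(nodes) - 1))
                best = max(best, weight)
        return best

    return [max_path_weight(i) for i in range(n)]

vals = [{"x": 2, "y": 0}, {"x": 2, "y": 1}]
print(min_subsidies(vals, [["x"], ["y"]]))  # agent 1 needs a subsidy of 1 to remove envy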



Paperid:1089
Authors:Jiaqian Li, Minming Li, Hau Chan
City University of Hong Kong, City University of Hong Kong, University of Nebraska-Lincoln
Abstract:
We study the group-fair obnoxious facility location problems from the mechanism design perspective where agents belong to different groups and have private location preferences on the undesirable locations of the facility. Our main goal is to design strategyproof mechanisms that elicit the true location preferences from the agents and determine a facility location that approximately optimizes several group-fair objectives. We first consider the maximum total and average group cost (group-fair) objectives. For these objectives, we propose deterministic mechanisms that achieve 3-approximation ratios and provide matching lower bounds. We then provide the characterization of 2-candidate strategyproof randomized mechanisms. Leveraging the characterization, we design randomized mechanisms with improved approximation ratios of 2 for both objectives. We also provide randomized lower bounds of 5/4 for both objectives. Moreover, we investigate intergroup and intragroup fairness (IIF) objectives, addressing fairness between groups and within each group. We present a mechanism that achieves a 4-approximation for the IIF objectives and provide tight lower bounds.



Paperid:1090
Authors:Junkang Li, Bruno Zanuttini, Véronique Ventos
NukkAI, Paris, France Normandie Univ.; UNICAEN, ENSICAEN, CNRS, GREYC, 14 000 Caen, France, Normandie Univ.; UNICAEN, ENSICAEN, CNRS, GREYC, 14 000 Caen, France, NukkAI, Paris, France
Abstract:
Games with incomplete information are games that model situations where players do not have common knowledge about the game they play, e.g. card games such as poker or bridge. Opponent models can be of crucial importance for decision-making in such games. We propose algorithms for computing optimal and/or robust strategies in games with incomplete information, given various types of knowledge about opponent models. As an application, we describe a framework for reasoning about an opponent's reasoning in such games, where opponent models arise naturally.



Paperid:1091
Authors:Miao Li, Yuhan Cao, Dengji Zhao
ShanghaiTech University, ShanghaiTech University, ShanghaiTech University
Abstract:
Mechanism design on social networks has attracted extensive attention recently. The goal is to design mechanisms to incentivize participants to invite more participants via their social networks, and the challenge is that the participants are competitors. Various mechanisms have been proposed for single/multiple-unit auctions, but it has been shown that it is challenging to design such mechanisms for more complex settings. We move this forward to investigate a double auction on a network where each trader (a buyer or a seller) can link to other buyers and sellers. Incentivizing invitation is more difficult than in multi-unit one-sided auctions, because there are two different roles and a buyer (seller) seems happy to invite a seller (buyer), but again the invited seller (buyer) may invite another buyer (seller) to compete with the original buyer (seller). To combat this, we propose a solution called dynamic trade reduction (DTR), which also guarantees a non-negative revenue for the market owner. Interestingly, our solution is also applicable to the multi-unit one-sided auction when there is only one seller linking to only buyers on the network. We believe that the principle of our solution has the potential to be extended to design the multi-item one-sided auction.



Paperid:1092
Authors:Taylor Lundy, Narun Raman, Hu Fu, Kevin Leyton-Brown
University of British Columbia, University of British Columbia, Shanghai University of Finance and Economics Key Laboratory of Interdisciplinary Research of Computation and Economics, Ministry of Education, University of British Columbia
Abstract:
Mobile gaming is a rapidly growing and incredibly profitable sector; having grown sevenfold over the past 10 years, it now grosses over $100 billion annually. This growth was due in large part to a shift in monetization strategies: rather than charging players an upfront cost ("pay-to-play"), games often request optional microtransactions throughout gameplay ("free-to-play"). We focus on a common scenario in which games include wait times---gating either items or game progression---that players can pay to skip. Game designers typically say that they optimize for player happiness rather than revenue; however, prices for skips are typically set at levels that few players are willing to pay, leading to low purchase rates. Under a traditional analysis, it would seem that game designers fail at their stated goal if few players buy what they are selling. We argue that an alternate model can better explain this dynamic: players value tasks more highly as they are perceived to be more difficult. While skips can increase players' utilities by providing instant gratification, pricing skips too cheaply can lower players' utilities by decreasing the perceived amount of work needed to complete a task. We show that high revenue, high player utility, and low purchase rates can all coexist under this model, particularly under a realistic distribution of players in which most players buy nothing but a few big-spending "whales" purchase heavily. We also investigate how a game designer should optimize prices under our model. An appendix of the paper with proofs, more comprehensive results and visualizations can be found at https://arxiv.org/abs/2312.10205.



Paperid:1093
Authors:Luisa Montanari, Ulrike Schmidt-Kraepelin, Warut Suksompong, Nicholas Teh
Technische Universität Berlin, TU Eindhoven, National University of Singapore, University of Oxford
Abstract:
We investigate the fair allocation of indivisible goods to agents with possibly different entitlements represented by weights. Previous work has shown that guarantees for additive valuations with existing envy-based notions cannot be extended to the case where agents have matroid-rank (i.e., binary submodular) valuations. We propose two families of envy-based notions for matroid-rank and general submodular valuations, one based on the idea of transferability and the other on marginal values. We show that our notions can be satisfied via generalizations of rules such as picking sequences and maximum weighted Nash welfare. In addition, we introduce welfare measures based on harmonic numbers, and show that variants of maximum weighted harmonic welfare offer stronger fairness guarantees than maximum weighted Nash welfare under matroid-rank valuations.
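For reference, the weighted Nash welfare that the rules above generalize can be computed as follows (a standard definition stated as a weighted geometric mean; the harmonic-welfare variants introduced in the paper are not reproduced here).

import math

def weighted_nash_welfare(utilities, weights):
    """Weighted Nash welfare as the weighted geometric mean of utilities,
    i.e. exp( sum_i w_i * log u_i / sum_i w_i ).  Requires positive utilities;
    the unnormalized product prod u_i**w_i has the same maximizers."""
    total_w = sum(weights)
    log_sum = sum(w * math.log(u) for u, w in zip(utilities, weights))
    return math.exp(log_sum / total_w)

print(weighted_nash_welfare([4.0, 1.0], [1.0, 1.0]))  # 2.0: unweighted geometric mean
print(weighted_nash_welfare([4.0, 1.0], [3.0, 1.0]))  # ~2.83: agent 1's weight pulls it up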



Paperid:1094
Authors:Nikolas Patris, Stelios Stavroulakis, Fivos Kalogiannis, Rose Zhang, Ioannis Panageas
University of California, Irvine Archimedes Research Unit, University of California, Irvine, University of California, Irvine Archimedes Research Unit, University of California, Irvine, University of California, Irvine
Abstract:
We consider the problem of computing Nash equilibria in potential games where each player's strategy set is subject to private uncoupled constraints. This scenario is frequently encountered in real-world applications like road network congestion games where individual drivers adhere to personal budget and fuel limitations. Despite the plethora of algorithms that efficiently compute Nash equilibria (NE) in potential games, the domain of constrained potential games remains largely unexplored. We introduce an algorithm that leverages the Lagrangian formulation of NE. The algorithm is implemented independently by each player and runs in polynomial time with respect to the approximation error, the sum of the sizes of the action spaces, and the game's inherent parameters.



Paperid:1095
Authors:Adam Richardson, Boi Faltings
EPFL, EPFL
Abstract:
Peer prediction incentive mechanisms for crowdsourcing are generally limited to eliciting samples from categorical distributions. Prior work on extending peer prediction to arbitrary distributions has largely relied on assumptions on the structures of the distributions or known properties of the data providers. We introduce a novel class of incentive mechanisms that extend peer prediction mechanisms to arbitrary distributions by replacing the notion of an exact match with a concept of neighborhood matching. We present conditions on the belief updates of the data providers that guarantee incentive compatibility for rational data providers, and admit a broad class of possible reasonable updates.



Paperid:1096
Authors:Ermis Nikiforos Soumalias, Jakob Weissteiner, Jakob Heiss, Sven Seuken
University of Zurich ETH AI Center, University of Zurich ETH AI Center, ETH Zürich ETH AI Center, University of Zurich ETH AI Center
Abstract:
We study the design of iterative combinatorial auctions (ICAs). The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning (ML)-based preference elicitation algorithms that aim to elicit only the most important information from bidders. However, from a practical point of view, the main shortcoming of this prior work is that those designs elicit bidders' preferences via value queries (i.e., “What is your value for the bundle {A, B}?''). In most real-world ICA domains, value queries are considered impractical, since they impose an unrealistically high cognitive burden on bidders, which is why they are not used in practice. In this paper, we address this shortcoming by designing an ML-powered combinatorial clock auction that elicits information from the bidders only via demand queries (i.e., “At prices p, what is your most preferred bundle of items?''). We make two key technical contributions: First, we present a novel method for training an ML model on demand queries. Second, based on those trained ML models, we introduce an efficient method for determining the demand query with the highest clearing potential, for which we also provide a theoretical foundation. We experimentally evaluate our ML-based demand query mechanism in several spectrum auction domains and compare it against the most established real-world ICA: the combinatorial clock auction (CCA). Our mechanism significantly outperforms the CCA in terms of efficiency in all domains, it achieves higher efficiency in a significantly reduced number of rounds, and, using linear prices, it exhibits vastly higher clearing potential. Thus, with this paper we bridge the gap between research and practice and propose the first practical ML-powered ICA.



Paperid:1097
Authors:Max Springer, MohammadTaghi Hajiaghayi, Hadi Yami
University of Maryland, College Park MD, University of Maryland, College Park MD, Microsoft Corporation, Redmond WA
Abstract:
We address the problem of fairly allocating indivisible goods or chores to n agents with weights that define their entitlement to the set of indivisible resources. Stemming from well-studied fairness concepts such as envy-freeness up to one good (EF1) and envy-freeness up to any good (EFX) for agents with equal entitlements, we present, in this study, the first set of impossibility results alongside algorithmic guarantees for fairness among agents with unequal entitlements. Within this paper, we expand the concept of envy-freeness up to any good or chore to the weighted context (WEFX and XWEF respectively), demonstrating that these allocations are not guaranteed to exist for two or three agents. Despite these negative results, we develop a WEFX procedure for two agents with integer weights, and furthermore, we devise an approximate WEFX procedure for two agents with normalized weights. We further present a polynomial-time algorithm that guarantees a weighted envy-free allocation up to one chore (1WEF) for any number of agents with additive cost functions. Our work underscores the heightened complexity of the weighted fair division problem when compared to its unweighted counterpart.



Paperid:1098
Authors:Kiran Tomlinson, Johan Ugander, Jon Kleinberg
Cornell University, Stanford University, Cornell University
Abstract:
Instant runoff voting (IRV) has recently gained popularity as an alternative to plurality voting for political elections, with advocates claiming a range of advantages, including that it produces more moderate winners than plurality and could thus help address polarization. However, there is little theoretical backing for this claim, with existing evidence focused on case studies and simulations. In this work, we prove that IRV has a moderating effect relative to plurality voting in a precise sense, developed in a 1-dimensional Euclidean model of voter preferences. We develop a theory of exclusion zones, derived from properties of the voter distribution, which serve to show how moderate and extreme candidates interact during IRV vote tabulation. The theory allows us to prove that if voters are symmetrically distributed and not too concentrated at the extremes, IRV cannot elect an extreme candidate over a moderate. In contrast, we show plurality can and validate our results computationally. Our methods provide new frameworks for the analysis of voting systems, deriving exact winner distributions geometrically and establishing a connection between plurality voting and stick-breaking processes.
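The 1-dimensional setting lends itself to a direct simulation; the sketch below tabulates plurality and IRV winners for voters and candidates on a line (an illustration of the model, not of the paper's proofs).

def rank_by_distance(voter, candidates):
    """A voter ranks candidates by distance from her ideal point (closest first)."""
    return sorted(candidates, key=lambda c: abs(voter - c))

def plurality_winner(voters, candidates):
    tallies = {c: 0 for c in candidates}
    for v in voters:
        tallies[rank_by_distance(v, candidates)[0]] += 1
    return max(tallies, key=tallies.get)

def irv_winner(voters, candidates):
    """Instant-runoff voting: repeatedly eliminate the candidate with the
    fewest first-choice votes among the remaining candidates."""
    remaining = list(candidates)
    while len(remaining) > 1:
        tallies = {c: 0 for c in remaining}
        for v in voters:
            tallies[rank_by_distance(v, remaining)[0]] += 1
        remaining.remove(min(tallies, key=tallies.get))
    return remaining[0]

# Voters spread uniformly on [-1, 1]; a left extremist, a moderate, a right extremist.
voters = [i / 100.0 for i in range(-100, 101)]
candidates = [-0.6, 0.1, 0.7]
# Plurality elects the extreme candidate -0.6; IRV elects the moderate 0.1.
print(plurality_winner(voters, candidates), irv_winner(voters, candidates))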



Paperid:1099
Authors:Giannis Tyrovolas, Andrei Constantinescu, Edith Elkind
Independent, ETH Zurich, University of Oxford
Abstract:
We consider binary group decision-making under a rich model of liquid democracy: agents submit ranked delegation options, where each option may be a function of multiple agents' votes; e.g., "I vote yes if a majority of my friends vote yes." Such ballots are unravelled into a profile of direct votes by selecting one entry from each ballot so as not to introduce cyclic dependencies. We study delegation via monotonic Boolean functions, and two unravelling procedures: MinSum, which minimises the sum of the ranks of the chosen entries, and its egalitarian counterpart, MinMax. We provide complete computational dichotomies: MinSum is hard to compute (and approximate) as soon as any non-trivial functions are permitted, and polynomial otherwise; for MinMax the easiness results extend to arbitrary-arity logical ORs and ANDs taken in isolation, but not beyond. For the classic model of delegating to individual agents, we give asymptotically near-tight algorithms for carrying out the two procedures and efficient algorithms for finding optimal unravellings with the highest vote count for a given alternative. These algorithms inspire novel tie-breaking rules for the setup of voting to change a status quo. We then introduce a new axiom, which can be viewed as a variant of the participation axiom, and use algorithmic techniques developed earlier in the paper to show that it is satisfied by MinSum and a lexicographic refinement of MinMax (but not MinMax itself).



Paperid:1100
Authors:Yujia Wang, Haoran Yu
Beijing Institute of Technology, Beijing Institute of Technology
Abstract:
Game theory and machine learning are two widely used techniques for predicting the outcomes of strategic interactions among humans. However, the game theory-based approach often relies on strong rationality and informational assumptions, while the machine learning-based approach typically requires the testing data to come from the same distribution as the training data. Our work studies how to integrate the two techniques to address these weaknesses. We focus on the interactions among real bidders in penny auctions, and develop a three-stage framework to predict the distributions of auction durations, which indicate the numbers of bids and auctioneer revenues. Specifically, we first leverage a pre-trained neural network to encode the descriptions of products in auctions into embeddings. Second, we apply game theory models to make preliminary predictions of auction durations. In particular, we tackle the challenge of accurately inferring parameters in game theory models. Third, we develop a Multi-Branch Mixture Density Network to learn the mapping from product embeddings and game-theoretic predictions to the distributions of actual auction durations. Experiments on real-world penny auction data demonstrate that our framework outperforms both game theory-based and machine learning-based prediction approaches.



Paperid:1101
Authors:Yadong Xu, Bonan Ni, Weiran Shen, Xun Wang, Zichen Wang, Yinsong Xue, Pingzhong Tang
Institute for Interdisciplinary Information Sciences, Tsinghua University, Institute for Interdisciplinary Information Sciences, Tsinghua University, Gaoling School of Artificial Intelligence, Renmin University of China, Institute for Interdisciplinary Information Sciences, Tsinghua University, ByteDance, ByteDance, Institute for Interdisciplinary Information Sciences, Tsinghua University
Abstract:
Online advertising has been one of the most important sources for industry's growth, where demand-side platforms (DSPs) play an important role via bidding to the ad exchanges on behalf of their advertiser clients. Since more and more ad exchanges have shifted from second-price to first-price auctions, it is challenging for DSPs to adjust bidding strategy in the volatile environment. Recent studies on bid shading in first-price auctions may have limited performance due to relatively strong hypotheses about winning probability distribution. Moreover, these studies do not consider the incentive of advertiser clients, which can be crucial for a reliable advertising platform. In this work, we consider both the optimization of bid shading technique and the design of internal auction which is ex-post incentive compatible (IC) for the management of a DSP. Firstly, we prove that the joint design of bid shading and ex-post IC auction can be reduced to choosing one monotone bid function for each advertiser without loss of optimality. Then we propose a parameterized neural network to implement the monotone bid functions. With a well-designed surrogate loss, the objective can be optimized in an end-to-end manner. Finally, our experimental results demonstrate the effectiveness and superiority of our algorithm.



Paperid:1102
Authors:Yixuan Even Xu, Chun Kai Ling, Fei Fang
Tsinghua University, Columbia University Carnegie Mellon University, Carnegie Mellon University
Abstract:
Coalitions naturally exist in many real-world systems involving multiple decision makers such as ridesharing, security, and online ad auctions, but the coalition structure among the agents is often unknown. We propose and study an important yet previously overlooked problem -- Coalition Structure Learning (CSL), where we aim to carefully design a series of games for the agents and infer the underlying coalition structure by observing their interactions in those games. We establish a lower bound on the sample complexity -- defined as the number of games needed to learn the structure -- of any algorithm for CSL and propose the Iterative Grouping (IG) algorithm for designing normal-form games to achieve the lower bound. We show that IG can be extended to other succinct games such as congestion games and graphical games. Moreover, we solve CSL in a more restrictive and practical setting: auctions. We show a variant of IG to solve CSL in the auction setting even if we cannot design the bidder valuations. Finally, we conduct experiments to evaluate IG in the auction setting and the results align with our theoretical analysis.



Paperid:1103
Authors:Yixuan Even Xu, Hanrui Zhang, Vincent Conitzer
Tsinghua University, Simons Laufer Mathematical Sciences Institute, Carnegie Mellon University
Abstract:
Bilateral trade is one of the most natural and important forms of economic interaction: A seller has a single, indivisible item for sale, and a buyer is potentially interested. The two parties typically have different, privately known valuations for the item, and ideally, they would like to trade if the buyer values the item more than the seller. The celebrated impossibility result by Myerson and Satterthwaite shows that any mechanism for this setting must violate at least one important desideratum. In this paper, we investigate a richer paradigm of bilateral trade, with many self-interested buyers and sellers on both sides of a single trade who cannot be excluded from the trade. We show that this allows for more positive results. In fact, we establish a dichotomy in the possibility of trading efficiently. If in expectation, the buyers value the item more, we can achieve efficiency in the limit. If this is not the case, then efficiency cannot be achieved in general. En route, we characterize trading mechanisms that encourage truth-telling, which may be of independent interest. We also evaluate our trading mechanisms experimentally, and the experiments align with our theoretical results.



Paperid:1104
Authors:Zongjun Yang, Luofeng Liao, Christian Kroer
School of Electronics Engineering and Computer Science, Peking University, Columbia University, Columbia University
Abstract:
We study an online allocation problem with sequentially arriving items and adversarially chosen agent values, with the goal of balancing fairness and efficiency. We ask whether algorithms that achieve strong guarantees under other input models, such as stochastic inputs, also achieve robust guarantees against a variety of inputs. To that end, we study the PACE (Pacing According to Current Estimated utility) algorithm, an existing algorithm designed for stochastic input. We show that in the equal-budgets case, PACE is equivalent to an integral greedy algorithm. We go on to show that with natural restrictions on the adversarial input model, both the greedy allocation and PACE have asymptotically bounded multiplicative envy as well as a bounded competitive ratio for Nash welfare, with the multiplicative factors either constant or with optimal order dependence on the number of agents. This completes a "best-of-many-worlds" guarantee for PACE, since past work showed that PACE achieves guarantees for stationary and stochastic-but-non-stationary input models.
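The abstract's equivalence between PACE and an integral greedy rule in the equal-budgets case can be illustrated with a small sketch. The greedy rule below assigns each arriving item to the agent whose utility would grow by the largest multiplicative factor, a natural greedy proxy for Nash welfare; whether this matches the paper's exact greedy algorithm is an assumption.

```python
import numpy as np

def greedy_nash_allocate(values: np.ndarray) -> np.ndarray:
    """Integrally assign each arriving item (a row of `values`) to one agent.

    values[t, i] is agent i's value for item t. Each item goes to the agent
    whose utility would grow by the largest multiplicative factor, a greedy
    proxy for maximizing Nash welfare (the product of utilities).
    """
    n_items, n_agents = values.shape
    utility = np.zeros(n_agents)
    assignment = np.empty(n_items, dtype=int)
    for t in range(n_items):
        v = values[t]
        with np.errstate(divide="ignore", invalid="ignore"):
            gain = np.where(utility > 0, (utility + v) / utility, np.inf)
        # Break ties (e.g., several zero-utility agents) by the raw value.
        winner = max(range(n_agents), key=lambda i: (gain[i], v[i]))
        assignment[t] = winner
        utility[winner] += v[winner]
    return assignment

# Toy usage on a random instance with 6 items and 3 agents.
rng = np.random.default_rng(0)
print(greedy_nash_allocate(rng.random((6, 3))))
```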



Paperid:1105
Authors:Brian Hu Zhang, Tuomas Sandholm
Carnegie Mellon University, Carnegie Mellon University Strategy Robot, Inc. Optimized Markets, Inc. Strategic Machine, Inc.
Abstract:
We investigate two notions of correlated equilibrium for extensive-form games: the extensive-form correlated equilibrium (EFCE) and the behavioral correlated equilibrium (BCE). We show that the two are outcome-equivalent, in the sense that every outcome distribution achievable under one notion is achievable under the other. Our result implies, to our knowledge, the first polynomial-time algorithm for computing a BCE.



Paperid:1106
Authors:Yichi Zhang, Grant Schoenebeck, Weijie Su
University of Michigan, University of Michigan, University of Pennsylvania
Abstract:
In the setting of conference peer review, the conference aims to accept high-quality papers and reject low-quality papers based on noisy review scores. A recent work proposes the isotonic mechanism, which can elicit the ranking of paper qualities from an author with multiple submissions to help improve the conference's decisions. However, the isotonic mechanism relies on the assumption that the author's utility is both an increasing and a convex function with respect to the review score, which is often violated in realistic settings (e.g., when authors aim to maximize the number of accepted papers). In this paper, we propose a sequential review mechanism that can truthfully elicit the ranking information from authors while only assuming that the agent's utility is increasing with respect to the true quality of her accepted papers. The key idea is to review the papers of an author in sequence based on the provided ranking, conditioning the review of the next paper on the review scores of the previous papers. Advantages of the sequential review mechanism include: 1) eliciting truthful ranking information in a more realistic setting than prior work; 2) reducing the reviewing workload and increasing the average quality of papers being reviewed; 3) incentivizing authors to write fewer papers of higher quality.



Paperid:1107
Authors:Houyu Zhou, Tianze Wei, Biaoshuai Tao, Minming Li
City University of Hong Kong, City University of Hong Kong, Shanghai Jiao Tong University, City University of Hong Kong
Abstract:
We initiate the study of fair allocation with a set of divisible or indivisible items distributed across multiple regions. The key requirement is that each agent can only obtain items from one region. In this work, we consider two kinds of fairness concepts: envy-based notions including envy-freeness (EF) and envy-freeness up to one/any item (EF1/EFX), and share-based notions including proportionality (PROP) and proportionality up to one/any item (PROP1/PROPX). On the negative side, we show NP-hardness and inapproximability results for the aforementioned fairness notions. On the positive side, we propose several algorithms to compute partial allocations that satisfy envy-based notions and allocations that approximate the above fairness notions.



Paperid:1108
Authors:Houyu Zhou, Hau Chan, Minming Li
City University of Hong Kong, University of Nebraska-Lincoln, City University of Hong Kong
Abstract:
We study facility location problems (FLPs) with altruistic agents who act to benefit others in their affiliated groups. Our aim is to design mechanisms that elicit true locations from the agents in different overlapping groups and place a facility to serve agents so as to approximately optimize a given objective based on agents' costs to the facility. Existing studies of FLPs consider myopic agents who aim to minimize their own costs to the facility. We mainly consider altruistic agents with well-motivated group costs that are defined over the costs incurred by all agents in their groups. Accordingly, we define Pareto strategyproofness to account for altruistic agents and their multiple group memberships with incomparable group costs. We consider mechanisms satisfying this strategyproofness under various combinations of the planner's objectives and agents' group costs. For each of these settings, we provide upper and lower bounds on the approximation ratios of mechanisms satisfying Pareto strategyproofness.



Paperid:1109
Authors:Yotam Amitai, Yael Septon, Ofra Amir
Technion - Israel Institute of Technology, Faculty of Data and Decision Science, Technion - Israel Institute of Technology, Faculty of Data and Decision Science, Technion - Israel Institute of Technology, Faculty of Data and Decision Science
Abstract:
Explainable reinforcement learning (XRL) methods aim to help elucidate agent policies and decision-making processes. The majority of XRL approaches focus on local explanations, seeking to shed light on the reasons an agent acts the way it does at a specific world state. While such explanations are both useful and necessary, they typically do not portray the outcomes of the agent's selected choice of action. In this work, we propose ``COViz'', a new local explanation method that visually compares the outcome of an agent's chosen action to a counterfactual one. In contrast to most local explanations that provide state-limited observations of the agent's motivation, our method depicts alternative trajectories the agent could have taken from the given state and their outcomes. We evaluated the usefulness of COViz in supporting people's understanding of agents' preferences and compared it with reward decomposition, a local explanation method that describes an agent's expected utility for different actions by decomposing it into meaningful reward types. Furthermore, we examined the complementary benefits of integrating both methods. Our results show that such integration significantly improved participants' performance.



Paperid:1110
Authors:Lvye Cui, Haoran Yu
Beijing Institute of Technology, Beijing Institute of Technology
Abstract:
Inferring the private information of humans from their strategic behavioral data is crucial and challenging. The main approach is to first obtain human behavior functions (which map public information and human private information to behavior), enabling subsequent inference of private information from observed behavior. Most existing studies rely on strong equilibrium assumptions to obtain behavior functions. Our work focuses on continuous double auctions, where multiple traders with heterogeneous rationalities and beliefs dynamically trade commodities, and deriving equilibria is generally intractable. We develop a knowledge-aware machine learning-based framework to infer each trader's private cost vectors for producing different units of its commodity. Our key idea is to learn behavior functions by incorporating the statistical knowledge about private costs given the observed trader asking behavior across the population. Specifically, we first use a neural network to characterize each trader's behavior function. Second, we leverage the statistical knowledge to derive the posterior distribution of each trader's private costs given its observed asks. Third, through designing a novel loss function, we utilize the knowledge-based posterior distributions to guide the learning of the neural network. We conduct extensive experiments on a large experimental dataset, and demonstrate the superior performance of our framework over baselines in inferring the private information of humans.



Paperid:1111
Authors:Shiqi Dai, Xuanyu Zhu, Naiqi Li, Tao Dai, Zhi Wang
Tsinghua University, Tsinghua University, Tsinghua University, Shenzhen University, Tsinghua University Peng Cheng Laboratory
Abstract:
Level generation is a central focus of Procedural Content Generation (PCG), yet deep learning-based approaches are limited by scarce training data, i.e., human-designed levels. Despite being a dominant framework, Generative Adversarial Networks (GANs) exhibit a substantial quality gap between generated and human-authored levels, alongside rising training costs, particularly with increasing token complexity. In this paper, we introduce a diffusion-based generative model that learns from just one example. Our approach involves two core components: 1) an efficient yet expressive level representation, and 2) a latent denoising network with constrained receptive fields. To start with, our method utilizes token semantic labels, similar to word embeddings, to provide dense representations. This strategy not only surpasses one-hot encoding in representing larger game levels but also improves stability and accelerates convergence in latent diffusion. In addition, we adapt the denoising network architecture to confine the receptive field to localized patches of the data, aiming to facilitate single-example learning. Extensive experiments demonstrate that our model is capable of generating stylistically congruent samples of arbitrary sizes compared to manually designed levels. It suits a wide range of level structures with fewer artifacts than GAN-based approaches. The source code is available at https://github.com/shiqi-dai/diffusioncraft.



Paperid:1112
Authors:Kate Donahue, Sreenivas Gollapudi, Kostas Kollias
Cornell, Google Research, Google Research
Abstract:
Historically, much of machine learning research has focused on the performance of the algorithm alone, but recently more attention has been focused on optimizing joint human-algorithm performance. Here, we analyze a specific type of human-algorithm collaboration where the algorithm has access to a set of n items, and presents a subset of size k to the human, who selects a final item from among those k. This scenario could model content recommendation, route planning, or any type of labeling task. Because both the human and algorithm have imperfect, noisy information about the true ordering of items, the key question is: which value of k maximizes the probability that the best item will be ultimately selected? For k=1, performance is optimized by the algorithm acting alone, and for k=n it is optimized by the human acting alone. Surprisingly, we show that for multiple noise models, it is optimal to set k in [2, n-1] - that is, there are strict benefits to collaborating, even when the human and algorithm have equal accuracy separately. We demonstrate this theoretically for the Mallows model and experimentally for the Random Utility models of noisy permutations. However, we show this pattern is *reversed* when the human is anchored on the algorithm's presented ordering - the joint system always has strictly worse performance. We extend these results to the case where the human and algorithm differ in their accuracy levels, showing that there always exist regimes where a more accurate agent would strictly benefit from collaborating with a less accurate one, but these regimes are asymmetric between the human and the algorithm's accuracy.
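The core claim is easy to probe numerically. The sketch below simulates a simple random-utility noise model (an assumption; the paper analyzes the Mallows model theoretically and random-utility models experimentally): the algorithm shortlists its noisy top-k, the human picks from the shortlist using independent noisy scores, and we estimate how often the true best item survives as k varies.

```python
import numpy as np

def best_item_prob(n=10, noise=1.0, trials=20000, seed=0):
    """Estimate P(true best item is selected) for every shortlist size k.

    The algorithm and the human each observe the true item qualities
    corrupted by independent Gaussian noise (a simple random-utility model).
    """
    rng = np.random.default_rng(seed)
    quality = np.arange(n, dtype=float)          # item n-1 is the true best
    wins = np.zeros(n + 1)
    for _ in range(trials):
        algo_scores = quality + noise * rng.standard_normal(n)
        human_scores = quality + noise * rng.standard_normal(n)
        order = np.argsort(-algo_scores)         # algorithm's ranking
        for k in range(1, n + 1):
            shortlist = order[:k]
            chosen = shortlist[np.argmax(human_scores[shortlist])]
            wins[k] += (chosen == n - 1)
    return {k: wins[k] / trials for k in range(1, n + 1)}

# Intermediate k typically beats both k=1 (algorithm alone) and k=n (human alone).
print(best_item_prob())
```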



Paperid:1113
Authors:Dongrui Gao, Haokai Zhang, Pengrui Li, Tian Tang, Shihong Liu, Zhihong Zhou, Shaofei Ying, Ye Zhu, Yongqing Zhang
Chengdu University of Information Technology University of Electronic Science and Technology of China, Chengdu University of Information Technology, Chengdu University of Information Technology University of Electronic Science and Technology of China, Chengdu University of Information Technology, Chengdu University of Information Technology, Chengdu University of Information Technology, University of Electronic Science and Technology of China, Chengdu University of Information Technology, Chengdu University of Information Technology
Abstract:
Neuroscience research indicates that the interaction among different functional regions of the brain plays a crucial role in driving various cognitive tasks. Existing studies have primarily focused on constructing either local or global functional connectivity maps within the brain, often lacking an adaptive approach to fuse functional brain regions and explore the latent relationships between localized regions during different cognitive tasks. This paper introduces a novel approach called the Local-Ascending-Global Learning Strategy (LAG) to uncover higher-level latent topological patterns among functional brain regions. The strategy initiates from the local connectivity of individual brain functional regions and develops a K-Level Self-Adaptive Ascending Network (SALK) to dynamically capture strong connectivity patterns among brain regions during different cognitive tasks. Through the step-by-step fusion of brain regions, this approach captures higher-level latent patterns, shedding light on the progressively adaptive fusion of various brain functional regions under different cognitive tasks. Notably, this study represents the first exploration of higher-level latent patterns through progressively adaptive fusion of diverse brain functional regions under different cognitive tasks. The proposed LAG strategy is validated using datasets related to fatigue (SEED-VIG), emotion (SEED-IV), and motor imagery (BCI_C_IV_2a). The results demonstrate the generalizability of LAG, achieving satisfactory outcomes in independent-subject experiments across all three datasets. This suggests that LAG effectively characterizes higher-level latent patterns associated with different cognitive tasks, presenting a novel approach to understanding brain patterns in varying cognitive contexts.



Paperid:1114
Authors:Dongyu Gong, Xingchen Wan, Dingmin Wang
University of Oxford Yale University, University of Oxford, University of Oxford
Abstract:
Working memory is a critical aspect of both human intelligence and artificial intelligence, serving as a workspace for the temporary storage and manipulation of information. In this paper, we systematically assess the working memory capacity of ChatGPT, a large language model developed by OpenAI, by examining its performance in verbal and spatial n-back tasks under various conditions. Our experiments reveal that ChatGPT has a working memory capacity limit strikingly similar to that of humans. Furthermore, we investigate the impact of different instruction strategies on ChatGPT's performance and observe that the fundamental patterns of a capacity limit persist. From our empirical findings, we propose that n-back tasks may serve as tools for benchmarking the working memory capacity of large language models and hold potential for informing future efforts aimed at enhancing AI working memory.
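For readers unfamiliar with the paradigm, the sketch below generates a verbal n-back trial sequence and scores yes/no match responses. The alphabet size, sequence length, and match rate are illustrative choices, not the prompt design used in the paper.

```python
import random
import string

def make_nback_trials(n: int, length: int = 24, match_rate: float = 0.3, seed: int = 0):
    """Generate an n-back letter sequence with ground-truth match labels.

    Position i (for i >= n) is a match when letters[i] == letters[i - n].
    """
    rng = random.Random(seed)
    letters, labels = [], []
    for i in range(length):
        if i >= n and rng.random() < match_rate:
            letters.append(letters[i - n])            # planted match
        else:
            letters.append(rng.choice(string.ascii_uppercase[:8]))
        labels.append(i >= n and letters[i] == letters[i - n])
    return letters, labels

def score_responses(labels, responses):
    """Return (hit rate, false-alarm rate) for yes/no match responses."""
    hits = sum(r and l for r, l in zip(responses, labels))
    false_alarms = sum(r and not l for r, l in zip(responses, labels))
    positives = sum(labels) or 1
    negatives = (len(labels) - sum(labels)) or 1
    return hits / positives, false_alarms / negatives

letters, labels = make_nback_trials(n=2)
print(" ".join(letters))
print(score_responses(labels, labels))   # a perfect responder scores (1.0, 0.0)
```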



Paperid:1115
Authors:Yifeng Huang, Duc Duy Nguyen, Lam Nguyen, Cuong Pham, Minh Hoai
Stony Brook University, NY, USA, VinAI, Hanoi, Vietnam, VinAI, Hanoi, Vietnam, VinAI, Hanoi, Vietnam Posts & Telecommunications Institute of Technology, Hanoi, Vietnam, Stony Brook University, NY, USA VinAI, Hanoi, Vietnam
Abstract:
This paper addresses the task of counting human actions of interest using sensor data from wearable devices. We propose a novel exemplar-based framework, allowing users to provide exemplars of the actions they want to count by vocalizing the predefined sounds ``one'', ``two'', and ``three''. Our method first localizes the temporal positions of these utterances in the audio sequence. These positions serve as the basis for identifying exemplars representing the action class of interest. A similarity map is then computed between the exemplars and the entire sensor data sequence, which is further fed into a density estimation module to generate a sequence of estimated density values. Summing these density values provides the final count. To develop and evaluate our approach, we introduce a diverse and realistic dataset consisting of real-world data from 37 subjects and 50 action categories, encompassing both sensor and audio data. The experiments on this dataset demonstrate the viability of the proposed method in counting instances of actions from new classes and subjects that were not part of the training data. On average, the discrepancy between the predicted count and the ground truth value is 7.47, significantly lower than the errors of the frequency-based and transformer-based methods. Our project, code and dataset can be found at https://github.com/cvlab-stonybrook/ExRAC.
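The counting pipeline (exemplar, similarity map, density, sum) can be illustrated with a toy sketch. In the paper the density values come from a learned estimation module over learned features; the version below instead computes a raw cosine-similarity map and counts its thresholded local maxima, so the function names, window handling, and threshold are illustrative assumptions only.

```python
import numpy as np

def similarity_map(signal: np.ndarray, exemplar: np.ndarray) -> np.ndarray:
    """Cosine similarity between an exemplar window and every window of a 1-D signal."""
    w = len(exemplar)
    windows = np.lib.stride_tricks.sliding_window_view(signal, w)
    num = windows @ exemplar
    den = np.linalg.norm(windows, axis=1) * np.linalg.norm(exemplar) + 1e-8
    return num / den

def crude_count(signal: np.ndarray, exemplar: np.ndarray, threshold: float = 0.9) -> int:
    """Very rough count: local maxima of the similarity map above a threshold."""
    sim = similarity_map(signal, exemplar)
    peaks = (sim[1:-1] > threshold) & (sim[1:-1] >= sim[:-2]) & (sim[1:-1] >= sim[2:])
    return int(peaks.sum())

# Toy usage: a signal containing five copies of an exemplar pulse plus mild noise.
rng = np.random.default_rng(0)
pulse = np.sin(np.linspace(0, np.pi, 25))
signal = np.concatenate([np.concatenate([pulse, np.zeros(40)]) for _ in range(5)])
signal = signal + 0.01 * rng.standard_normal(signal.size)
print(crude_count(signal, pulse))   # expect about 5
```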



Paperid:1116
Authors:W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson, Serena Booth, Anca Dragan, Peter Stone, Scott Niekum
University of Texas at Austin Google, University of Texas at Austin, Google, MIT, University of California, Berkeley, University of Texas at Austin Sony AI, University of Massachusetts Amherst
Abstract:
We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most recent work assumes that human preferences are generated based only upon the reward accrued within those segments, or their partial return. Recent work casts doubt on the validity of this assumption, proposing an alternative preference model based upon regret. We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret. We argue that the learned function is an approximation of the optimal advantage function, not a reward function. We find that if a specific pitfall is addressed, this incorrect assumption is not particularly harmful, resulting in a highly shaped reward function. Nonetheless, this incorrect usage of the approximation of the optimal advantage function is less desirable than the appropriate and simpler approach of greedy maximization of it. From the perspective of the regret preference model, we also provide a clearer interpretation of fine-tuning contemporary large language models with RLHF. This paper overall provides insight regarding why learning under the partial return preference model tends to work so well in practice, despite it conforming poorly to how humans give preferences.
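For reference, the partial-return preference model the abstract refers to is usually instantiated as a Boltzmann (Bradley-Terry) choice over the summed rewards of the two segments; the exact variant assumed in the paper is not restated here, so treat the rationality coefficient beta as an illustrative parameter.

```python
import numpy as np

def partial_return_pref(rewards_a, rewards_b, beta: float = 1.0) -> float:
    """P(segment A preferred to B) under a Boltzmann partial-return model.

    The preference depends only on the summed reward of each segment; the
    regret-based model discussed in the paper replaces these sums with a
    different statistic of the segments.
    """
    ra, rb = beta * np.sum(rewards_a), beta * np.sum(rewards_b)
    m = max(ra, rb)                    # subtract the max for numerical stability
    return float(np.exp(ra - m) / (np.exp(ra - m) + np.exp(rb - m)))

# A segment with higher partial return is preferred with probability > 0.5.
print(partial_return_pref([1.0, 0.5, 0.0], [0.2, 0.2, 0.2]))
```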



Paperid:1117
Authors:Mingcheng Li, Dingkang Yang, Yuxuan Lei, Shunli Wang, Shuaibing Wang, Liuzhen Su, Kun Yang, Yuzheng Wang, Mingyang Sun, Lihua Zhang
Fudan University CIT Lab, Fudan University CIT Lab, Fudan University, Fudan University, Fudan University, Fudan University, Fudan University, Fudan University, Fudan University, Fudan University CIT Lab Engineering Research Center of AI and Robotics, Ministry of Education Jilin Provincial Key Laboratory of Intelligence Science and Engineering
Abstract:
Multimodal Sentiment Analysis (MSA) has attracted widespread research attention recently. Most MSA studies are based on the assumption of modality completeness. However, many inevitable factors in real-world scenarios lead to uncertain missing modalities, which invalidate fixed multimodal fusion approaches. To this end, we propose a Unified multimodal Missing modality self-Distillation Framework (UMDF) to handle the problem of uncertain missing modalities in MSA. Specifically, a unified self-distillation mechanism in UMDF drives a single network to automatically learn robust inherent representations from the consistent distribution of multimodal data. Moreover, we present a multi-grained cross-modal interaction module to deeply mine the complementary semantics among modalities through coarse- and fine-grained cross-modal attention. Eventually, a dynamic feature integration module is introduced to enhance the beneficial semantics in incomplete modalities while filtering out the redundant information therein to obtain a refined and robust multimodal representation. Comprehensive experiments on three datasets demonstrate that our framework significantly improves MSA performance under both uncertain missing-modality and complete-modality testing conditions.



Paperid:1118
Authors:Zhuoyan Li, Zhuoran Lu, Ming Yin
Purdue University, Purdue University, Purdue University
Abstract:
With the rapid development of AI-based decision aids, different forms of AI assistance have been increasingly integrated into human decision making processes. To best support humans in decision making, it is essential to quantitatively understand how diverse forms of AI assistance influence humans' decision making behavior. To this end, much of the current research focuses on the end-to-end prediction of human behavior using ``black-box'' models, often lacking interpretations of the nuanced ways in which AI assistance impacts the human decision making process. Meanwhile, methods that prioritize the interpretability of human behavior predictions are often tailored for one specific form of AI assistance, making adaptations to other forms of assistance difficult. In this paper, we propose a computational framework that can provide an interpretable characterization of the influence of different forms of AI assistance on decision makers in AI-assisted decision making. By conceptualizing AI assistance as a ``nudge'' in human decision making processes, our approach centers around modelling how different forms of AI assistance modify humans' strategy in weighing different information in making their decisions. Evaluations on behavior data collected from real human decision makers show that the proposed framework outperforms various baselines in accurately predicting human behavior in AI-assisted decision making. Based on the proposed framework, we further provide insights into how individuals with different cognitive styles are nudged by AI assistance differently.



Paperid:1119
Authors:Chenglong Liu, Haoran Wei, Jinze Yang, Jintao Liu, Wenxi Li, Yuchen Guo, Lu Fang
University of Chinese Academy of Sciences BNRist, Tsinghua University, MEGVII Technology, University of Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai Jiao Tong University, BNRist, Tsinghua University, BNRist, Tsinghua University
Abstract:
Performing person detection in super-high-resolution images has been a challenging task. For such a task, modern detectors, which usually encode a box using its center and width/height, struggle with accuracy due to two factors: 1) Human characteristics: people appear in various postures, and a center point with such a high degree of freedom makes it difficult to capture robust visual patterns; 2) Image characteristics: due to the vast scale diversity of the input (gigapixel-level), distance regression (for width and height) is hard to pinpoint, especially for a person of substantial scale who is near the camera. To address these challenges, we propose GigaHumanDet, an innovative solution aimed at further enhancing detection accuracy for gigapixel-level images. GigaHumanDet employs a corner modeling method to avoid the potential issues of a high degree of freedom in center pinpointing. To better distinguish similar-looking persons and enforce instance consistency of corner pairs, an instance-guided learning approach is designed to capture discriminative individual semantics. Further, we devise reliable shape-aware bodyness, equipped with a multi-precision strategy, as the human corner matching guidance, appropriately adapted to the single-view large scene. Experimental results on the PANDA and STCrowd datasets show the superiority and strong applicability of our design. Notably, our model achieves 82.4% in terms of AP, outperforming the current state of the art by more than 10%.



Paperid:1120
Authors:Bingjun Luo, Haowen Wang, Jinpeng Wang, Junjie Zhu, Xibin Zhao, Yue Gao
Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University
Abstract:
With its strong robustness to illumination variations, near-infrared (NIR) imaging can be an effective and essential complement to visible (VIS) facial expression recognition in low lighting or complete darkness conditions. However, facial expression recognition (FER) from NIR images presents a more challenging problem than traditional FER due to the limitations imposed by the data scale and the difficulty of extracting discriminative features from incomplete visible lighting contents. In this paper, we make the first attempt at deep NIR facial expression recognition and propose a novel method called the near-infrared facial expression transformer (NFER-Former). Specifically, to make full use of the abundant label information in the VIS domain, we introduce a Self-Attention Orthogonal Decomposition mechanism that disentangles the expression information and spectrum information from the input image, so that the expression features can be extracted without the interference of spectrum variation. We also propose a Hypergraph-Guided Feature Embedding method that models some key facial behaviors and learns the structure of the complex correlations between them, thereby alleviating the interference of inter-class similarity. Additionally, we construct a large NIR-VIS Facial Expression dataset that includes 360 subjects to better validate the effectiveness of NFER-Former. Extensive experiments and ablation studies show that NFER-Former significantly improves the performance of NIR FER and achieves state-of-the-art results on the only two available NIR FER datasets, Oulu-CASIA and Large-HFE.



Paperid:1121
Authors:Malek Mechergui, Sarath Sreedharan
Colorado State University, Colorado State University
Abstract:
While the question of misspecified objectives has received much attention in recent years, most works in this area primarily focus on the challenges related to the complexity of the objective specification mechanism (for example, the use of reward functions). However, the complexity of the objective specification mechanism is just one of many reasons why the user may have misspecified their objective. A foundational cause of misspecification that is overlooked by these works is the inherent asymmetry between human expectations about the agent's behavior and the behavior generated by the agent for the specified objective. To address this, we propose a novel formulation of the objective misspecification problem that builds on the human-aware planning literature, which was originally introduced to support explanation and explicable behavior generation. Additionally, we propose a first-of-its-kind interactive algorithm that is capable of using information generated under incorrect beliefs about the agent to determine the true underlying goal of the user.



Paperid:1122
Authors:Reshef Meir, Viet-An Nguyen, Xu Chen, Jagdish Ramakrishnan, Udi Weinsberg
Technion--Israel Institute of Technology, Central Applied Science, Meta, Central Applied Science, Meta, Central Applied Science, Meta, Central Applied Science, Meta
Abstract:
Crowdsourcing platforms use various truth discovery algorithms to aggregate annotations from multiple labelers. In an online setting, however, the main challenge is to decide whether to ask for more annotations for each item, to efficiently trade off cost (i.e., the number of annotations) against the quality of the aggregated annotations. In this paper, we propose a novel approach for general complex annotations (such as bounding boxes and taxonomy paths) that works in an online crowdsourcing setting. We prove that the expected average similarity of a labeler is linear in their accuracy conditional on the reported label. This enables us to infer reported label accuracy in a broad range of scenarios. We conduct extensive evaluations on real-world crowdsourcing data from Meta and show the effectiveness of our proposed online algorithms in improving the cost-quality trade-off.



Paperid:1123
Authors:Katherine Metcalf, Miguel Sarabia, Masha Fedzechkina, Barry-John Theobald
Apple, Apple, Apple, Apple
Abstract:
Preference-based Reinforcement Learning (PbRL) enables non-experts to train Reinforcement Learning models using preference feedback. However, the effort required to collect preference labels from real humans means that PbRL research primarily relies on synthetic labellers. We validate the most common synthetic labelling strategy by comparing against labels collected from a crowd of humans on three DeepMind Control (DMC) suite tasks: stand, walk, and run. We find that: (1) the synthetic labels are a good proxy for real humans under some circumstances, (2) strong preference label agreement between human and synthetic labels is not necessary for similar policy performance, (3) policy performance is higher at the start of training from human feedback and is higher at the end of training from synthetic feedback, and (4) training on only examples with high levels of inter-annotator agreement does not meaningfully improve policy performance. Our results justify the use of synthetic labellers to develop and ablate PbRL methods, and provide insight into how human labelling changes over the course of policy training.



Paperid:1124
Authors:Hussein Mozannar, Gagan Bansal, Adam Fourney, Eric Horvitz
Massachusetts Institute of Technology, MSR, MSR, MSR
Abstract:
AI-powered code-recommendation systems, such as Copilot and CodeWhisperer, provide code suggestions inside a programmer's environment (e.g., an IDE) with the aim of improving productivity. We pursue mechanisms for leveraging signals about programmers' acceptance and rejection of code suggestions to guide recommendations. We harness data drawn from interactions with GitHub Copilot, a system used by millions of programmers, to develop interventions that can save time for programmers. We introduce a utility-theoretic framework to drive decisions about which suggestions to display versus withhold. The approach, conditional suggestion display from human feedback (CDHF), relies on a cascade of models that provide the likelihood that recommended code will be accepted. These likelihoods are used to selectively hide suggestions, reducing both latency and programmer verification time. Using data from 535 programmers, we perform a retrospective evaluation of CDHF and show that we can avoid displaying a significant fraction of suggestions that would have been rejected. We further demonstrate the importance of incorporating the programmer's latent unobserved state in decisions about when to display suggestions through an ablation study. Finally, we showcase how using suggestion acceptance as a reward signal for guiding the display of suggestions can lead to suggestions of reduced quality, indicating an unexpected pitfall.
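A minimal utility-theoretic show/withhold rule in the spirit of what the abstract describes might look like the sketch below: display a suggestion only when its predicted acceptance probability makes the expected time saved outweigh the expected verification cost. The numeric costs and the function name are illustrative placeholders, not values or interfaces from the paper.

```python
def should_display(p_accept: float,
                   time_saved_if_accepted: float = 8.0,
                   verification_cost: float = 2.0) -> bool:
    """Show a suggestion only if its expected utility beats withholding it.

    Showing costs verification time regardless of the outcome and saves
    typing time only when the suggestion is accepted; withholding has zero
    utility. All constants here are illustrative, not measured values.
    """
    expected_utility_show = p_accept * time_saved_if_accepted - verification_cost
    return expected_utility_show > 0.0

# A suggestion predicted to be accepted 10% of the time is withheld,
# while one predicted at 60% is shown.
print(should_display(0.10), should_display(0.60))
```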



Paperid:1125
Authors:Daehee Park, Jaewoo Jeong, Kuk-Jin Yoon
KAIST, KAIST, KAIST
Abstract:
Multi-agent trajectory prediction is crucial for various practical applications, spurring the construction of many large-scale trajectory datasets, including vehicles and pedestrians. However, discrepancies exist among datasets due to external factors and data acquisition strategies. External factors include geographical differences and driving styles, while data acquisition strategies include data acquisition rate, history/prediction length, and detector/tracker error. Consequently, the proficient performance of models trained on large-scale datasets has limited transferability to other small-size datasets, limiting the utilization of existing large-scale datasets. To address this limitation, we propose a method based on continuous and stochastic representations of Neural Stochastic Differential Equations (NSDE) for alleviating discrepancies due to data acquisition strategy. We utilize the benefits of continuous representation for handling arbitrary time steps and the use of stochastic representation for handling detector/tracker errors. Additionally, we propose a dataset-specific diffusion network and its training framework to handle dataset-specific detection/tracking errors. The effectiveness of our method is validated against state-of-the-art trajectory prediction models on the popular benchmark datasets: nuScenes, Argoverse, Lyft, INTERACTION, and Waymo Open Motion Dataset (WOMD). Improvement in performance gain on various source and target dataset configurations shows the generalized competence of our approach in addressing cross-dataset discrepancies.



Paperid:1126
Authors:Sen Pei, Shixiong Xu, Xiaojie Jin
ByteDance Inc., Institute of Automation, Chinese Academy of Sciences, ByteDance Inc.
Abstract:
Video highlights detection (VHD) is an active research field in computer vision, aiming to locate the most user-appealing clips given raw video inputs. However, most VHD methods are based on the closed-world assumption, i.e., a fixed number of highlight categories is defined in advance and all training data are available beforehand. Consequently, existing methods have poor scalability with respect to increasing highlight domains and training data. To address the above issues, we propose a novel video highlights detection method named Global Prototype Encoding (GPE) that learns incrementally to adapt to new domains via parameterized prototypes. To facilitate this new research direction, we collect a finely annotated dataset termed LiveFood, including over 5,100 live gourmet videos that span four domains: ingredients, cooking, presentation, and eating. To the best of our knowledge, this is the first work to explore video highlights detection in the incremental learning setting, opening up new ground for applying VHD in practical scenarios where both the concerned highlight domains and the training data increase over time. We demonstrate the effectiveness of GPE through extensive experiments. Notably, GPE surpasses popular domain incremental learning methods on LiveFood, achieving significant mAP improvements on all domains. On classic datasets, GPE also yields performance comparable to previous methods. The code is available at: https://github.com/ForeverPs/IncrementalVHD_GPE.



Paperid:1127
Authors:Rajalaxmi Rajagopalan, Yu-Lin Wei, Romit Roy Choudhury
University of Illinois at Urbana-Champaign, University of Illinois at Urbana-Champaign, University of Illinois at Urbana-Champaign
Abstract:
We consider the problem of personalizing audio to maximize user experience. Briefly, we aim to find a filter h*, which, applied to any music or speech, will maximize the user’s satisfaction. This is a black-box optimization problem since the user’s satisfaction function is unknown. Substantive work has been done on this topic, where the key idea is to play audio samples to the user, each shaped by a different filter hi, and query the user for their satisfaction scores f(hi). A family of “surrogate” functions is then designed to fit these scores, and the optimization method gradually refines these functions to arrive at the filter ĥ* that maximizes satisfaction. In certain applications, we observe that a second type of querying is possible where users can tell us the individual elements h*[j] of the optimal filter h*. Consider an analogy from cooking where the goal is to cook a recipe that maximizes user satisfaction. A user can be asked to score various cooked recipes (e.g., tofu fried rice) or to score individual ingredients (say, salt, sugar, rice, chicken, etc.). Given a budget of B queries, where a query can be of either type, our goal is to find the recipe that will maximize this user’s satisfaction. Our proposal builds on Sparse Gaussian Process Regression (GPR) and shows how a hybrid approach can outperform any one type of querying. Our results are validated through simulations and real-world experiments, where volunteers gave feedback on music/speech audio and were able to achieve high satisfaction levels. We believe this idea of hybrid querying opens new problems in black-box optimization, and the solutions can benefit other applications beyond audio personalization.



Paperid:1128
Authors:Idan Toker, David Sarne, Jonathan Schler
Bar Ilan University, Bar-Ilan University, Holon Institute of Technology (HIT)
Abstract:
This paper focuses on the inherent anchoring bias present in the sequential annotation of review-sentiment corpora. It proposes employing a limited subset of meticulously chosen reviews at the outset of the process, as a means of calibration, effectively mitigating the phenomenon. Through extensive experimentation, we validate the phenomenon of sentiment bias in the annotation process and show that its magnitude can be influenced by pre-calibration. Furthermore, we show that the choice of the calibration set matters, hence the need for effective guidelines for choosing the reviews to be included in it. A comparison of annotators' performance under the proposed calibration to annotation processes that do not use calibration or use a randomly picked calibration set reveals that the chosen calibration set is indeed highly effective---it substantially reduces the average absolute error compared to the other cases. Furthermore, the proposed selection guidelines are found to be highly robust in picking an effective calibration set also for domains different from the one based on which these guidelines were extracted.



Paperid:1129
Authors:Binglu Wang, Chenxi Guo, Yang Jin, Haisheng Xia, Nian Liu
Xi'an University of Architecture and Technology Beijing Institute of Technology, Xi'an University of Architecture and Technology, Xi'an University of Architecture and Technology, University of Science and Technology of China, Mohamed bin Zayed University of Artificial Intelligence
Abstract:
Gaze object prediction (GOP) aims to predict the location and category of the object that is watched by a human. Previous gaze object prediction works use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors can predict more accurate object locations for dense objects in retail scenarios. Moreover, the long-distance modeling capability of the Transformer can help to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces Transformers into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to detect the location of objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism to let the queries of the gaze autoencoder learn global-memory position knowledge from the object detector. Finally, to make the whole framework end-to-end trainable, we propose a Gaze Box loss to jointly optimize the object detector and gaze regressor by enhancing the gaze heatmap energy in the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that our TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at https://github.com/chenxi-Guo/TransGOP.git.



Paperid:1130
Authors:Miaohui Wang, Rong Zhang, Lirong Huang, Yanshan Li
Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, China, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, China, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, China, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, China
Abstract:
Composite images (CIs) typically combine various elements from different scenes, views, and styles, making them a very important information carrier in the era of mixed media such as virtual reality, mixed reality, the metaverse, etc. However, the complexity of CI content presents a significant challenge for subsequent visual perception modeling and compression. In addition, the lack of benchmark CI databases also hinders the use of recent advanced data-driven methods. To address these challenges, we first establish one of the earliest visual redundancy prediction (VRP) databases for CIs. Moreover, we propose a multi-visual effect (MVE)-driven incremental learning method that combines the strengths of hand-crafted and data-driven approaches to achieve more accurate VRP modeling. Specifically, we design special incremental rules to learn the visual knowledge flow of MVE. To effectively capture the associated features of MVE, we further develop a three-stage incremental learning approach for VRP based on an encoder-decoder network. Extensive experimental results validate the superiority of the proposed method in terms of subjective, objective, and compression experiments.



Paperid:1131
Authors:Tongxin Wang, Mang Ye
Wuhan University, Wuhan University
Abstract:
Fashion image editing aims to edit an input image to obtain richer or distinct visual clothing matching effects. Existing global fashion image editing methods struggle to achieve rich outfit combination effects, while local fashion image editing is more in line with the needs of diverse and personalized outfit matching. Local editing techniques typically depend on text and auxiliary modalities (e.g., human poses, human keypoints, garment sketches, etc.) for image manipulation, where the auxiliary modalities essentially assist in locating the editing region. Since these auxiliary modalities usually involve additional effort in practical application scenarios, text-driven fashion image editing offers high flexibility. In this paper, we propose TexFit, a Text-driven Fashion image Editing method using diffusion models, which performs local image editing with only easily accessible text. Our approach employs a text-based editing region location module to predict the precise editing region in the fashion image. Then, we take the predicted region as the generation condition of the diffusion models together with the text prompt to achieve precise local editing of fashion images while keeping the rest of the image intact. In addition, previous fashion datasets usually focus on global descriptions, lacking the local descriptive information that can guide precise local editing. Therefore, we develop a new DFMM-Spotlight dataset by using region extraction and attribute combination strategies. It focuses locally on clothes and accessories, enabling local editing with text input. Experimental results on the DFMM-Spotlight dataset demonstrate the effectiveness of our model. Code and Datasets are available at https://texfit.github.io/.



Paperid:1132
Authors:Devin White, Mingkang Wu, Ellen Novoseller, Vernon J. Lawhern, Nicholas Waytowich, Yongcan Cao
University of Texas, San Antonio, University of Texas, San Antonio, DEVCOM Army Research Laboratory, DEVCOM Army Research Laboratory, DEVCOM Army Research Laboratory, University of Texas, San Antonio
Abstract:
This paper develops a novel rating-based reinforcement learning approach that uses human ratings to obtain human guidance in reinforcement learning. Different from the existing preference-based and ranking-based reinforcement learning paradigms, which are based on human relative preferences over sample pairs, the proposed rating-based reinforcement learning approach is based on human evaluation of individual trajectories without relative comparisons between sample pairs. The rating-based reinforcement learning approach builds on a new prediction model for human ratings and a novel multi-class loss function. We conduct several experimental studies based on synthetic ratings and real human ratings to evaluate the effectiveness and benefits of the new rating-based reinforcement learning approach.



Paperid:1133
Authors:Di Wu, Wu Sun, Yi He, Zhong Chen, Xin Luo
College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China College of Computer and Information Science, Southwest University, Chongqing 400715, China, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China, Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA, School of Computing, Southern Illinois University, Carbondale, IL 62901, USA, College of Computer and Information Science, Southwest University, Chongqing 400715, China
Abstract:
Taking multiple incompatible drugs together may cause adverse interactions and side effects on the body. Accurate prediction of drug-drug interaction (DDI) events is essential for avoiding this issue. Recently, various artificial intelligence-based approaches have been proposed for predicting DDI events. However, DDI events are associated with complex relationships and mechanisms among drugs, targets, enzymes, transporters, molecular structures, etc. Existing approaches either partially or loosely consider these relationships and mechanisms by a non-end-to-end learning framework, resulting in sub-optimal feature extraction and fusion for prediction. Different from them, this paper proposes a Multimodal Knowledge Graph Fused End-to-end Neural Network (MKG-FENN) that consists of two main parts: a multimodal knowledge graph (MKG) and a fused end-to-end neural network (FENN). First, the MKG is constructed by comprehensively exploiting DDI event-associated relationships and mechanisms from four knowledge graphs of drugs-chemical entities, drug-substructures, drugs-drugs, and molecular structures. Correspondingly, a four-channel graph neural network is designed to extract high-order and semantic features from the MKG. Second, FENN designs a multi-layer perceptron to fuse the extracted features by end-to-end learning. With such designs, the feature extraction and fusion for DDI events are guaranteed to be comprehensive and optimal for prediction. Through extensive experiments on real drug datasets, we demonstrate that MKG-FENN exhibits high accuracy and significantly outperforms state-of-the-art models in predicting DDI events. The source code and supplementary file of this article are available at: https://github.com/wudi1989/MKG-FENN.



Paperid:1134
Authors:Xueyuan Yang, Chao Yao, Xiaojuan Ban
Beijing Advanced Innovation Center for Materials Genome Engineering, Beijing 100083, China. University of Science and Technology Beijing, Beijing 100083, China., Beijing Advanced Innovation Center for Materials Genome Engineering, Beijing 100083, China. University of Science and Technology Beijing, Beijing 100083, China., Beijing Advanced Innovation Center for Materials Genome Engineering, Beijing 100083, China. University of Science and Technology Beijing, Beijing 100083, China. Key Laboratory of Intelligent Bionic Unmanned Systems, Ministry of Education, Beijing 100083, China. Institute of Materials Intelligent Technology, Liaoning Academy of Materials, Shenyang 110004, China.
Abstract:
Leveraging wearable devices for motion reconstruction has emerged as an economical and viable technique. Certain methodologies employ sparse Inertial Measurement Units (IMUs) on the human body and harness data-driven strategies to model human poses. However, the reconstruction of motion based solely on sparse IMU data is inherently fraught with ambiguity, a consequence of numerous identical IMU readings corresponding to different poses. In this paper, we explore the spatial importance of sparse sensors, supervised by text that describes specific actions. Specifically, uncertainty is introduced to derive weighted features for each IMU. We also design a Hierarchical Temporal Transformer (HTT) and apply contrastive learning to achieve precise temporal and feature alignment of sensor data with textual semantics. Experimental results demonstrate that our proposed approach achieves significant improvements in multiple metrics compared to existing methods. Notably, with textual supervision, our method not only differentiates between ambiguous actions such as sitting and standing but also produces more precise and natural motion.



Paperid:1135
Authors:Wenjie Yin, Yi Yu, Hang Yin, Danica Kragic, Mårten Björkman
KTH Royal Institute of Technology, NII, University of Copenhagen, KTH Royal Institute of Technology, KTH Royal Institute of Technology
Abstract:
Current training of motion style transfer systems relies on consistency losses across style domains to preserve content, hindering its scalable application to a large number of domains and to private data. Recent image transfer works show the potential of independent training on each domain by leveraging implicit bridging between diffusion models; content preservation, however, is limited to simple data patterns. We address this by imposing biased sampling in backward diffusion while maintaining the domain independence of the training stage. We construct the bias from the source domain keyframes and apply it as the gradient of content constraints, yielding a framework with keyframe manifold constraint gradients (KMCGs). Our validation demonstrates the success of training separate models to transfer between as many as ten dance motion styles. Comprehensive experiments find a significant improvement in preserving motion content in comparison to baseline and ablative diffusion-based style transfer models. In addition, we perform a human study for a subjective assessment of the quality of generated dance motions. The results validate the competitiveness of KMCGs.



Paperid:1136
Authors:Zahra Zahedi, Sailik Sengupta, Subbarao Kambhampati
Arizona State University, AWS AI Labs Arizona State University, Arizona State University
Abstract:
In this work, we design an Artificially Intelligent Task Allocator (AITA) that proposes a task allocation for a team of humans. A key property of this allocation is that when an agent with imperfect knowledge (about their teammate's costs and/or the team's performance metric) contests the allocation with a counterfactual, a contrastive explanation can always be provided to showcase why the proposed allocation is better than the proposed counterfactual. For this, we consider a negotiation process that produces a negotiation-aware task allocation and, when contested, leverages a negotiation tree to provide a contrastive explanation. With human subject studies, we show that the proposed allocation indeed appears fair to a majority of participants and, when it does not, the explanations generated are judged as convincing and easy to comprehend.



Paperid:1137
Authors:Zhi Zhang, Shenghua Zhong, Yan Liu
Shenzhen University Hong Kong Polytechnic University, Shenzhen University, The Hong Kong Polytechnic University
Abstract:
In recent years, using Electroencephalography (EEG) to recognize emotions has garnered considerable attention. Despite advancements, limited EEG data restricts its potential. Thus, Generative Adversarial Networks (GANs) have been proposed to mimic the observed distributions and generate EEG data. However, for imbalanced datasets, GANs struggle to produce reliable augmentations for under-represented minority emotions by merely mimicking them. To address this, we introduce Emotional Subspace Constrained Generative Adversarial Networks (ESC-GAN) as an alternative to existing frameworks. We first propose the EEG editing paradigm, editing reference EEG signals from well-represented to under-represented emotional subspaces. Then, we introduce diversity-aware and boundary-aware losses to constrain the augmented subspace. Here, the diversity-aware loss encourages a diverse emotional subspace by enlarging the sample difference, while the boundary-aware loss constrains the augmented subspace near the decision boundary, where recognition models can be vulnerable. Experiments show that ESC-GAN boosts emotion recognition performance on the benchmark datasets DEAP, AMIGOS, and SEED, while protecting against potential adversarial attacks. Finally, the proposed method opens new avenues for editing EEG signals under emotional subspace constraints, facilitating unbiased and secure EEG data augmentation.



Paperid:1138
Authors:Zuozhen Zhang, Junzhong Ji, Jinduo Liu
Beijing University of Technology, Beijing University of Technology, Beijing University of Technology
Abstract:
In recent years, the discovery of brain effective connectivity (EC) networks through computational analysis of functional magnetic resonance imaging (fMRI) data has gained prominence in neuroscience and neuroimaging. However, owing to the influence of diverse factors during data collection and processing, fMRI data typically exhibit high noise and limited sample characteristics, consequently leading to suboptimal performance of current methods. In this paper, we propose a novel brain effective connectivity discovery method based on meta-reinforcement learning, called MetaRLEC. The method mainly consists of three modules: actor, critic, and meta-critic. MetaRLEC first employs an encoder-decoder framework: the encoder, utilizing a Transformer, converts noisy fMRI data into a state embedding; the decoder, employing a bidirectional LSTM, discovers brain region dependencies from the state and generates actions (EC networks). Then a critic network evaluates these actions, incentivizing the actor to learn higher-reward actions amidst the high-noise setting. Finally, a meta-critic framework facilitates online learning of historical state-action pairs, integrating an action-value neural network and supplementary training losses to enhance the model's adaptability to small-sample fMRI data. We conduct comprehensive experiments on both simulated and real-world data to demonstrate the efficacy of our proposed method.



Paperid:1139
Authors:Yun Zhong, Yiannis Demiris
Personal Robotics Lab, Dept. of Electrical and Electronic Engineering Imperial College London, Personal Robotics Lab, Dept. of Electrical and Electronic Engineering Imperial College London
Abstract:
Dance is generally considered to be complex for most people as it requires coordination of numerous body motions and accurate responses to the musical content and rhythm. Studies on automatic dance performance assessment could help people improve their sensorimotor skills and promote research in many fields, including human motion analysis and motion generation. Recent papers on dance performance assessment usually evaluate simple dance motions with a single task estimating final performance scores. In this paper, we propose DanceMVP: multi-task dance performance assessment via text prompting that solves three related tasks - (i) dance vocabulary recognition, (ii) dance performance scoring and (iii) dance rhythm evaluation. In the pre-training phase, we contrastively learn the primitive-based features of complex dance motion and music using the InfoNCE loss. For the downstream task, we propose a transformer-based text prompter to perform multi-task evaluations for the three proposed assessment tasks. Also, we build a multimodal dance-music dataset named ImperialDance. The novelty of our ImperialDance is that it contains dance motions for diverse expertise levels and a significant amount of repeating dance sequences for the same choreography to keep track of the dance performance progression. Qualitative results show that our pre-trained feature representation could cluster dance pieces for different dance genres, choreographies, expertise levels and primitives, which generalizes well on both ours and other dance-music datasets. The downstream experiments demonstrate the robustness and improvement of our method over several ablations and baselines across all three tasks, as well as monitoring the users' dance level progression.
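Since the pre-training phase hinges on the InfoNCE loss over paired motion and music features, a standard symmetric InfoNCE implementation is sketched below; the batching convention, temperature, and embedding dimensions are assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def info_nce(motion_emb: torch.Tensor, music_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired motion/music embeddings.

    Row i of each tensor is assumed to come from the same dance segment;
    every other row in the batch serves as a negative.
    """
    motion = F.normalize(motion_emb, dim=-1)
    music = F.normalize(music_emb, dim=-1)
    logits = motion @ music.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_motion_to_music = F.cross_entropy(logits, targets)
    loss_music_to_motion = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_motion_to_music + loss_music_to_motion)

# Toy usage with random 128-d embeddings for a batch of 16 segments.
print(float(info_nce(torch.randn(16, 128), torch.randn(16, 128))))
```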



Paperid:1140
Authors:Aakash, Indranil Saha
IIT Kanpur, IIT Kanpur
Abstract:
We study a variant of the multi-robot goal assignment problem where a unique goal must be assigned to each robot while minimizing the largest movement cost among the robots, called the makespan. A significant step in solving this problem is to find the cost associated with each robot-goal pair, which requires solving a complex path planning problem. We present OM, a scalable optimal algorithm that solves the multi-robot goal assignment problem by computing the paths for significantly fewer robot-goal pairs than the state-of-the-art algorithms, leading to a computationally superior mechanism for solving the problem. We extensively evaluate our algorithm for hundreds of robots on randomly generated and standard workspaces. Our experimental results demonstrate that the proposed algorithm achieves a noticeable speedup over two state-of-the-art baseline algorithms.
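For context on the objective, the makespan-minimizing assignment itself (once all robot-goal costs are known) is the classic bottleneck assignment problem, which can be solved by searching over cost thresholds and checking for a perfect matching. The sketch below shows that baseline formulation only; it is not the paper's OM algorithm, which specifically avoids computing most of these costs.

```python
import networkx as nx

def min_makespan_assignment(cost):
    """Bottleneck assignment: a robot-to-goal matching minimizing the largest cost.

    `cost[i][j]` is robot i's path cost to goal j. Candidate thresholds are
    tried in increasing order; the first one admitting a perfect matching
    gives the optimal makespan.
    """
    n = len(cost)
    robots = [f"r{i}" for i in range(n)]
    for t in sorted({c for row in cost for c in row}):
        g = nx.Graph()
        g.add_nodes_from(robots)
        g.add_nodes_from(f"g{j}" for j in range(n))
        g.add_edges_from((f"r{i}", f"g{j}")
                         for i in range(n) for j in range(n) if cost[i][j] <= t)
        matching = nx.bipartite.hopcroft_karp_matching(g, top_nodes=robots)
        if sum(1 for k in matching if k.startswith("r")) == n:
            return t, {r: matching[r] for r in robots}

# Toy usage: the optimal makespan is 6 (goal g2 costs at least 6 for every robot).
print(min_makespan_assignment([[4, 2, 8], [4, 3, 7], [3, 1, 6]]))
```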



Paperid:1141
Authors:Marcus Gozon, Jingjin Yu
University of Michigan, Rutgers University
Abstract:
In the 15-puzzle game, 15 labeled square tiles are reconfigured on a 4 × 4 board through an escort, wherein at each (time) step, a single tile neighboring it may slide into it, leaving the space previously occupied by the tile as the new escort. We study a generalized sliding-tile puzzle (GSTP) in which (1) there are one or more escorts and (2) multiple tiles can move synchronously in a single time step. Compared with popular discrete multi-agent/robot motion models, GSTP provides a more accurate model for a broad array of high-utility applications, including warehouse automation and autonomous garage parking, but is less studied due to the more involved tile interactions. In this work, we analyze optimal GSTP solution structures, establishing that computing makespan-optimal solutions for GSTP is NP-complete and developing polynomial-time algorithms yielding makespans approximating the minimum with expected/high-probability constant factors, assuming randomized start and goal configurations.



Paperid:1142
Authors:Weiwei Gu, Anant Sah, Nakul Gopalan
Arizona State University, Arizona State University, Arizona State University
Abstract:
We present a demonstrable framework for robots to learn novel visual concepts and visual tasks via in-situ linguistic interactions with human users. Previous approaches in computer vision have either used large pre-trained visual models to infer novel objects zero-shot, or added novel concepts along with their attributes and representations to a concept hierarchy. We extend the approaches that focus on learning visual concept hierarchies and take this ability one step further to demonstrate novel task solving on robots along with the learned visual concepts. To enable a visual concept learner to solve robotics tasks one-shot, we developed two distinct techniques. Firstly, we propose a novel approach, Hi-Viscont (HIerarchical VISual CONcept learner for Task), which augments information of a novel concept that is being taught to its parent nodes within a concept hierarchy. This information propagation allows all concepts in a hierarchy to update as novel concepts are taught in a continual learning setting. Secondly, we represent a visual task as a scene graph with language annotations, allowing us to create novel permutations of a demonstrated task zero-shot in-situ. Combining the two techniques, we present a demonstration on a real robot that learns visual tasks and concepts one-shot from in-situ interactions with human users and generalizes to perform a novel visual task of the same type zero-shot. As shown by the studies in the main conference paper, our system achieves a success rate of 50% on solving the whole task correctly with generalization, where the baseline performs at 14% without any ability to generalize to novel tasks and concepts. We will demonstrate our working interactive learning pipeline at AAAI 2024 in person with our robot and other required hardware.



Paperid:1143
Authors:Jinglue Hang, Xiangbo Lin, Tianqiang Zhu, Xuanheng Li, Rina Wu, Xiaohong Ma, Yi Sun
School of Information and Communication Engineering, Dalian University of Technology, China, School of Information and Communication Engineering, Dalian University of Technology, China, School of Information and Communication Engineering, Dalian University of Technology, China, School of Information and Communication Engineering, Dalian University of Technology, China, School of Information and Communication Engineering, Dalian University of Technology, China, School of Information and Communication Engineering, Dalian University of Technology, China, School of Information and Communication Engineering, Dalian University of Technology, China
Abstract:
A robot grasp dataset is the basis for designing a robot's grasp generation model. Compared with building grasp datasets for low-DOF grippers, it is harder for high-DOF dexterous robot hands. Most current datasets meet the needs of generating stable grasps, but they are not suitable for dexterous hands to complete human-like functional grasps, such as grasping the handle of a cup or pressing the button of a flashlight, so as to enable robots to complete subsequent functional manipulation actions autonomously, and there is no dataset with functional grasp pose annotations at present. This paper develops a unique Cost-Effective Real-Simulation Annotation System by leveraging the natural hand's actions. The system is able to capture a functional grasp of a dexterous hand in a simulated environment assisted by human demonstration in the real world. By using this system, dexterous grasp data can be collected efficiently as well as cost-effectively. Finally, we construct the first dexterous functional grasp dataset with rich pose annotations. A Functional Grasp Synthesis Model is also provided to validate the effectiveness of the proposed system and dataset. Our project page is: https://hjlllll.github.io/DFG/.



Paperid:1144
Authors:Dohyun Kim, Nayoung Oh, Deokmin Hwang, Daehyung Park
Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology
Abstract:
We aim to solve the problem of spatially localizing composite instructions referring to space: space grounding. Compared to current instance grounding, space grounding is challenging due to the ill-posedness of identifying locations referred to by discrete expressions and the compositional ambiguity of referring expressions. Therefore, we propose a novel probabilistic space-grounding methodology (LINGO-Space) that accurately identifies a probabilistic distribution of the space being referred to and incrementally updates it, given subsequent referring expressions, leveraging configurable polar distributions. Our evaluations show that the estimation using polar distributions enables a robot to ground locations successfully through 20 table-top manipulation benchmark tests. We also show that updating the distribution helps the grounding method accurately narrow the referring space. We finally demonstrate the robustness of the space grounding with simulated manipulation and real quadruped robot navigation tasks. Code and videos are available at https://lingo-space.github.io.
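
One way to picture configurable polar distributions and their incremental update (our own illustrative sketch with an assumed parametrization, not the LINGO-Space model) is to score candidate positions with a von Mises factor over bearing and a Gaussian factor over distance, multiplying in one factor per referring expression:

    import numpy as np
    from scipy.stats import vonmises, norm

    def polar_score(points, anchor, mean_angle, kappa, mean_dist, sigma):
        """Score candidate 2D points by a polar distribution around `anchor`:
        von Mises over bearing, Gaussian over distance."""
        rel = points - anchor
        angles = np.arctan2(rel[:, 1], rel[:, 0])
        dists = np.linalg.norm(rel, axis=1)
        return vonmises.pdf(angles, kappa, loc=mean_angle) * norm.pdf(dists, mean_dist, sigma)

    # Incremental update: multiply in one factor as each referring expression arrives.
    grid = np.stack(np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50)), -1).reshape(-1, 2)
    belief = np.ones(len(grid))
    for anchor, ang, dist in [((0.0, 0.0), np.pi / 2, 0.5), ((0.5, 0.5), np.pi, 0.3)]:
        belief *= polar_score(grid, np.array(anchor), ang, kappa=4.0, mean_dist=dist, sigma=0.1)
    belief /= belief.sum()                      # posterior over candidate locations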



Paperid:1145
Authors:Xiaohan Li, Dong Liu, Jun Wu
Institute of Advanced Technology, University of Science and Technology of China, Institute of Advanced Technology, University of Science and Technology of China, Fudan University
Abstract:
The demand for 4D (3D+time) SLAM systems is increasingly urgent, especially for decision-making and scene understanding. However, most of the existing simultaneous localization and mapping (SLAM) systems primarily assume static environments. They fail to represent dynamic scenarios due to the challenge of establishing robust long-term spatiotemporal associations in dynamic object tracking. We address this limitation and propose CTO-SLAM, a monocular and RGB-D object-level 4D SLAM system to track moving objects and estimate their motion simultaneously. In this paper, we propose contour tracking, which introduces contour features to enhance the keypoint representation of dynamic objects and is coupled with pixel tracking to achieve long-term robust object tracking. Based on contour tracking, we propose a novel sampling-based object pose initialization algorithm and the following adapted bundle adjustment (BA) optimization algorithm to estimate dynamic object poses with high accuracy. The CTO-SLAM system is verified on both KITTI and VKITTI datasets. The experimental results demonstrate that our system effectively addresses cumulative errors in long-term spatiotemporal association and hence obtains substantial improvements over the state-of-the-art systems. The source code is available at https://github.com/realXiaohan/CTO-SLAM.



Paperid:1146
Authors:Haicheng Liao, Zhenning Li, Huanming Shen, Wenxuan Zeng, Dongping Liao, Guofa Li, Chengzhong Xu
University of Macau, University of Macau, University of Electronic Science and Technology of China, Peking University, University of Macau, Chongqing University, University of Macau
Abstract:
The ability to accurately predict the trajectory of surrounding vehicles is a critical hurdle to overcome on the journey to fully autonomous vehicles. To address this challenge, we pioneer a novel behavior-aware trajectory prediction model (BAT) that incorporates insights and findings from traffic psychology, human behavior, and decision-making. Our model consists of behavior-aware, interaction-aware, priority-aware, and position-aware modules that perceive and understand the underlying interactions and account for uncertainty and variability in prediction, enabling higher-level learning and flexibility without rigid categorization of driving behavior. Importantly, this approach eliminates the need for manual labeling in the training process and addresses the challenges of non-continuous behavior labeling and the selection of appropriate time windows. We evaluate BAT's performance across the Next Generation Simulation (NGSIM), Highway Drone (HighD), Roundabout Drone (RounD), and Macao Connected Autonomous Driving (MoCAD) datasets, showcasing its superiority over prevailing state-of-the-art (SOTA) benchmarks in terms of prediction accuracy and efficiency. Remarkably, even when trained on reduced portions of the training data (25%), our model outperforms most of the baselines, demonstrating its robustness and efficiency in predicting vehicle trajectories, and the potential to reduce the amount of data required to train autonomous vehicles, especially in corner cases. In conclusion, the behavior-aware model represents a significant advancement in the development of autonomous vehicles capable of predicting trajectories with the same level of proficiency as human drivers. The project page is available on our GitHub.



Paperid:1147
Authors:Feng Lu, Shuting Dong, Lijun Zhang, Bingxi Liu, Xiangyuan Lan, Dongmei Jiang, Chun Yuan
Tsinghua Shenzhen International Graduate School, Tsinghua University Peng Cheng Laboratory, Tsinghua Shenzhen International Graduate School, Tsinghua University Peng Cheng Laboratory, Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Peng Cheng Laboratory Southern University of Science and Technology, Peng Cheng Laboratory, Peng Cheng Laboratory, Tsinghua Shenzhen International Graduate School, Tsinghua University Peng Cheng Laboratory
Abstract:
Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, the hierarchical VPR methods have received considerable attention due to the tradeoff between accuracy and efficiency. They usually first use global features to retrieve the candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting homography, which is time-consuming and non-differentiable. This makes existing methods compromise to train the network only in global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels, which can also be jointly trained with the backbone network to help it extract the features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method can outperform several state-of-the-art methods. And it is more than one order of magnitude faster than the mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
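
To make the re-projection error of inliers loss concrete, here is an illustrative sketch under our own assumptions (threshold value, hard inlier selection), not the released DHE-VPR code: matched keypoints are warped by the predicted homography and only correspondences within an inlier threshold are penalized.

    import torch

    def reprojection_inlier_loss(H, pts_src, pts_dst, inlier_thresh=3.0):
        """H: (3, 3) predicted homography; pts_src, pts_dst: (N, 2) matched keypoints."""
        ones = torch.ones(pts_src.shape[0], 1, device=pts_src.device)
        src_h = torch.cat([pts_src, ones], dim=1)          # homogeneous coordinates
        proj = src_h @ H.t()
        proj = proj[:, :2] / proj[:, 2:].clamp(min=1e-8)   # back to Cartesian coordinates
        err = (proj - pts_dst).norm(dim=1)                 # per-match re-projection error
        inliers = err < inlier_thresh                      # hard inlier selection (a design choice)
        return err[inliers].mean() if inliers.any() else err.mean()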



Paperid:1148
Authors:Karan Mirakhor, Sourav Ghosh, Dipanjan Das, Brojeshwar Bhowmick
TCS Research, Kolkata, India, TCS Research, Kolkata, India, TCS Research, Kolkata, India, TCS Research, Kolkata, India
Abstract:
Object rearrangement in a multi-room setup should produce a reasonable plan that reduces the agent's overall travel and the number of steps. Recent state-of-the-art methods fail to produce such plans because they rely on explicit exploration for discovering unseen objects due to partial observability and a heuristic planner to sequence the actions for rearrangement. This paper proposes a novel task planner to efficiently plan a sequence of actions to discover unseen objects and rearrange misplaced objects within an untidy house to achieve a desired tidy state. The proposed method introduces several innovative techniques, including (i) a method for discovering unseen objects using commonsense knowledge from large language models, (ii) a collision resolution and buffer prediction method based on the Cross-Entropy Method to handle blocked goal and swap cases, (iii) a directed spatial graph-based state space for scalability, and (iv) deep reinforcement learning (RL) for producing an efficient plan to simultaneously discover unseen objects and rearrange the visible misplaced ones to minimize the overall traversal. The paper also presents new metrics and a benchmark dataset called MoPOR to evaluate the effectiveness of rearrangement planning in a multi-room setting. The experimental results demonstrate that the proposed method effectively addresses the multi-room rearrangement problem.



Paperid:1149
Authors:Naman Shah, Siddharth Srivastava
Arizona State University, Arizona State University
Abstract:
This paper addresses the problem of inventing and using hierarchical representations for stochastic robot planning problems. Rather than using hand-coded state or action representations as input, it presents new methods for learning how to create a high-level action representation for long-horizon, sparse-reward robot planning problems in stochastic settings with unknown dynamics. After training, this system yields a robot-specific but environment-independent planning system. Given new problem instances in unseen stochastic environments, it first creates zero-shot options (without any experience on the new environment) with dense pseudo-rewards and then uses them to solve the input problem in a hierarchical planning and refinement process. Theoretical results identify sufficient conditions for completeness of the presented approach. Extensive empirical analysis shows that even in settings that go beyond these sufficient conditions, this approach convincingly outperforms baselines by 2x in terms of solution time with orders of magnitude improvement in solution quality.



Paperid:1150
Authors:Junru Song, Yang Yang, Wei Peng, Weien Zhou, Feifei Wang, Wen Yao
Institute of Statistics and Big Data, Renmin University of China, National Innovation Institute of Defense Technology, Chinese Academy of Military Science Intelligent Game and Decision Laboratory, Chinese Academy of Military Science, National Innovation Institute of Defense Technology, Chinese Academy of Military Science Intelligent Game and Decision Laboratory, Chinese Academy of Military Science, National Innovation Institute of Defense Technology, Chinese Academy of Military Science Intelligent Game and Decision Laboratory, Chinese Academy of Military Science, Center for Applied Statistics, Renmin University of China School of Statistics, Renmin University of China, National Innovation Institute of Defense Technology, Chinese Academy of Military Science Intelligent Game and Decision Laboratory, Chinese Academy of Military Science
Abstract:
Soft robot design is an intricate field with unique challenges due to its complex and vast search space. In the past literature, evolutionary computation algorithms, including novel probabilistic generative models (PGMs), have shown potential in this realm. However, these methods are sample inefficient and predominantly focus on rigid robots in locomotion tasks, which limit their performance and application in robot design automation. In this work, we propose MorphVAE, an innovative PGM that incorporates a multitask training scheme and a meticulously crafted sampling technique termed ``continuous natural selection'', aimed at bolstering sample efficiency. This method empowers us to gain insights from assessed samples across diverse tasks and temporal evolutionary stages, while simultaneously maintaining a delicate balance between optimization efficiency and biodiversity. Through extensive experiments in various locomotion and manipulation tasks, we substantiate the efficiency of MorphVAE in generating high-performing and diverse designs, surpassing the performance of competitive baselines.



Paperid:1151
Authors:Sijie Wang, Rui She, Qiyu Kang, Xingchao Jian, Kai Zhao, Yang Song, Wee Peng Tay
Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University
Abstract:
The utilization of multimodal sensor data in visual place recognition (VPR) has demonstrated enhanced performance compared to single-modal counterparts. Nonetheless, integrating additional sensors comes with elevated costs and may not be feasible for systems that demand lightweight operation, thereby impacting the practical deployment of VPR. To address this issue, we resort to knowledge distillation, which empowers single-modal students to learn from cross-modal teachers without introducing additional sensors during inference. Despite the notable advancements achieved by current distillation approaches, the exploration of feature relationships remains an under-explored area. In order to tackle the challenge of cross-modal distillation in VPR, we present DistilVPR, a novel distillation pipeline for VPR. We propose leveraging feature relationships from multiple agents, including self-agents and cross-agents for teacher and student neural networks. Furthermore, we integrate various manifolds, characterized by different space curvatures for exploring feature relationships. This approach enhances the diversity of feature relationships, including Euclidean, spherical, and hyperbolic relationship modules, thereby enhancing the overall representational capacity. The experiments demonstrate that our proposed pipeline achieves state-of-the-art performance compared to other distillation baselines. We also conduct necessary ablation studies to show design effectiveness. The code is released at: https://github.com/sijieaaa/DistilVPR
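
A minimal sketch of relation-based distillation under two of the mentioned geometries (Euclidean and spherical/cosine; the hyperbolic module is omitted, and nothing here is taken from the DistilVPR code):

    import torch
    import torch.nn.functional as F

    def relational_distill_loss(student_feats, teacher_feats):
        """Match pairwise feature relations of the student to the teacher's.
        student_feats, teacher_feats: (batch, dim)."""
        def euclidean_rel(f):
            return torch.cdist(f, f)                       # (batch, batch) pairwise distances
        def spherical_rel(f):
            f = F.normalize(f, dim=-1)
            return f @ f.t()                               # pairwise cosine similarities
        loss = F.mse_loss(euclidean_rel(student_feats), euclidean_rel(teacher_feats))
        loss = loss + F.mse_loss(spherical_rel(student_feats), spherical_rel(teacher_feats))
        return loss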



Paperid:1152
Authors:Yuxin Wang, Zunlei Feng, Haofei Zhang, Yang Gao, Jie Lei, Li Sun, Mingli Song
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University of Technology, Ningbo Innovation Center, Zhejiang University, Zhejiang University
Abstract:
Due to the inability to receive signals from the Global Navigation Satellite System (GNSS) in extreme conditions, achieving accurate and robust navigation for Unmanned Aerial Vehicles (UAVs) is a challenging task. Recently, vision-based navigation has emerged as a promising and feasible alternative to GNSS-based navigation. However, existing vision-based techniques are inadequate in addressing flight deviation caused by environmental disturbances and inaccurate position predictions in practical settings. In this paper, we present a novel angle robustness navigation paradigm to deal with flight deviation in point-to-point navigation tasks. Additionally, we propose a model that includes the Adaptive Feature Enhance Module, Cross-knowledge Attention-guided Module and Robust Task-oriented Head Module to accurately predict direction angles for high-precision navigation. To evaluate the vision-based navigation methods, we collect a new dataset termed UAV_AR368. Furthermore, we design the Simulation Flight Testing Instrument (SFTI) using Google Earth to simulate different flight environments, thereby reducing the expenses associated with real flight testing. Experiment results demonstrate that the proposed model outperforms the state-of-the-art by achieving improvements of 26.0% and 45.6% in the success rate of arrival under ideal and disturbed circumstances, respectively.



Paperid:1153
Authors:Yantian Zha, Lin Guan, Subbarao Kambhampati
Arizona State University, Arizona State University, Arizona State University
Abstract:
Our work aims at efficiently leveraging ambiguous demonstrations for the training of a reinforcement learning (RL) agent. An ambiguous demonstration can usually be interpreted in multiple ways, which severely hinders the RL agent from learning stably and efficiently. Since an optimal demonstration may also suffer from being ambiguous, previous works that combine RL and learning from demonstration (RLfD works) may not work well. Inspired by how humans handle such situations, we propose to use self-explanation (an agent generates explanations for itself) to recognize valuable high-level relational features as an interpretation of why a successful trajectory is successful. This way, the agent can leverage the explained important relations as guidance for its RL learning. Our main contribution is to propose the Self-Explanation for RL from Demonstrations (SERLfD) framework, which can overcome the limitations of existing RLfD works. Our experimental results show that an RLfD model can be improved by using our SERLfD framework in terms of training stability and performance. To foster further research in self-explanation-guided robot learning, we have made our demonstrations and code publicly accessible at https://github.com/YantianZha/SERLfD. For a deeper understanding of our work, interested readers can refer to our arXiv version at https://arxiv.org/pdf/2110.05286.pdf, including an accompanying appendix.



Paperid:1154
Authors:Tongzhou Zhang, Gang Wang, Yu Chen, Hai Zhang, Jue Hu
College of Computer Science and Technology, Jilin University, College of Computer Science and Technology, Jilin University College of Software, Jilin University State Key Laboratory of Automotive Simulation and Control, Jilin University Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, College of Software, Jilin University, National Key Laboratory of Science and Technology on Advanced Composites in Special Environments, Harbin Institute of Technology, National Key Laboratory of Science and Technology on Advanced Composites in Special Environments, Harbin Institute of Technology
Abstract:
Global localization is a challenging task for intelligent robots, as its accuracy directly contributes to the performance of downstream navigation and planning tasks. However, the existing literature focuses more on place retrieval and the success rate of localization, with limited attention given to the metrics of position estimation. In this paper, a single-shot global LiDAR localization method is proposed with the ultimate goal of achieving high position accuracy, inspired by the positioning approach of multi-constellation localization systems. Initially, we perform coarse localization using global descriptors and select observation points along with their corresponding coordinates based on the obtained coarse localization results. Coordinates can be acquired from a pre-built map, GNSS, or other devices. Then, a lightweight LiDAR odometry method is designed to estimate the distance between the retrieved data and the observation points. Ultimately, the localization problem is transformed into an optimization problem of solving a system of multiple sphere equations. The experimental results on the KITTI dataset and the self-collected dataset demonstrate that our method achieves an average localization error (including errors in the z-axis) of 0.89 meters. In addition, it achieves a retrieval efficiency of 0.357 s per frame on the former dataset and 0.214 s per frame on the latter one. Code and data are available at https://github.com/jlurobot/multi-constellation-localization.
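
The final optimization step amounts to multilateration. A standard linearized least-squares sketch for a system of sphere equations (illustrative only, not the released code):

    import numpy as np

    def multilaterate(points, dists):
        """Solve ||x - p_i||^2 = d_i^2 for x by subtracting the first sphere equation,
        which yields a linear least-squares problem. points: (k, 3), dists: (k,), k >= 4."""
        p0, d0 = points[0], dists[0]
        A = 2.0 * (points[1:] - p0)
        b = (np.sum(points[1:] ** 2, axis=1) - np.dot(p0, p0)
             - dists[1:] ** 2 + d0 ** 2)
        x, *_ = np.linalg.lstsq(A, b, rcond=None)
        return x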



Paperid:1155
Authors:Xiaze Zhang, Ziheng Ding, Qi Jing, Yuejie Zhang, Wenchao Ding, Rui Feng
School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Academy for Engineering and Technology, Fudan University, School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University Academy for Engineering and Technology, Fudan University Shanghai Collaborative Innovation Center of Intelligent Visual Computing
Abstract:
Point clouds have shown significant potential in various domains, including Simultaneous Localization and Mapping (SLAM). However, existing approaches either rely on dense point clouds to achieve high localization accuracy or use generalized descriptors to reduce map size. Unfortunately, these two aspects seem to conflict with each other. To address this limitation, we propose a unified architecture, DeepPointMap, achieving excellent performance in both aspects. We utilize a neural network to extract highly representative and sparse neural descriptors from point clouds, enabling memory-efficient map representation and accurate multi-scale localization tasks (e.g., odometry and loop-closure). Moreover, we showcase the versatility of our framework by extending it to more challenging multi-agent collaborative SLAM. The promising results obtained in these scenarios further emphasize the effectiveness and potential of our approach.



Paperid:1156
Authors:Gianvincenzo Alfano, Sergio Greco, Francesco Parisi, Irina Trubitsyna
University of Calabria, University of Calabria, University of Calabria, University of Calabria
Abstract:
Dung's Argumentation Framework (AF) has been extended in several directions. Among the numerous proposed extensions, three of them seem to be of particular interest and have correlations between them. These extensions are: constrained AF (CAF), where AF is augmented with (strong) constraints; epistemic AF (EAF), where AF is augmented with epistemic constraints; and incomplete AF (iAF), where arguments and attacks can be uncertain. While the complexity and expressiveness of CAF and iAF have been studied, that of EAF has not been explored so far. In this paper we investigate the complexity and expressivity of EAF. To this end, we first introduce the Labeled CAF (LCAF), a variation of CAF where constraints are defined over the alphabet of labeled arguments. Then, we investigate the complexity of credulous and skeptical reasoning and show that: i) EAF is more expressive than iAF (under preferred semantics), ii) although LCAF is a restriction of EAF where modal operators are not allowed, these frameworks have the same complexity, iii) the results for LCAF close a gap in the characterization of the complexity of CAF. Interestingly, even though EAF has the same complexity as LCAF, it allows modeling domain knowledge in a more natural and easy-to-understand way.



Paperid:1157
Authors:Abu Mohammad Hammad Ali, Boting Yang, Sandra Zilles
University of Regina, University of Regina, University of Regina, Canada
Abstract:
This paper studies the design and analysis of approximation algorithms for aggregating preferences over combinatorial domains, represented using Conditional Preference Networks (CP-nets). Its focus is on aggregating preferences over so-called swaps, for which optimal solutions in general are already known to be of exponential size. We first analyze a trivial 2-approximation algorithm that simply outputs the best of the given input preferences, and establish a structural condition under which the approximation ratio of this algorithm is improved to 4/3. We then propose a polynomial-time approximation algorithm whose outputs are provably no worse than those of the trivial algorithm, but often substantially better. A family of problem instances is presented for which our improved algorithm produces optimal solutions, while, for any ε > 0, the trivial algorithm cannot attain a (2-ε)-approximation. These results may lead to the first polynomial-time approximation algorithm that solves the CP-net aggregation problem for swaps with an approximation ratio substantially better than 2.



Paperid:1158
Authors:Luca Andolfi, Gianluca Cima, Marco Console, Maurizio Lenzerini
Sapienza University of Rome, Sapienza University of Rome, Sapienza University of Rome, Sapienza University of Rome
Abstract:
Query answering for Knowledge Bases (KBs) amounts to extracting information from the various models of a KB, and presenting the user with an object that represents such information. In the vast majority of cases, this object consists of those tuples of constants that satisfy the query expression either in every model (certain answers) or in some model (possible answers). However, similarly to the case of incomplete databases, both these forms of answers are a lossy representation of all the knowledge inferable from the query and the queried KB. In this paper, we illustrate a formal framework to characterize the information that query answers for KBs are able to represent. As a first application of the framework, we study the informativeness of current query answering approaches, including the recently introduced partial answers. We then define a novel notion of answers, allowing repetition of variables across answer tuples. We show that these answers are capable of representing a meaningful form of information, and we also study their data complexity properties.



Paperid:1159
Authors:Ofer Arieli, Kees van Berkel, Christian Straßer
Tel-Aviv Academic College, Ruhr University Bochum, Ruhr University Bochum
Abstract:
We present a novel computational approach to resolving conflicts among norms by nonmonotonic normative reasoning (in constrained I/O logics). Our approach extends standard sequent-based proof systems and makes them more adequate to nonmonotonic reasoning by adding to the sequents annotations that keep track of what is known about the defeasible status of the derived sequents. This makes transparent the reasons according to which norms should be applicable or inapplicable, and accordingly the sequents that make use of such norms are accepted or retracted. We also show that this proof-theoretic method has tight links to the semantics of formal argumentation frameworks. The outcome of this paper is thus a threefold characterization result that relates, in the context of nonmonotonic normative reasoning, three traditional ingredients of AI-based reasoning methods: maximally consistent sets of premises (in constrained I/O logics), derived sequents (which are accepted in corresponding annotated sequent calculi), and logical arguments (that belong to the grounded extensions of the induced logical argumentation frameworks).



Paperid:1160
Authors:Marco Calautti, Ester Livshits, Andreas Pieris, Markus Schneider
University of Milan, University of Edinburgh, University of Edinburgh University of Cyprus, University of Edinburgh
Abstract:
Explaining an answer to a Datalog query is an essential task towards Explainable AI, especially nowadays when Datalog plays a critical role in the development of ontology-based applications. A well-established approach for explaining a query answer is the so-called why-provenance, which essentially collects all the subsets of the input database that can be used to obtain that answer via some derivation process, typically represented as a proof tree. It is well known, however, that computing the why-provenance for Datalog queries is computationally expensive, and thus, very few attempts can be found in the literature. The goal of this work is to demonstrate how off-the-shelf SAT solvers can be exploited towards an efficient computation of the why-provenance for Datalog queries. Interestingly, our SAT-based approach allows us to build the why-provenance in an incremental fashion, that is, one explanation at a time, which is much more useful in a practical context than the one-shot computation of the whole set of explanations as done by existing approaches.
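
The incremental, one-explanation-at-a-time use of a SAT solver follows the familiar enumeration pattern of adding blocking clauses between calls. A generic sketch with PySAT, illustrating only this pattern and not the paper's actual encoding of proof trees or why-provenance:

    from pysat.solvers import Glucose3

    def enumerate_explanations(clauses, fact_vars, limit=10):
        """Enumerate up to `limit` distinct sets of input facts (over `fact_vars`)
        satisfying a CNF encoding `clauses`, one at a time via blocking clauses."""
        found = []
        with Glucose3(bootstrap_with=clauses) as solver:
            while len(found) < limit and solver.solve():
                model = set(solver.get_model())
                support = [v for v in fact_vars if v in model]
                found.append(support)
                # Block this explanation so the next call yields a different one.
                solver.add_clause([-v for v in support] +
                                  [v for v in fact_vars if v not in model])
        return found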



Paperid:1161
Authors:David M. Cerna, Andrew Cropper
Czech Academy of Sciences Institute of Computer Science, University of Oxford
Abstract:
The ability to generalise from a small number of examples is a fundamental challenge in machine learning. To tackle this challenge, we introduce an inductive logic programming (ILP) approach that combines negation and predicate invention. Combining these two features allows an ILP system to generalise better by learning rules with universally quantified body-only variables. We implement our idea in NOPI, which can learn normal logic programs with predicate invention, including Datalog programs with stratified negation. Our experimental results on multiple domains show that our approach can improve predictive accuracies and learning times.



Paperid:1162
Authors:Konrad K. Dabrowski, Eduard Eiben, Sebastian Ordyniak, Giacomo Paesani, Stefan Szeider
Newcastle University, Royal Holloway, University of London, University of Leeds, University of Leeds, TU Wien
Abstract:
We consider the NP-hard problem of finding a smallest decision tree representing a classification instance in terms of a partially defined Boolean function. Small decision trees are desirable to provide an interpretable model for the given data. We show that the problem is fixed-parameter tractable when parameterized by the rank-width of the incidence graph of the given classification instance. Our algorithm proceeds by dynamic programming using an NLC decomposition obtained from a rank-width decomposition. The key to the algorithm is a succinct representation of partial solutions. This allows us to limit the space and time requirements for each dynamic programming step in terms of the parameter.



Paperid:1163
Authors:Federica Di Stefano, Mantas Šimkus
TU Wien, TU Wien Umeå University
Abstract:
This paper studies a stable model semantics for Description Logic (DL) knowledge bases (KBs) and for (possibly cyclic) terminologies, ultimately showing that terminologies under the proposed semantics can be equipped with effective reasoning algorithms. The semantics is derived using Quantified Equilibrium Logic and, in contrast to the usual semantics of DLs based on classical logic, supports default negation and allows combining the open-world and closed-world assumptions in a natural way. Towards understanding the computational properties of this and related formalisms, we show a strong undecidability result that applies not only to KBs under the stable model semantics, but also to the more basic setting of minimal model reasoning. Specifically, we show that concept satisfiability in minimal models of an ALCIO KB is undecidable. We then turn our attention to (possibly cyclic) DL terminologies, where ontological axioms are limited to definitions of concept names in terms of complex concepts. This restriction still yields a very rich setting. We show that standard reasoning problems, like concept satisfiability and subsumption, are ExpTime-complete for terminologies expressed in ALCI under the stable model semantics.



Paperid:1164
Authors:Yannis Dimopoulos, Wolfgang Dvorak, Matthias König, Anna Rapberger, Markus Ulbricht, Stefan Woltran
University of Cyprus, Department of Computer Science, TU Wien, Institute of Logic and Computation, TU Wien, Institute of Logic and Computation, Imperial College London, Department of Computing, Leipzig University, Department of Computer Science, TU Wien, Institute of Logic and Computation
Abstract:
Assumption-based argumentation (ABA) is a powerful defeasible reasoning formalism which is based on the interplay of assumptions, their contraries, and inference rules. ABA with preferences (ABA+) generalizes the basic model by allowing qualitative comparison between assumptions. The integration of preferences, however, comes with a cost. In ABA+, the evaluation under two central and well-established semantics---grounded and complete semantics---is not guaranteed to yield an outcome. Moreover, while ABA frameworks without preferences allow for a graph-based representation in Dung-style frameworks, a corresponding instantiation for general ABA+ frameworks has not been established so far. In this work, we tackle both issues: First, we develop a novel abstract argumentation formalism based on set-to-set attacks. We show that our so-called Hyper Argumentation Frameworks (HYPAFs) capture ABA+. Second, we propose relaxed variants of complete and grounded semantics for HYPAFs that yield an extension for all frameworks by design, while still faithfully generalizing the established semantics of Dung-style Argumentation Frameworks. We exploit the newly established correspondence between ABA+ and HYPAFs to obtain variants for grounded and complete ABA+ semantics that are guaranteed to yield an outcome. Finally, we discuss basic properties and provide a complexity analysis. Along the way, we settle the computational complexity of several ABA+ semantics.



Paperid:1165
Authors:Thorsten Engesser, Andreas Herzig, Elise Perrotin
IRIT, Toulouse, France, IRIT, CNRS, Toulouse, France, CRIL, CNRS, Lens, France
Abstract:
Epistemic planning is useful in situations where multiple agents have different knowledge and beliefs about the world, such as in robot-human interaction. One aspect that has been largely neglected in the literature is planning with observations in the presence of false beliefs. This is a particularly challenging problem because it requires belief revision. We introduce a simple specification language for reasoning about actions with knowledge and belief. We demonstrate our approach on well-known false-belief tasks such as the Sally-Anne Task and compare it to other action languages. Our logic leads to an epistemic planning formalism that is expressive enough to model second-order false-belief tasks, yet has the same computational complexity as classical planning.



Paperid:1166
Authors:David Fernández-Duque, Yoàv Montacute
University of Barcelona, University of Cambridge
Abstract:
Dynamical systems are abstract models of interaction between space and time. They are often used in fields such as physics and engineering to understand complex processes, but due to their general nature, they have found applications for studying computational processes, interaction in multiagent systems, machine learning algorithms and other computer science related phenomena. In the vast majority of applications, a dynamical system consists of the action of a continuous `transition function' on a metric space. In this work, we consider decidable formal systems for reasoning about such structures. Spatial logics can be traced back to the 1940s, but our work follows a more dynamic turn that these logics have taken due to two recent developments: the study of the topological mu-calculus, and the integration of linear temporal logic with logics based on the Cantor derivative. In this paper, we combine dynamic topological logics based on the Cantor derivative and the `next point in time' operators with an expressively complete fixed point operator to produce a combination of the topological mu-calculus with linear temporal logic. We show that the resulting logics are decidable and have a natural axiomatisation. Moreover, we prove that these logics are complete for interpretations on the Cantor space, the rational numbers, and subspaces thereof.



Paperid:1167
Authors:Nicolas Fröhlich, Arne Meier
Leibniz Universität Hannover, Institut für Theoretische Informatik, Appelstrasse 9a, 30167 Hannover, Germany, Leibniz Universität Hannover, Institut für Theoretische Informatik, Appelstrasse 9a, 30167 Hannover, Germany
Abstract:
Expressing system specifications using Computation Tree Logic (CTL) formulas, formalising programs using Kripke structures, and then model checking the system is an established workflow in program verification and has wide applications in AI. In this paper, we consider the task of model enumeration, which asks for a uniform stream of output systems that satisfy the given specification. We show that, given a CTL formula and a system (potentially falsified by the formula), enumerating satisfying submodels is always hard for CTL, regardless of which subset of CTL operators is considered. As a silver lining on the horizon, we present fragments via restrictions on the allowed Boolean functions that still allow for fast enumeration.



Paperid:1168
Authors:Alessandro Gianola, Marco Montali, Sarah Winkler
INESC-ID/Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal, Faculty of Engineering, Free University of Bozen-Bolzano, Bolzano, Italy, Faculty of Engineering, Free University of Bozen-Bolzano, Bolzano, Italy
Abstract:
The need to model and analyse dynamic systems operating over complex data is ubiquitous in AI and neighboring areas, in particular business process management. Analysing such data-aware systems is a notoriously difficult problem, as they are intrinsically infinite-state. Existing approaches work for specific datatypes, and/or limit themselves to the verification of safety properties. In this paper, we lift both such limitations, studying for the first time linear-time verification for so-called data-aware processes modulo theories (DMTs), from the foundational and practical point of view. The DMT model is very general, as it supports processes operating over variables that can store arbitrary types of data, ranging over infinite domains and equipped with domain-specific predicates. Specifically, we provide four contributions. First, we devise a semi-decision procedure for linear-time verification of DMTs, which works for a very large class of datatypes obeying mild model-theoretic assumptions. The procedure relies on a unique combination of automata-theoretic and cover computation techniques to respectively deal with linear-time properties and datatypes. Second, we identify an abstract, semantic property that guarantees the existence of a faithful finite-state abstraction of the original system, and show that our method becomes a decision procedure in this case. Third, we identify concrete, checkable classes of systems that satisfy this property, generalising several results in the literature. Finally, we present an implementation and an experimental evaluation over a benchmark of real-world data-aware business processes.



Paperid:1169
Authors:Markus Hecher, Rafael Kiesel
Massachusetts Institute of Technology, Cambridge, USA, TU Wien, Vienna, Austria
Abstract:
Answer Set Programming (ASP) is a generic problem modeling and solving framework with a strong focus on knowledge representation and a rapid growth of industrial applications. So far, the study of complexity resulted in characterizing hardness and determining their sources, fine-grained insights in the form of dichotomy-style results, as well as detailed parameterized complexity landscapes. Unfortunately, for the well-known parameter treewidth, disjunctive programs require double-exponential runtime under reasonable complexity assumptions. This quickly becomes out of reach. We deal with the classification of structural parameters for disjunctive ASP on the program's rule structure (incidence graph). First, we provide a polynomial kernel to obtain single-exponential runtime in terms of vertex cover size, despite subset-minimization being not represented in the program's structure. Then we turn our attention to strictly better structural parameters between vertex cover size and treewidth. Here, we provide double-exponential lower bounds for the most prominent parameters in that range: treedepth, feedback vertex size, and cliquewidth. Based on this, we argue that unfortunately our options beyond vertex cover size are limited. Our results provide an in-depth hardness study, relying on a novel reduction from normal to disjunctive programs, trading the increase of complexity for an exponential parameter compression.



Paperid:1170
Authors:Thanh Lam Hoang, Marco Luca Sbodio, Marcos Martinez Galindo, Mykhaylo Zayats, Raul Fernandez-Diaz, Victor Valls, Gabriele Picco, Cesar Berrospi, Vanessa Lopez
IBM Research, Dublin, Ireland, IBM Research, Dublin, Ireland, IBM Research, Dublin, Ireland, IBM Research, Dublin, Ireland, IBM Research, Dublin, Ireland University College Dublin, Ireland, IBM Research, Dublin, Ireland, IBM Research, Dublin, Ireland, IBM Research, Zurich, Switzerland, IBM Research, Dublin, Ireland
Abstract:
Recent research on predicting the binding affinity between drug molecules and proteins uses representations learned, through unsupervised learning techniques, from large databases of molecule SMILES and protein sequences. While these representations have significantly enhanced the predictions, they are usually based on a limited set of modalities, and they do not exploit available knowledge about existing relations among molecules and proteins. Our study reveals that enhanced representations, derived from multimodal knowledge graphs describing relations among molecules and proteins, lead to state-of-the-art results in well-established benchmarks (first place in the leaderboard for the Therapeutics Data Commons benchmark "Drug-Target Interaction Domain Generalization Benchmark", with an improvement of 8 points with respect to the previous best result). Moreover, our results significantly surpass those achieved in standard benchmarks by using conventional pre-trained representations that rely only on sequence or SMILES data. We release our multimodal knowledge graphs, which integrate data from seven public data sources and contain over 30 million triples. Pretrained models from our proposed graphs and benchmark task source code are also released.



Paperid:1171
Authors:Céline Hocquette, Andreas Niskanen, Matti Järvisalo, Andrew Cropper
University of Oxford, University of Helsinki, University of Helsinki, University of Oxford
Abstract:
Many inductive logic programming approaches struggle to learn programs from noisy data. To overcome this limitation, we introduce an approach that learns minimal description length programs from noisy data, including recursive programs. Our experiments on several domains, including drug design, game playing, and program synthesis, show that our approach can outperform existing approaches in terms of predictive accuracies and scale to moderate amounts of noise.



Paperid:1172
Authors:Petr Illner, Petr Kučera
Charles University, Charles University
Abstract:
This paper integrates weak decomposable negation normal form (wDNNF) circuits, introduced by Akshay et al. in 2018, into the knowledge compilation map. This circuit type generalises decomposable negation normal form (DNNF) circuits in such a way that they allow a restricted form of sharing variables among the inputs of a conjunction node. We show that wDNNF circuits have the same properties as DNNF circuits regarding the queries and transformations presented in the knowledge compilation map, whilst being strictly more succinct than DNNF circuits (that is, they can represent Boolean functions compactly). We also present and evaluate a knowledge compiler, called Bella, for converting CNF formulae into wDNNF circuits. Our experiments demonstrate that wDNNF circuits are suitable for configuration instances.



Paperid:1173
Authors:Mohimenul Kabir, Supratik Chakraborty, Kuldeep S. Meel
National University of Singapore, IIT Bombay, University of Toronto
Abstract:
Answer Set Programming (ASP) has emerged as a promising paradigm in knowledge representation and automated reasoning owing to its ability to model hard combinatorial problems from diverse domains in a natural way. Building on advances in propositional SAT solving, the past two decades have witnessed the emergence of well-engineered systems for solving the answer set satisfiability problem, i.e., finding models or answer sets for a given answer set program. In recent years, there has been growing interest in problems beyond satisfiability, such as model counting, in the context of ASP. Akin to the early days of propositional model counting, state-of-the-art exact answer set counters do not scale well beyond small instances. Exact ASP counters struggle with handling larger input formulas. The primary contribution of this paper is a new ASP counting framework, called sharpASP, which counts answer sets avoiding larger input formulas. This relies on an alternative way of defining answer sets that allows lifting of key techniques developed in the context of propositional model counting. Our extensive empirical analysis over 1470 benchmarks demonstrates significant performance gain over current state-of-the-art exact answer set counters. Specifically, by using sharpASP, we were able to solve 1062 benchmarks with PAR2 score of 3082 whereas using prior state-of-the-art, we could only solve 895 benchmarks with PAR2 score of 4205, all other experimental conditions being the same.



Paperid:1174
Authors:Christian Kindermann, Anne-Marie George, Bijan Parsia, Uli Sattler
Stanford University, University of Oslo, University of Manchester, University of Manchester
Abstract:
In this paper, we introduce the problem of rewriting finite formal languages using syntactic macros such that the rewriting is minimal in size. We present polynomial-time algorithms to solve variants of this problem and show their correctness. To demonstrate the practical relevance of the proposed problems and the feasibility and effectiveness of our algorithms in practice, we apply these to biomedical ontologies authored in OWL. We find that such rewritings can significantly reduce the size of ontologies by capturing repeated expressions with macros. This approach not only offers valuable assistance in enhancing ontology quality and comprehension but can also be seen as a general methodology for evaluating features of rewriting systems (including syntactic macros, templates, or other forms of rewriting rules), which can be analyzed in terms of their influence on computational problems.



Paperid:1175
Authors:Nadezda Alexandrovna Knorozova, Alessandro Ronca
RelationalAI, University of Oxford
Abstract:
Recurrent Neural Cascades (RNCs) are the recurrent neural networks with no cyclic dependencies among recurrent neurons. This class of recurrent networks has received a lot of attention in practice. Besides training methods for a fixed architecture such as backpropagation, the cascade architecture naturally allows for constructive learning methods, where recurrent nodes are added incrementally one at a time, often yielding smaller networks. Furthermore, acyclicity amounts to a structural prior that even for the same number of neurons yields a more favourable sample complexity compared to a fully connected architecture. A central question is whether the advantages of the cascade architecture come at the cost of a reduced expressivity. We provide new insights into this question. We show that the regular languages captured by RNCs with sign and tanh activation with positive recurrent weights are the star-free regular languages. In order to establish our results we developed a novel framework where capabilities of RNCs are assessed by analysing which semigroups and groups a single neuron is able to implement. A notable implication of our framework is that RNCs can achieve the expressivity of all regular languages by introducing neurons that can implement groups.



Paperid:1176
Authors:Francesco Kriegel
Technische Universität Dresden
Abstract:
We present an FCA-based axiomatization method that produces a complete OWL 2 EL TBox (the terminological part of an OWL 2 EL ontology) from a graph dataset in at most exponential time. We describe technical details that allow for efficient implementation as well as variations that dispense with the computation of extremely large axioms, thereby rendering the approach applicable, albeit with some loss of completeness. Moreover, we evaluate the prototype on real-world datasets.



Paperid:1177
Authors:Sean Lamont, Michael Norrish, Amir Dezfouli, Christian Walder, Paul Montague
Australian National University Defence Science and Technology Group, Australian National University, BIMLOGIQ, Google DeepMind, Defence Science and Technology Group
Abstract:
Artificial Intelligence for Theorem Proving (AITP) has given rise to a plethora of benchmarks and methodologies, particularly in Interactive Theorem Proving (ITP). Research in the area is fragmented, with a diverse set of approaches being spread across several ITP systems. This presents a significant challenge to the comparison of methods, which are often complex and difficult to replicate. Addressing this, we present BAIT, a framework for the fair and streamlined comparison of learning approaches in ITP. We demonstrate BAIT's capabilities with an in-depth comparison, across several ITP benchmarks, of state-of-the-art architectures applied to the problem of formula embedding. We find that Structure Aware Transformers perform particularly well, improving on techniques associated with the original problem sets. BAIT also allows us to assess the end-to-end proving performance of systems built on interactive environments. This unified perspective reveals a novel end-to-end system that improves on prior work. We also provide a qualitative analysis, illustrating that improved performance is associated with more semantically-aware embeddings. By streamlining the implementation and comparison of Machine Learning algorithms in the ITP context, we anticipate BAIT will be a springboard for future research.



Paperid:1178
Authors:Viet-Man Le, Alexander Felfernig, Thi Ngoc Trang Tran, Mathias Uta
Graz University of Technology, Graz, Austria University of Economics, Hue University, Hue, Vietnam, Graz University of Technology, Graz, Austria, Graz University of Technology, Graz, Austria School of Hospitality and Tourism, Hue University, Hue, Vietnam, Siemens Energy AG, Germany
Abstract:
Conflict detection is relevant in various application scenarios, ranging from interactive decision-making to the diagnosis of faulty knowledge bases. Conflicts can be regarded as sets of constraints that cause an inconsistency. In many scenarios (e.g., constraint-based configuration), conflicts are repeatedly determined for the same or similar sets of constraints. This misses out on the valuable opportunity of leveraging knowledge reuse and the related potential performance improvements, which are extremely important, specifically in interactive constraint-based applications. In this paper, we show how to integrate knowledge reuse concepts into non-intrusive conflict detection. We introduce the InformedQX algorithm, which is a reuse-aware variant of QuickXPlain. The results of a related performance analysis with the Linux-2.6.3.33 configuration knowledge base show significant improvements in terms of runtime performance compared to QuickXPlain.
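
For reference, the baseline QuickXPlain that InformedQX makes reuse-aware computes a minimal conflict by divide and conquer; a compact sketch in which the consistency check is a placeholder for a solver call (the reuse layer of InformedQX is not shown):

    def quickxplain(background, constraints, consistent):
        """Minimal-conflict computation in the style of QuickXPlain.
        `consistent(constraint_set)` stands in for a constraint-solver call."""
        if consistent(background + constraints) or not constraints:
            return []                                 # nothing to report

        def qx(bg, delta, cs):
            if delta and not consistent(bg):
                return []
            if len(cs) == 1:
                return list(cs)
            half = len(cs) // 2
            c1, c2 = cs[:half], cs[half:]
            d2 = qx(bg + c1, c1, c2)
            d1 = qx(bg + d2, d2, c1)
            return d1 + d2

        return qx(background, [], constraints)

    # Toy example: constraints are sets of allowed values for one variable; a set of
    # constraints is consistent when their intersection is non-empty.
    cons = [{1, 2}, {2, 3}, {4}, {1, 2, 3}]
    is_ok = lambda cs: not cs or set.intersection(*map(set, cs))
    print(quickxplain([], cons, is_ok))               # a minimal conflict, here [{1, 2}, {4}]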



Paperid:1179
Authors:Yves Lesperance, Giuseppe De Giacomo, Maryam Rostamigiv, Shakil M. Khan
York University, Toronto, ON, Canada, University of Oxford, Oxford, UK, University of Regina, Regina, SK, Canada, University of Regina, Regina, SK, Canada
Abstract:
We present a general framework for abstracting agent behavior in multiagent synchronous games in the situation calculus, which provides a first-order representation of the state and allows us to model how plays depend on the data and objects involved. We represent such games as action theories of a special form called situation calculus synchronous game structures (SCSGSs), in which we have a single action "tick" whose effects depend on the combination of moves selected by the players. In our framework, one specifies both an abstract SCSGS and a concrete SCSGS, as well as a refinement mapping that specifies how each abstract move is implemented by a Golog program defined over the concrete SCSGS. We define notions of sound and complete abstraction with respect to a mapping over such SCSGS. To express strategic properties on the abstract and concrete games we adopt a first-order variant of alternating-time mu-calculus mu-ATL-FO. We show that we can exploit abstraction in verifying mu-ATL-FO properties of SCSGSs under the assumption that agents can always execute abstract moves to completion even if not fully controlling their outcomes.



Paperid:1180
Authors:Ziyang Li, Jiani Huang, Jason Liu, Felix Zhu, Eric Zhao, William Dodds, Neelay Velingker, Rajeev Alur, Mayur Naik
University of Pennsylvania, University of Pennsylvania, University of Pennsylvania, University of Pennsylvania, University of Pennsylvania, University of Pennsylvania, University of Pennsylvania, University of Pennsylvania, University of Pennsylvania
Abstract:
Foundation models have vast potential to enable diverse AI applications. The powerful yet incomplete nature of these models has spurred a wide range of mechanisms to augment them with capabilities such as in-context learning, information retrieval, and code interpreting. We propose Vieira, a declarative framework that unifies these mechanisms in a general solution for programming with foundation models. Vieira follows a probabilistic relational paradigm and treats foundation models as stateless functions with relational inputs and outputs. It supports neuro-symbolic applications by enabling the seamless combination of such models with logic programs, as well as complex, multi-modal applications by streamlining the composition of diverse sub-models. We implement Vieira by extending the Scallop compiler with a foreign interface that supports foundation models as plugins. We implement plugins for 12 foundation models including GPT, CLIP, and SAM. We evaluate Vieira on 9 challenging tasks that span language, vision, and structured and vector databases. Our evaluation shows that programs in Vieira are concise, can incorporate modern foundation models, and have comparable or better accuracy than competitive baselines.



Paperid:1181
Authors:Ke Liang, Lingyuan Meng, Sihang Zhou, Wenxuan Tu, Siwei Wang, Yue Liu, Meng Liu, Long Zhao, Xiangjun Dong, Xinwang Liu
School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, School of Intelligence Science and Technology, National University of Defense Technology, School of Computer, National University of Defense Technology, Intelligent Game and Decision Lab, School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, Qilu University of Technology, Qilu University of Technology, School of Computer, National University of Defense Technology
Abstract:
GraIL and its variants have shown their promising capacities for inductive relation reasoning on knowledge graphs. However, the unidirectional message-passing mechanism hinders such models from exploiting hidden mutual relations between entities in directed graphs. Besides, the enclosing subgraph extraction in most GraIL-based models restricts the model from extracting enough discriminative information for reasoning. Consequently, the expressive ability of these models is limited. To address the problems, we propose a novel GraIL-based framework, termed MINES, by introducing a Message Intercommunication mechanism on the Neighbor-Enhanced Subgraph. Concretely, the message intercommunication mechanism is designed to capture the omitted hidden mutual information. It introduces bi-directed information interactions between connected entities by inserting an undirected/bi-directed GCN layer between uni-directed RGCN layers. Moreover, inspired by the success of involving more neighbors in other graph-based tasks, we extend the neighborhood area beyond the enclosing subgraph to enhance the information collection for inductive relation reasoning. Extensive experiments prove the promising capacity of the proposed MINES from various aspects, especially its superiority, effectiveness, and transfer ability.
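The idea of inserting an undirected/bi-directed GCN layer between uni-directed RGCN layers can be pictured with a small PyTorch Geometric sketch. The layer sizes, activation choices, and the edge-reversal trick below are illustrative assumptions of mine, not the authors' implementation.

import torch
from torch_geometric.nn import RGCNConv, GCNConv

class InterCommBlock(torch.nn.Module):
    """One directed RGCN layer, one bi-directed GCN 'intercommunication' layer,
    then another directed RGCN layer."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.rgcn_in = RGCNConv(dim, dim, num_relations)
        self.gcn_mid = GCNConv(dim, dim)
        self.rgcn_out = RGCNConv(dim, dim, num_relations)

    def forward(self, x, edge_index, edge_type):
        x = torch.relu(self.rgcn_in(x, edge_index, edge_type))
        # Add reversed edges so the middle layer exchanges messages in both directions.
        undirected = torch.cat([edge_index, edge_index.flip(0)], dim=1)
        x = torch.relu(self.gcn_mid(x, undirected))
        return self.rgcn_out(x, edge_index, edge_type)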



Paperid:1182
Authors:Kuldeep S. Meel, Supratik Chakraborty, S. Akshay
University of Toronto, Canada, Indian Institute of Technology Bombay, Mumbai, India, Indian Institute of Technology Bombay, Mumbai, India
Abstract:
The problem of model counting, i.e., counting satisfying assignments of a Boolean formula, is a fundamental problem in computer science, with diverse applications. Given #P-hardness of the problem, many algorithms have been developed over the years to provide an approximate model count. Recently, building on the practical success of SAT-solvers used as NP oracles, the focus has shifted from theory to practical implementations of such algorithms. This has brought to focus new challenges. In this paper, we consider one such challenge – that of auditable deterministic approximate model counters wherein a counter should also generate a certificate, which allows a user (often with limited computational power) to independently audit whether the count returned by an invocation of the algorithm is indeed within the promised bounds. We start by examining a celebrated approximate model counting algorithm due to Stockmeyer that uses polynomially many calls to a \Sigma^2_P oracle, and show that it can be audited via a \Pi^2_P formula on O(n^2 log^2 n) variables, where n is the number of variables in the original formula. Since n is often large (tens to hundreds of thousands) for typical instances, we ask if the count of variables in the certificate formula can be reduced – a critical question towards potential implementation. We show that this improvement in certification can be achieved with a tradeoff in the counting algorithm’s complexity. Specifically, we develop new deterministic approximate model counting algorithms that invoke a \Sigma^3_P oracle, but can be certified using a \Pi^2_P formula on fewer variables: our final algorithm uses just O(n log n) variables. Our study demonstrates that one can simplify certificate checking significantly if we allow the counting algorithm to access a slightly more powerful oracle. We believe this shows for the first time how the audit complexity can be traded for the complexity of approximate counting.



Paperid:1183
Authors:Sebastian Ordyniak, Giacomo Paesani, Mateusz Rychlicki, Stefan Szeider
University of Leeds, University of Leeds, University of Leeds, TU Wien
Abstract:
We develop a general algorithmic framework that allows us to obtain fixed-parameter tractability for computing smallest symbolic models that represent given data. Our framework applies to all ML model types that admit a certain extension property. By showing this extension property for decision trees, decision sets, decision lists, and binary decision diagrams, we obtain that minimizing these fundamental model types is fixed-parameter tractable. Our framework even applies to ensembles, which combine individual models by majority decision.



Paperid:1184
Authors:Julian Parsert, Elizabeth Polgreen
University of Oxford University of Edinburgh, University of Edinburgh
Abstract:
Program synthesis is the task of automatically generating code based on a specification. In Syntax-Guided Synthesis (SyGuS) this specification is a combination of a syntactic template and a logical formula, and the result is guaranteed to satisfy both. We present a reinforcement-learning guided algorithm for SyGuS which uses Monte-Carlo Tree Search (MCTS) to search the space of candidate solutions. Our algorithm learns policy and value functions which, combined with the upper confidence bound for trees, allow it to balance exploration and exploitation. A common challenge in applying machine learning approaches to syntax-guided synthesis is the scarcity of training data. To address this, we present a method for automatically generating training data for SyGuS based on anti-unification of existing first-order satisfiability problems, which we use to train our MCTS policy. We implement and evaluate this setup and demonstrate that the learned policy and value functions improve the synthesis performance over a baseline by over 26 percentage points in the training and testing sets. Our tool outperforms the state-of-the-art tool cvc5 on the training set and performs comparably in terms of the total number of problems solved on the testing set (solving 23% of the benchmarks on which cvc5 fails). We make our data set publicly available, to enable further application of machine learning methods to the SyGuS problem.
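One standard way to combine a learned policy prior and value estimates with an upper confidence bound for trees is the PUCT rule used in AlphaZero-style search. The sketch below shows only that selection step, under my assumption that a PUCT-like combination is used; the abstract does not spell out the exact formula.

import math

class Node:
    def __init__(self, prior):
        self.prior = prior       # probability the learned policy assigns to this expansion
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}       # grammar production -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.5):
    """Pick the child maximizing the value estimate plus a policy-weighted exploration bonus."""
    total = sum(child.visits for child in node.children.values())
    def score(child):
        return child.q() + c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
    return max(node.children.items(), key=lambda kv: score(kv[1]))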



Paperid:1185
Authors:Andoni Rodríguez, César Sánchez
IMDEA Software Institute, Madrid, Spain Universidad Politécnica de Madrid, Madrid, Spain, IMDEA Software Institute, Madrid, Spain
Abstract:
Reactive synthesis is the process of generating correct controllers from temporal logic specifications. Typically, synthesis is restricted to Boolean specifications in LTL. Recently, a Boolean abstraction technique has made it possible to translate LTL_T specifications that contain literals in theories into equi-realizable LTL specifications, but no full synthesis procedure exists yet. In synthesis modulo theories, the system receives valuations of environment variables (from a first-order theory T) and outputs valuations of system variables from T. In this paper, we address how to synthesize a full controller using a combination of the static Boolean controller obtained from the Booleanized LTL specification together with on-the-fly queries to a solver that produces models of satisfiable existential T formulae. This is the first synthesis method for LTL modulo theories. Additionally, our method can produce adaptive responses, which increases explainability and can improve runtime properties like performance. Our approach is applicable to both LTL modulo theories and LTLf modulo theories.



Paperid:1186
Authors:Zeynep G. Saribatur, Stefan Woltran
TU Wien, TU Wien
Abstract:
Answer Set Programming (ASP) is a prominent rule-based language for knowledge representation and reasoning with roots in logic programming and non-monotonic reasoning. The aim to capture the essence of removing (ir)relevant details in ASP programs led to the investigation of different notions, from strong persistence (SP) forgetting, to faithful abstractions, and, recently, strong simplifications, where the latter two can be seen as relaxed and strengthened notions of forgetting, respectively. Although it was observed that these notions are related, especially given that they have characterizations through the semantics for strong equivalence, it remained unclear whether they can be brought together. In this work, we bridge this gap by introducing a novel relativized equivalence notion, which is a relaxation of the recent simplification notion, that is able to capture all related notions from the literature. We provide the necessary and sufficient conditions for relativized simplifiability, which show that the challenging part arises when the context programs do not contain all the atoms to be removed. We then introduce an operator that combines projection and a relaxation of SP-forgetting to obtain the relativized simplifications. We furthermore provide complexity results that complete the overall picture.



Paperid:1187
Authors:Nicolas Schwind, Katsumi Inoue, Sébastien Konieczny, Pierre Marquis
National Institute of Advanced Industrial Science and Technology, Tokyo, Japan, National Institute of Informatics, Tokyo, Japan Graduate Institute for Advanced Studies, SOKENDAI, Tokyo, Japan, Univ. Artois, CNRS, CRIL, Lens, France, Univ. Artois, CNRS, CRIL, Lens, France Institut Universitaire de France
Abstract:
This paper presents BeliefFlow, a novel framework for representing how logical beliefs spread among interacting agents within a network. In a Belief Flow Network (BFN), agents communicate asynchronously. The agents' beliefs are represented using epistemic states, which encompass their current beliefs and conditional beliefs guiding future changes. When communication occurs between two connected agents, the receiving agent changes its epistemic state using an improvement operator, a well-known type of rational iterated belief change operator that generalizes belief revision operators. We show that BFNs satisfy appealing properties, leading to two significant outcomes. First, in any BFN with strong network connectivity, the beliefs of all agents converge towards a global consensus. Second, within any BFN, we show that it is possible to compute an optimal strategy for influencing the global beliefs. This strategy, which involves controlling the beliefs of a minimal number of agents through bribery, can be identified from the topology of the network and can be computed in polynomial time.



Paperid:1188
Authors:Yexing Song, Meilin Wang, Zhijing Yang, Xiaoyu Xian, Yukai Shi
Guangdong University of Technology, Guangdong University of Technology, Guangdong University of Technology, CRRC Academy, Guangdong University of Technology
Abstract:
The capability of video super-resolution (VSR) to synthesize high-resolution (HR) video from ideal datasets has been demonstrated in many works. However, applying the VSR model to real-world video with unknown and complex degradation remains a challenging task. First, existing degradation metrics in most VSR methods are not able to effectively simulate real-world noise and blur. Instead, simple combinations of classical degradations are used for real-world noise modeling, which often leaves the VSR model vulnerable to out-of-distribution noise. Second, many SR models focus on noise simulation and transfer. Nevertheless, the sampled noise is monotonous and limited. To address the aforementioned problems, we propose a Negatives augmentation strategy for generalized noise modeling in the Video Super-Resolution (NegVSR) task. Specifically, we first propose sequential noise generation toward real-world data to extract practical noise sequences. Then, the degeneration domain is widely expanded by negative augmentation to build up various yet challenging real-world noise sets. We further propose the augmented negative guidance loss to learn robust features among augmented negatives effectively. Extensive experiments on real-world datasets (e.g., VideoLQ and FLIR) show that our method outperforms state-of-the-art methods with clear margins, especially in visual quality. Project page is available at: https://negvsr.github.io/.



Paperid:1189
Authors:Giang Trinh, Belaid Benhamou, Samuel Pastva, Sylvain Soliman
LIRICA team, LIS, Aix-Marseille University, Marseille, France, LIRICA team, LIS, Aix-Marseille University, Marseille, France, Institute of Science and Technology, Klosterneuburg, Austria, Lifeware team, Inria Saclay, Palaiseau, France
Abstract:
Boolean Networks (BNs) are widely used as a modeling formalism in several domains, notably systems biology and computer science. A fundamental problem in BN analysis is the enumeration of trap spaces, which are hypercubes in the state space that cannot be escaped once entered. Several methods have been proposed for enumerating trap spaces; however, they often suffer from scalability and efficiency issues, particularly for large and complex models. To our knowledge, the most efficient and recent methods for trap space enumeration all rely on Answer Set Programming (ASP), which has been widely applied to the analysis of BNs. Motivated by these considerations, our work proposes a new method for enumerating trap spaces in BNs using ASP. We evaluate the method on a mix of 250+ real-world and 400+ randomly generated BNs, showing that it enables analysis of models beyond the capabilities of existing tools (namely pyboolnet, mpbn, trappist, and trapmvn).



Paperid:1190
Authors:Markus Ulbricht, Nico Potyka, Anna Rapberger, Francesca Toni
Leipzig University, Cardiff University, Imperial College London, Imperial College London
Abstract:
Assumption-based Argumentation (ABA) is a well-known structured argumentation formalism, whereby arguments and attacks between them are drawn from rules, defeasible assumptions and their contraries. A common restriction imposed on ABA frameworks (ABAFs) is that they are flat, i.e. each of the defeasible assumptions can only be assumed, but not derived. While it is known that flat ABAFs can be translated into abstract argumentation frameworks (AFs) as proposed by Dung, no translation exists from general, possibly non-flat ABAFs into any kind of abstract argumentation formalism. In this paper, we close this gap and show that bipolar AFs (BAFs) can instantiate general ABAFs. To this end we develop suitable, novel BAF semantics which borrow from the notion of deductive support. We investigate basic properties of our BAFs, including computational complexity, and prove the desired relation to ABAFs under several semantics.



Paperid:1191
Authors:Zongshun Wang, Yuping Shen
Institute of Logic and Cognition, Department of Philosophy, Sun Yat-sen University, P.R.China, Institute of Logic and Cognition, Department of Philosophy, Sun Yat-sen University, P.R.China
Abstract:
Argumentation is a reasoning model for evaluating arguments. Recently, gradual semantics has received considerable attention in weighted argumentation, which assigns an acceptability degree to each argument as its strength. In this paper, we aim to enhance gradual semantics by non-reciprocally incorporating the notion of rejectability degree. Such a setting offers a bilateral perspective on argument strength, enabling more comprehensive argument evaluations in practical situations. To this end, we first provide a set of principles for our semantics, taking both the acceptability and rejectability degrees into account, and propose three novel semantics conforming to the above principles. These semantics are defined as the limits of iterative sequences that always converge in any given weighted argumentation system, making them preferable for real-world applications.



Paperid:1192
Authors:Marco Wilhelm, Gabriele Kern-Isberner
TU Dortmund University, TU Dortmund University
Abstract:
It is well-known from probability theory that network-based methods like Bayesian networks constitute remarkable frameworks for efficient probabilistic reasoning. In this paper, we focus on qualitative default reasoning based on Spohn’s ranking functions, for which network-based methods have not yet been studied satisfactorily. With constraint networks, we develop a framework for iterative calculations of c-representations, a family of ranking models of conditional belief bases that show outstanding properties from both a commonsense and a formal point of view and are characterized by assigning possible worlds a degree of implausibility by penalizing the falsification of conditionals. Constraint networks unveil the dependencies among these penalty points (and hence among the conditionals) and make it possible to compute the penalty points locally on so-called safe sub-bases. As an application of our framework, we show that skeptical c-inferences can be drawn locally from safe sub-bases without losing validity.



Paperid:1193
Authors:Xinyue Zhang, Pan Hu, Yavor Nenov, Ian Horrocks
University of Oxford, Shanghai Jiao Tong University, Oxford Semantic Technologies, University of Oxford
Abstract:
Materialisation facilitates Datalog reasoning by precomputing all consequences of the facts and the rules so that queries can be directly answered over the materialised facts. However, storing all materialised facts may be infeasible in practice, especially when the rules are complex and the given set of facts is large. We observe that for certain combinations of rules, there exist data structures that compactly represent the reasoning result and can be efficiently queried when necessary. In this paper, we present a general framework that allows for the integration of such optimised storage schemes with standard materialisation algorithms. Moreover, we devise optimised storage schemes targeting transitive rules and union rules, two types of (combinations of) rules that commonly occur in practice. Our experimental evaluation shows that our approach significantly reduces memory consumption, sometimes by orders of magnitude, while remaining competitive in terms of query answering time.
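For the transitive-rule case, the general idea of a compact, queryable representation can be illustrated by storing only the base edges of a transitively closed predicate and answering reachability on demand, instead of materialising the quadratic closure. This is a generic illustration under my own simplifications, not the storage scheme developed in the paper.

from collections import deque, defaultdict

class TransitiveStore:
    """Stores base edges only; holds(a, b) answers whether the transitive
    closure contains the fact, via on-demand breadth-first search."""
    def __init__(self):
        self.succ = defaultdict(set)

    def add_edge(self, a, b):
        self.succ[a].add(b)

    def holds(self, a, b):
        seen, frontier = {a}, deque([a])
        while frontier:
            x = frontier.popleft()
            if b in self.succ[x]:
                return True
            for y in self.succ[x]:
                if y not in seen:
                    seen.add(y)
                    frontier.append(y)
        return False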



Paperid:1194
Authors:Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, Michalis Vazirgiannis
Laboratoire d’Informatique (LIX), École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France, Laboratoire d’Informatique (LIX), École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France, Epigenetics and Cell Fate, CNRS UMR7216, Université Paris Cité, F75013 Paris, France. Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus, Laboratoire d’Informatique (LIX), École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France
Abstract:
In recent years, significant progress has been made in the field of protein function prediction with the development of various machine-learning approaches. However, most existing methods formulate the task as a multi-classification problem, i.e. assigning predefined labels to proteins. In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free text style, moving beyond the conventional binary or categorical classifications. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types including protein sequence, structure, and textual annotation and description. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions. To evaluate our model, we extracted a multimodal protein dataset from SwissProt and demonstrate empirically the effectiveness of Prot2Text. These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate function prediction of existing as well as first-to-see proteins.



Paperid:1195
Authors:Sonny Achten, Francesco Tonin, Panagiotis Patrinos, Johan A.K. Suykens
KU Leuven, KU Leuven, KU Leuven, KU Leuven
Abstract:
We present a deep Graph Convolutional Kernel Machine (GCKM) for semi-supervised node classification in graphs. The method is built from two main types of blocks: (i) We introduce unsupervised kernel machine layers propagating the node features in a one-hop neighborhood, using implicit node feature mappings. (ii) We specify a semi-supervised classification kernel machine through the lens of the Fenchel-Young inequality. We derive an effective initialization scheme and an efficient end-to-end training algorithm in the dual variables for the full architecture. The main idea underlying GCKM is that, because of the unsupervised core, the final model can achieve higher performance in semi-supervised node classification when few labels are available for training. Experimental results demonstrate the effectiveness of the proposed framework.



Paperid:1196
Authors:Nimesh Agrawal, Anuj Kumar Sirohi, Sandeep Kumar, Jayadeva
Department of Electrical Engineering, Indian Institute of Technology, Delhi, India, Yardi School of Artificial Intelligence, Indian Institute of Technology, Delhi, India, Department of Electrical Engineering, Indian Institute of Technology, Delhi, India Yardi School of Artificial Intelligence, Indian Institute of Technology, Delhi, India, Department of Electrical Engineering, Indian Institute of Technology, Delhi, India Yardi School of Artificial Intelligence, Indian Institute of Technology, Delhi, India
Abstract:
Ensuring fairness in Recommendation Systems (RSs) across demographic groups is critical due to the increased integration of RSs in applications such as personalized healthcare, finance, and e-commerce. Graph-based RSs play a crucial role in capturing intricate higher-order interactions among entities. However, integrating these graph models into the Federated Learning (FL) paradigm with fairness constraints poses formidable challenges, as this requires access to the entire interaction graph and sensitive user information (such as gender, age, etc.) at the central server. This paper addresses the pervasive issue of inherent bias within RSs for different demographic groups without compromising the privacy of sensitive user attributes in an FL environment with a graph-based model. To address the group bias, we propose F2PGNN (Fair Federated Personalized Graph Neural Network), a novel framework that leverages the power of Personalized Graph Neural Networks (GNNs) coupled with fairness considerations. Additionally, we use differential privacy techniques to fortify privacy protection. Experimental evaluation on three publicly available datasets showcases the efficacy of F2PGNN in mitigating group unfairness by 47% ∼ 99% compared to the state-of-the-art while preserving privacy and maintaining the utility. The results validate the significance of our framework in achieving equitable and personalized recommendations using GNNs within the FL landscape. Source code is at: https://github.com/nimeshagrawal/F2PGNN-AAAI24



Paperid:1197
Authors:Alaleh Ahmadianshalchi, Syrine Belakaria, Janardhan Rao Doppa
Washington State University, Stanford University, Washington State University
Abstract:
We consider the problem of multi-objective optimization (MOO) of expensive black-box functions with the goal of discovering high-quality and diverse Pareto fronts where we are allowed to evaluate a batch of inputs. This problem arises in many real-world applications including penicillin production where diversity of solutions is critical. We solve this problem in the framework of Bayesian optimization (BO) and propose a novel approach referred to as Pareto front-Diverse Batch Multi-Objective BO (PDBO). PDBO tackles two important challenges: 1) How to automatically select the best acquisition function in each BO iteration, and 2) How to select a diverse batch of inputs by considering multiple objectives. We propose principled solutions to address these two challenges. First, PDBO employs a multi-armed bandit approach to select one acquisition function from a given library. We solve a cheap MOO problem by assigning the selected acquisition function for each expensive objective function to obtain a candidate set of inputs for evaluation. Second, it utilizes Determinantal Point Processes (DPPs) to choose a Pareto-front-diverse batch of inputs for evaluation from the candidate set obtained from the first step. The key parameters for the methods behind these two steps are updated after each round of function evaluations. Experiments on multiple MOO benchmarks demonstrate that PDBO outperforms prior methods in terms of both the quality and diversity of Pareto solutions.
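The second step, choosing a diverse batch with a DPP, is commonly approximated by greedy MAP inference over a similarity kernel. The sketch below shows that generic greedy step on a kernel matrix L; the construction of L over Pareto-front candidates is left out, and this should not be read as PDBO's exact procedure.

import numpy as np

def greedy_dpp_batch(L, k):
    """Greedily select k indices approximately maximizing log det(L[S, S])."""
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected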



Paperid:1198
Authors:Motasem Alfarra, Zhipeng Cai, Adel Bibi, Bernard Ghanem, Matthias Müller
Intel Labs; King Abdullah University of Science and Technology (KAUST), Intel Labs, University of Oxford, King Abdullah University of Science and Technology (KAUST), Intel Labs
Abstract:
Continual Learning is a step towards lifelong intelligence where models continuously learn from recently collected data without forgetting previous knowledge. Existing continual learning approaches mostly focus on image classification in the class-incremental setup with clear task boundaries and unlimited computational budget. This work explores the problem of Online Domain-Incremental Continual Segmentation (ODICS), where the model is continually trained over batches of densely labeled images from different domains, with limited computation and no information about the task boundaries. ODICS arises in many practical applications. In autonomous driving, this may correspond to the realistic scenario of training a segmentation model over time on a sequence of cities. We analyze several existing continual learning methods and show that they perform poorly in this setting despite working well in class-incremental segmentation. We propose SimCS, a parameter-free method complementary to existing ones that uses simulated data to regularize continual learning. Experiments show that SimCS provides consistent improvements when combined with different CL methods.



Paperid:1199
Authors:Meshal Alharbi, Mardavij Roozbehani, Munther Dahleh
Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology
Abstract:
The problem of sample complexity of online reinforcement learning is often studied in the literature without taking into account any partial knowledge about the system dynamics that could potentially accelerate the learning process. In this paper, we study the sample complexity of online Q-learning methods when some prior knowledge about the dynamics is available or can be learned efficiently. We focus on systems that evolve according to an additive disturbance model of the form S_{h+1} = f(S_h, A_h) + W_h, where f represents the underlying system dynamics, and W_h are unknown disturbances independent of states and actions. In the setting of finite episodic Markov decision processes with S states, A actions, and episode length H, we present an optimistic Q-learning algorithm that achieves Õ(Poly(H)√T) regret under perfect knowledge of f, where T is the total number of interactions with the system. This is in contrast to the typical Õ(Poly(H)√SAT) regret for existing Q-learning methods. Further, if only a noisy estimate f_hat of f is available, our method can learn an approximately optimal policy in a number of samples that is independent of the cardinalities of state and action spaces. The sub-optimality gap depends on the approximation error f_hat − f, as well as the Lipschitz constant of the corresponding optimal value function. Our approach does not require modeling of the transition probabilities and enjoys the same memory complexity as model-free methods.
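The benefit of knowing f can be seen in a toy planning sketch: because the disturbance W_h is independent of states and actions, a single empirical disturbance distribution shared across all (s, a) pairs suffices, so nothing has to be estimated per state-action pair. This is plain finite-horizon value iteration for illustration only, not the paper's optimistic Q-learning algorithm or its exploration bonuses; f and reward are assumed to be callables over numeric states.

from collections import Counter

def plan_with_known_f(f, disturbance_samples, states, actions, reward, H):
    """Finite-horizon value iteration with known dynamics f and an empirical,
    shared disturbance distribution (numeric states assumed for simplicity)."""
    counts = Counter(disturbance_samples)
    total = sum(counts.values())
    V = {H: {s: 0.0 for s in states}}
    policy = {}
    for h in reversed(range(H)):
        V[h], policy[h] = {}, {}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                q = reward(s, a)
                for w, c in counts.items():
                    s_next = f(s, a) + w          # additive disturbance model
                    q += (c / total) * V[h + 1].get(s_next, 0.0)
                if q > best_q:
                    best_a, best_q = a, q
            V[h][s], policy[h][s] = best_q, best_a
    return V, policy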



Paperid:1200
Authors:Nicholas Alonso, Jeffrey Krichmar, Emre Neftci
Department of Cognitive Science, University of California, Irvine, Department of Cognitive Science, University of California, Irvine Department of Computer Science, University of California, Irvine, Electrical Engineering and Information Technology RWTH Aachen, Germany Peter Grünberg Institute, Forschungszentrum Jülich, Germany
Abstract:
Backpropagation (BP), the standard learning algorithm for artificial neural networks, is often considered biologically implausible. In contrast, the standard learning algorithm for predictive coding (PC) models in neuroscience, known as the inference learning algorithm (IL), is a promising, bio-plausible alternative. However, several challenges and questions hinder IL's application to real-world problems. For example, IL is computationally demanding, and without memory-intensive optimizers like Adam, IL may converge to poor local minima. Moreover, although IL can reduce loss more quickly than BP, the reasons for these speedups and their robustness remain unclear. In this paper, we tackle these challenges by 1) altering the standard implementation of PC circuits to substantially reduce computation, 2) developing a novel optimizer that improves the convergence of IL without increasing memory usage, and 3) establishing theoretical results that help elucidate the conditions under which IL is sensitive to second and higher-order information.



Paperid:1201
Authors:Hilal AlQuabeh, William de Vazelhes, Bin Gu
MBZUAI, MBZUAI, MBZUAI Jilin University
Abstract:
Pairwise learning, an important domain within machine learning, addresses loss functions defined on pairs of training examples, including those in metric learning and AUC maximization. Acknowledging the quadratic growth in computation complexity accompanying pairwise loss as the sample size grows, researchers have turned to online gradient descent (OGD) methods for enhanced scalability. Recently, an OGD algorithm emerged, employing gradient computation involving prior and most recent examples, a step that effectively reduces algorithmic complexity to O(T), with T being the number of received examples. This approach, however, confines itself to linear models while assuming the independence of example arrivals. We introduce a lightweight OGD algorithm that does not require the independence of examples and generalizes to kernel pairwise learning. Our algorithm builds the gradient based on a random example and a moving average representing the past data, which results in a sublinear regret bound with a complexity of O(T). Furthermore, through the integration of O(√T log T) random Fourier features, the complexity of kernel calculations is effectively minimized. Several experiments with real-world datasets show that the proposed technique outperforms kernel and linear algorithms in offline and online scenarios.
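The moving-average idea can be illustrated for AUC-style pairwise learning: each incoming example is paired with the running mean of previously seen examples of the opposite class, so one update costs O(d) instead of pairing against the whole history. This is a hedged sketch of the general idea with a pairwise squared hinge loss, not the paper's algorithm or its kernelized variant.

import numpy as np

def online_pairwise_sgd(stream, dim, lr=0.1):
    """stream yields (x, y) with x a length-dim array and y in {-1, +1}."""
    w = np.zeros(dim)
    mean = {+1: np.zeros(dim), -1: np.zeros(dim)}
    count = {+1: 0, -1: 0}
    for x, y in stream:
        if count[-y] > 0:
            diff = y * (x - mean[-y])          # positive-minus-negative direction
            margin = w @ diff
            if margin < 1.0:                   # pairwise squared hinge is active
                w += lr * (1.0 - margin) * diff
        count[y] += 1
        mean[y] += (x - mean[y]) / count[y]    # class-wise moving average of the past
    return w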



Paperid:1202
Authors:Patrick Altmeyer, Mojtaba Farmanbar, Arie van Deursen, Cynthia C. S. Liem
Delft University of Technology, ING Bank, Delft University of Technology, Delft University of Technology
Abstract:
Counterfactual explanations offer an intuitive and straightforward way to explain black-box models and offer algorithmic recourse to individuals. To address the need for plausible explanations, existing work has primarily relied on surrogate models to learn how the input data is distributed. This effectively reallocates the task of learning realistic explanations for the data from the model itself to the surrogate. Consequently, the generated explanations may seem plausible to humans but need not necessarily describe the behaviour of the black-box model faithfully. We formalise this notion of faithfulness through the introduction of a tailored evaluation metric and propose a novel algorithmic framework for generating Energy-Constrained Conformal Counterfactuals (ECCCo) that are only as plausible as the model permits. Through extensive empirical studies, we demonstrate that ECCCo reconciles the need for faithfulness and plausibility. In particular, we show that for models with gradient access, it is possible to achieve state-of-the-art performance without the need for surrogate models. To do so, our framework relies solely on properties defining the black-box model itself by leveraging recent advances in energy-based modelling and conformal prediction. To our knowledge, this is the first venture in this direction for generating faithful counterfactual explanations. Thus, we anticipate that ECCCo can serve as a baseline for future research. We believe that our work opens avenues for researchers and practitioners seeking tools to better distinguish trustworthy from unreliable models.



Paperid:1203
Authors:Ehsan Amid, Frank Nielsen, Richard Nock, Manfred K. Warmuth
Google DeepMind, Sony Computer Science Laboratories Inc., Google Research, Google Research
Abstract:
In the field of optimal transport, two prominent subfields face each other: (i) unregularized optimal transport, ``à la Kantorovich'', which leads to extremely sparse plans but with algorithms that scale poorly, and (ii) entropic-regularized optimal transport, ``à la Sinkhorn-Cuturi'', which gets near-linear approximation algorithms but leads to maximally un-sparse plans. In this paper, we show that an extension of the latter to tempered exponential measures, a generalization of exponential families with indirect measure normalization, gets to a very convenient middle ground, with both very fast approximation algorithms and sparsity, which is under control up to sparsity patterns. In addition, our formulation fits naturally in the unbalanced optimal transport problem setting.



Paperid:1204
Authors:Shengwei An, Sheng-Yen Chou, Kaiyuan Zhang, Qiuling Xu, Guanhong Tao, Guangyu Shen, Siyuan Cheng, Shiqing Ma, Pin-Yu Chen, Tsung-Yi Ho, Xiangyu Zhang
Purdue University, The Chinese University of Hong Kong, Purdue University, Purdue University, Purdue University, Purdue University, Purdue University, University of Massachusetts Amherst, IBM Research, The Chinese University of Hong Kong, Purdue University
Abstract:
Diffusion models (DMs) have become state-of-the-art generative models because of their capability of generating high-quality images from noise without adversarial training. However, they are vulnerable to backdoor attacks as reported by recent studies. When a data input (e.g., some Gaussian noise) is stamped with a trigger (e.g., a white patch), the backdoored model always generates the target image (e.g., an improper photo). However, effective defense strategies to mitigate backdoors from DMs are underexplored. To bridge this gap, we propose the first backdoor detection and removal framework for DMs. We evaluate our framework Elijah on hundreds of DMs of 3 types, including DDPM, NCSN and LDM, with 13 samplers against 3 existing backdoor attacks. Extensive experiments show that our approach achieves close to 100% detection accuracy and reduces the backdoor effects to close to zero without significantly sacrificing the model utility.



Paperid:1205
Authors:Wenbin An, Feng Tian, Wenkai Shi, Yan Chen, Yaqiang Wu, Qianying Wang, Ping Chen
School of Automation Science and Engineering, Xi'an Jiaotong University National Engineering Laboratory for Big Data Analytics, School of Computer Science and Technology, Xi'an Jiaotong University National Engineering Laboratory for Big Data Analytics, School of Automation Science and Engineering, Xi'an Jiaotong University National Engineering Laboratory for Big Data Analytics, School of Computer Science and Technology, Xi'an Jiaotong University, Lenovo Research, Lenovo Research, Department of Engineering, University of Massachusetts Boston
Abstract:
Generalized Category Discovery (GCD) is a crucial real-world task that aims to recognize both known and novel categories from an unlabeled dataset by leveraging another labeled dataset with only known categories. Despite the improved performance on known categories, current methods perform poorly on novel categories. We attribute the poor performance to two reasons: biased knowledge transfer between labeled and unlabeled data and noisy representation learning on the unlabeled data. The former leads to unreliable estimation of learning targets for novel categories and the latter hinders models from learning discriminative features. To mitigate these two issues, we propose a Transfer and Alignment Network (TAN), which incorporates two knowledge transfer mechanisms to calibrate the biased knowledge and two feature alignment mechanisms to learn discriminative features. Specifically, we model different categories with prototypes and transfer the prototypes in labeled data to correct model bias towards known categories. On the one hand, we pull instances with known categories in unlabeled data closer to these prototypes to form more compact clusters and avoid boundary overlap between known and novel categories. On the other hand, we use these prototypes to calibrate noisy prototypes estimated from unlabeled data based on category similarities, which allows for more accurate estimation of prototypes for novel categories that can be used as reliable learning targets later. After knowledge transfer, we further propose two feature alignment mechanisms to acquire both instance- and category-level knowledge from unlabeled data by aligning instance features with both augmented features and the calibrated prototypes, which can boost model performance on both known and novel categories with less noise. Experiments on three benchmark datasets show that our model outperforms SOTA methods, especially on novel categories. Theoretical analysis is provided for an in-depth understanding of our model in general. Our code and data are available at https://github.com/Lackel/TAN.



Paperid:1206
Authors:Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences Objecteye Inc., Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences Objecteye Inc. Wuhan AI Research
Abstract:
Network Pruning is a promising way to address the huge computing resource demands of the deployment and inference of Large Language Models (LLMs). Being retraining-free is important for LLM pruning methods. However, almost all of the existing retraining-free pruning approaches for LLMs focus on unstructured pruning, which requires specific hardware support for acceleration. In this paper, we propose a novel retraining-free structured pruning framework for LLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is hardware-friendly by effectively reducing storage and enhancing inference speed. For effective structured pruning of LLMs, we highlight three critical elements that demand the utmost attention: formulating structured importance metrics, adaptively searching the global compressed model, and implementing compensation mechanisms to mitigate performance loss. First, FLAP determines whether the output feature map is easily recoverable when a column of weight is removed, based on the fluctuation pruning metric. Then it standardizes the importance scores to adaptively determine the global compressed model structure. Finally, FLAP adds additional bias terms to recover the output feature maps using the baseline values. We thoroughly evaluate our approach on a variety of language benchmarks. Without any retraining, our method significantly outperforms the state-of-the-art methods, including LLM-Pruner and the extension of Wanda in structured pruning. The code is released at https://github.com/CASIA-IVA-Lab/FLAP.
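The three ingredients named in the abstract (a fluctuation-based importance metric, score standardization, and bias-based compensation) can be pictured with the simplified numpy sketch below. The variance-only metric and the helper names are my own simplifications for illustration, not FLAP's exact formulas.

import numpy as np

def fluctuation_scores(feats):
    """feats: (num_samples, num_channels) calibration activations feeding a layer.
    Channels whose activations barely fluctuate around their mean are easier to
    remove and compensate with a constant bias."""
    return ((feats - feats.mean(axis=0, keepdims=True)) ** 2).mean(axis=0)

def standardize(scores):
    # Put per-layer scores on a comparable scale so one global budget can be applied.
    return (scores - scores.mean()) / (scores.std() + 1e-8)

def bias_compensation(weight, feats, pruned_idx):
    """weight: (out, in) matrix. Fold the mean activation of pruned input channels
    into an additive bias so the layer's expected output is preserved."""
    baseline = feats[:, pruned_idx].mean(axis=0)
    return weight[:, pruned_idx] @ baseline    # add this vector to the layer bias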



Paperid:1207
Authors:Yunpyo An, Suyeong Park, Kwang In Kim
UNIST, UNIST, POSTECH
Abstract:
Retraining a deep learning model each time a single data point receives a new label is impractical due to the inherent complexity of the training process. Consequently, existing active learning (AL) algorithms tend to adopt a batch-based approach where, during each AL iteration, a set of data points is collectively chosen for annotation. However, this strategy frequently leads to redundant sampling, ultimately eroding the efficacy of the labeling procedure. In this paper, we introduce a new AL algorithm that harnesses the power of a Gaussian process surrogate in conjunction with the neural network principal learner. Our proposed model adeptly updates the surrogate learner for every new data instance, enabling it to emulate and capitalize on the continuous learning dynamics of the neural network without necessitating a complete retraining of the principal model for each individual label. Experiments on four benchmark datasets demonstrate that this approach yields significant enhancements, either rivaling or aligning with the performance of state-of-the-art techniques.
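A minimal sketch of the surrogate step, assuming an RBF-kernel Gaussian process over the principal network's feature space (a modeling choice of mine for illustration): the GP is refit after every newly labeled point, which is cheap relative to retraining the network, and its predictive uncertainty ranks the unlabeled pool.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def select_next(pool_feats, labeled_feats, labeled_targets):
    """Refit the GP surrogate on the labeled features and return the index of
    the most uncertain pool point as the next annotation query."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-4)
    gp.fit(labeled_feats, labeled_targets)
    _, std = gp.predict(pool_feats, return_std=True)
    return int(np.argmax(std))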



Paperid:1208
Authors:Ziyan An, Taylor T. Johnson, Meiyi Ma
Vanderbilt University, Vanderbilt University, Vanderbilt University
Abstract:
Recent advancements in federated learning (FL) have greatly facilitated the development of decentralized collaborative applications, particularly in the domain of Artificial Intelligence of Things (AIoT). However, a critical aspect missing from the current research landscape is the ability to enable data-driven client models with symbolic reasoning capabilities. Specifically, the inherent heterogeneity of participating client devices poses a significant challenge, as each client exhibits unique logic reasoning properties. Failing to consider these device-specific specifications can result in critical properties being missed in the client predictions, leading to suboptimal performance. In this work, we propose a new training paradigm that leverages temporal logic reasoning to address this issue. Our approach involves enhancing the training process by incorporating mechanically generated logic expressions for each FL client. Additionally, we introduce the concept of aggregation clusters and develop a partitioning algorithm to effectively group clients based on the alignment of their temporal reasoning properties. We evaluate the proposed method on two tasks: a real-world traffic volume prediction task consisting of sensory data from fifteen states and a smart city multi-task prediction utilizing synthetic data. The evaluation results exhibit clear improvements, with performance accuracy improved by up to 54% across all sequential prediction models.



Paperid:1209
Authors:Gautham Anil, Vishnu Vinod, Apurva Narayan
Indian Institute of Technology Madras, Indian Institute of Technology Madras, University of Western Ontario University of British Columbia University of Waterloo
Abstract:
Quantum Machine Learning (QML) has emerged as a promising field of research, aiming to leverage the capabilities of quantum computing to enhance existing machine learning methodologies. Recent studies have revealed that, like their classical counterparts, QML models based on Parametrized Quantum Circuits (PQCs) are also vulnerable to adversarial attacks. Moreover, the existence of Universal Adversarial Perturbations (UAPs) in the quantum domain has been demonstrated theoretically in the context of quantum classifiers. In this work, we introduce QuGAP: a novel framework for generating UAPs for quantum classifiers. We conceptualize the notion of additive UAPs for PQC-based classifiers and theoretically demonstrate their existence. We then utilize generative models (QuGAP-A) to craft additive UAPs and experimentally show that quantum classifiers are susceptible to such attacks. Moreover, we formulate a new method for generating unitary UAPs (QuGAP-U) using quantum generative models and a novel loss function based on fidelity constraints. We evaluate the performance of the proposed framework and show that our method achieves state-of-the-art misclassification rates, while maintaining high fidelity between legitimate and adversarial samples.



Paperid:1210
Authors:Srinivas Anumasa, Bhaskar Mukhoty, Velibor Bojkovic, Giulia De Masi, Huan Xiong, Bin Gu
Mohamed bin Zayed University of Artificial Intelligence, UAE, Mohamed bin Zayed University of Artificial Intelligence, UAE, Mohamed bin Zayed University of Artificial Intelligence, UAE, ARRC, Technology Innovation Institute, UAE BioRobotics Institute, Sant’Anna School of Advanced Studies Pisa, Italy, Mohamed bin Zayed University of Artificial Intelligence, UAE Harbin Institute of Technology, China, Mohamed bin Zayed University of Artificial Intelligence, UAE School of Artificial Intelligence, Jilin University, China
Abstract:
Spiking neural networks (SNNs) have garnered significant attention for their low power consumption when deployed on neuromorphic hardware that operates at orders of magnitude lower power than general-purpose hardware. Direct training methods for SNNs come with an inherent latency for which the SNNs are optimized, and in general, the higher the latency, the better the predictive powers of the models, but at the same time, the higher the energy consumption during training and inference. Furthermore, an SNN model optimized for one particular latency does not necessarily perform well in lower latencies, which becomes relevant in scenarios where it is necessary to switch to a lower latency because of the depletion of onboard energy or other operational requirements. In this work, we propose Stochastic Latency Training (SLT), a direct training method for SNNs that optimizes the model for the given latency while simultaneously minimizing the reduction in predictive accuracy when shifted to lower inference latencies. We provide heuristics for our approach with partial theoretical justification and experimental evidence showing the state-of-the-art performance of our models on datasets such as CIFAR-10, DVS-CIFAR-10, CIFAR-100, and DVS-Gesture. Our code is available at https://github.com/srinuvaasu/SLT



Paperid:1211
Authors:Caridad Arroyo Arevalo, Sayedeh Leila Noorbakhsh, Yun Dong, Yuan Hong, Binghui Wang
Illinois Institute of Technology, Illinois Institute of Technology, Benedictine University, University of Connecticut, Illinois Institute of Technology
Abstract:
Federated learning (FL) has been widely studied recently due to its ability to collaboratively train models on data from different devices without sharing the raw data. Nevertheless, recent studies show that it may still be possible for an adversary to infer private information about devices' data, e.g., sensitive attributes such as income, race, and sexual orientation. To mitigate attribute inference attacks, various existing privacy-preserving FL methods can be adopted/adapted. However, all these existing methods have key limitations: they need to know the FL task in advance, or have intolerable computational overheads or utility losses, or do not have provable privacy guarantees. We address these issues and design a task-agnostic privacy-preserving representation learning method for FL (TAPPFL) against attribute inference attacks. TAPPFL is formulated via information theory. Specifically, TAPPFL has two mutual information goals, where one goal learns task-agnostic data representations that contain the least information about the private attribute in each device's data, and the other goal ensures the learnt data representations include as much information as possible about the device data to maintain FL utility. We also derive privacy guarantees of TAPPFL against worst-case attribute inference attacks, as well as the inherent tradeoff between utility preservation and privacy protection. Extensive results on multiple datasets and applications validate the effectiveness of TAPPFL in protecting data privacy, maintaining FL utility, and being efficient as well. Experimental results also show that TAPPFL outperforms the existing defenses.



Paperid:1212
Authors:Shivvrat Arya, Tahrima Rahman, Vibhav Gogate
The University of Texas at Dallas, The University of Texas at Dallas, The University of Texas at Dallas
Abstract:
Probabilistic circuits (PCs) such as sum-product networks efficiently represent large multi-variate probability distributions. They are preferred in practice over other probabilistic representations, such as Bayesian and Markov networks, because PCs can solve marginal inference (MAR) tasks in time that scales linearly in the size of the network. Unfortunately, the most probable explanation (MPE) task and its generalization, the marginal maximum-a-posteriori (MMAP) inference task, remain NP-hard in these models. Inspired by the recent work on using neural networks for generating near-optimal solutions to optimization problems such as integer linear programming, we propose an approach that uses neural networks to approximate MMAP inference in PCs. The key idea in our approach is to approximate the cost of an assignment to the query variables using a continuous multilinear function and then use the latter as a loss function. The two main benefits of our new method are that it is self-supervised, and after the neural network is learned, it requires only linear time to output a solution. We evaluate our new approach on several benchmark datasets and show that it outperforms three competing linear time approximations: max-product inference, max-marginal inference, and sequential estimation, which are used in practice to solve MMAP tasks in PCs.



Paperid:1213
Authors:Vedang Asgaonkar, Aditya Jain, Abir De
IIT Bombay, IIT Bombay, IIT Bombay
Abstract:
Given a set of observations, feature acquisition is about finding the subset of unobserved features which would enhance accuracy. Such problems have been explored in a sequential setting in prior work, where the model receives feedback from every newly acquired feature and chooses to explore more features or to predict. However, sequential acquisition is not feasible in some settings where time is of the essence. We consider the problem of feature acquisition in batch, where the subset of features to be queried in batch is chosen based on the currently observed features, and then acquired as a batch, followed by prediction. We solve this problem using several technical innovations. First, we use a feature generator to draw a subset of synthetic features for some examples, which reduces the cost of oracle queries. Second, to make the feature acquisition problem tractable for large, heterogeneous observed features, we partition the data into buckets, by borrowing tools from locality sensitive hashing, and then train a mixture-of-experts model. Third, we design a tractable lower bound of the original objective. We use a greedy algorithm combined with model training to solve the underlying problem. Experiments with four datasets show that our approach outperforms existing methods in terms of the trade-off between accuracy and feature acquisition cost.



Paperid:1214
Authors:Johannes Aspman, Georgios Korpas, Jakub Marecek
Czech Technical University, HSBC Czech Technical University, Czech Technical University
Abstract:
There has been a great deal of recent interest in binarized neural networks, especially because of their explainability. At the same time, automatic differentiation algorithms such as backpropagation fail for binarized neural networks, which limits their applicability. We show that binarized neural networks admit a tame representation by reformulating the problem of training binarized neural networks as a subadditive dual of a mixed-integer program, which we show to have nice properties. This makes it possible to use the framework of Bolte et al. for implicit differentiation, which offers the possibility for practical implementation of backpropagation in the context of binarized neural networks. This approach could also be used for a broader class of mixed-integer programs, beyond the training of binarized neural networks, as encountered in symbolic approaches to AI and beyond.



Paperid:1215
Authors:Alexia Atsidakou, Constantine Caramanis, Evangelia Gergatsouli, Orestis Papadigenopoulos, Christos Tzamos
University of Texas - Austin, University of Texas - Austin, University of Wisconsin - Madison, Columbia University, University of Wisconsin - Madison University of Athens
Abstract:
Pandora’s Box is a fundamental stochastic optimization problem, where the decision-maker must find a good alternative, while minimizing the search cost of exploring the value of each alternative. In the original formulation, it is assumed that accurate distributions are given for the values of all the alternatives, while recent work studies the online variant of Pandora’s Box where the distributions are originally unknown. In this work, we study Pandora’s Box in the online setting, while incorporating context. At each round, we are presented with a number of alternatives each having a context, an exploration cost and an unknown value drawn from an unknown distribution that may change at every round. Our main result is a no-regret algorithm that performs comparably well against the optimal algorithm which knows all prior distributions exactly. Our algorithm works even in the bandit setting where the algorithm never learns the values of the alternatives that were not explored. The key technique that enables our result is a novel modification of the realizability condition in contextual bandits that connects a context to a sufficient statistic of each alternative’s distribution (its reservation value) rather than its mean.



Paperid:1216
Authors:Guy Azran, Mohamad H. Danesh, Stefano V. Albrecht, Sarah Keren
Technion - Israel Institute of Technology, McGill University, University of Edinburgh, Technion - Israel Institute of Technology
Abstract:
Recent studies show that deep reinforcement learning (DRL) agents tend to overfit to the task on which they were trained and fail to adapt to minor environment changes. To expedite learning when transferring to unseen tasks, we propose a novel approach to representing the current task using reward machines (RMs), state machine abstractions that induce subtasks based on the current task’s rewards and dynamics. Our method provides agents with symbolic representations of optimal transitions from their current abstract state and rewards them for achieving these transitions. These representations are shared across tasks, allowing agents to exploit knowledge of previously encountered symbols and transitions, thus enhancing transfer. Empirical results show that our representations improve sample efficiency and few-shot transfer in a variety of domains.



Paperid:1217
Authors:Maryam Badar, Sandipan Sikdar, Wolfgang Nejdl, Marco Fisichella
L3S Research Center, Leibniz University Hannover, Germany, L3S Research Center, Leibniz University Hannover, Germany, L3S Research Center, Leibniz University Hannover, Germany, L3S Research Center, Leibniz University Hannover, Germany
Abstract:
As Federated Learning (FL) gains prominence in distributed machine learning applications, achieving fairness without compromising predictive performance becomes paramount. The data gathered from distributed clients in an FL environment often leads to class imbalance. In such scenarios, balanced accuracy rather than accuracy is the true representation of model performance. However, most state-of-the-art fair FL methods report accuracy as the measure of performance, which can lead to misguided interpretations of the model's effectiveness in mitigating discrimination. To the best of our knowledge, this work presents the first attempt towards achieving Pareto-optimal trade-offs between balanced accuracy and fairness in a federated environment (FairTrade). By utilizing multi-objective optimization, the framework negotiates the intricate balance between the model's balanced accuracy and fairness. The framework's agnostic design adeptly accommodates both statistical and causal fairness notions, ensuring its adaptability across diverse FL contexts. We provide empirical evidence of our framework's efficacy through extensive experiments on five real-world datasets and comparisons with six baselines. The empirical results underscore the potential of our framework in improving the trade-off between fairness and balanced accuracy in FL applications.



Paperid:1218
Authors:Jianhong Bai, Yuchen Yang, Huanpeng Chu, Hualiang Wang, Zuozhu Liu, Ruizhe Chen, Xiaoxuan He, Lianrui Mu, Chengfei Cai, Haoji Hu
Zhejiang University, Zhejiang University, Kuaishou Technology, The Hong Kong University of Science and Technology, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Tencent Data Platform, Zhejiang University
Abstract:
Quantization has emerged as a promising direction for model compression. Recently, data-free quantization, which synthesizes images as an alternative to real training data, has been widely studied as a promising way to avoid privacy concerns. Existing methods use classification loss to ensure the reliability of the synthesized images. Unfortunately, even if these images are well-classified by the pre-trained model, they still suffer from low semantics and homogenization issues. Intuitively, these low-semantic images are sensitive to perturbations, and the pre-trained model tends to have inconsistent output when the generator synthesizes an image with low semantics. To this end, we propose Robustness-Guided Image Synthesis (RIS), a simple but effective method to enrich the semantics of synthetic images and improve image diversity, further boosting the performance of data-free compression tasks. Concretely, we first introduce perturbations on the input and model weights, then define the inconsistency metrics at feature and prediction levels before and after perturbations. On the basis of the inconsistency at both levels, we design a robustness optimization objective to eliminate low-semantic images. Moreover, we also make our approach diversity-aware by forcing the generator to synthesize images with small correlations. With RIS, we achieve state-of-the-art performance for various settings on data-free quantization and can be extended to other data-free compression tasks.
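To make the robustness criterion above concrete, here is a minimal sketch (PyTorch) of how feature- and prediction-level inconsistency under an input perturbation could be scored; the model interface, perturbation scale, and equal weighting of the two terms are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def robustness_inconsistency(model, x, eps=0.05):
        # Large values indicate synthetic images that are sensitive to small
        # perturbations, i.e. the low-semantic images the objective eliminates.
        # `model` is assumed (hypothetically) to return (features, logits).
        x_pert = x + eps * torch.randn_like(x)            # perturbed copy of the batch
        feat, logits = model(x)
        feat_p, logits_p = model(x_pert)
        feat_incons = F.mse_loss(feat_p, feat)            # feature-level inconsistency
        pred_incons = F.kl_div(F.log_softmax(logits_p, dim=-1),
                               F.softmax(logits, dim=-1),
                               reduction="batchmean")     # prediction-level inconsistency
        return feat_incons + pred_incons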



Paperid:1219
Authors:Qinbo Bai, Washim Uddin Mondal, Vaneet Aggarwal
Purdue University, Purdue University, Purdue University
Abstract:
In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Distinguishing itself from existing works within this context, our approach harnesses the power of the general policy gradient-based algorithm, liberating it from the constraints of assuming a linear MDP structure. We propose a vanilla policy gradient-based algorithm and show its global convergence property. We then prove that the proposed algorithm has O(T^3/4) regret. Remarkably, this paper marks a pioneering effort by presenting the first exploration into regret bound computation for the general parameterized policy gradient algorithm in the context of average reward scenarios.



Paperid:1220
Authors:Sikai Bai, Shuaicheng Li, Weiming Zhuang, Jie Zhang, Kunlin Yang, Jun Hou, Shuai Yi, Shuai Zhang, Junyu Gao
The Hong Kong University of Science and Technology, SenseTime Research, Sony AI, The Hong Kong Polytechnic University, SenseTime Research, SenseTime Research, SenseTime Research, SenseTime Research, Northwestern Polytechnical University
Abstract:
Federated learning has become a popular method to learn from decentralized heterogeneous data. Federated semi-supervised learning (FSSL) emerges to train models from a small fraction of labeled data due to label scarcity on decentralized clients. Existing FSSL methods assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled and unlabeled data within a client. This work studies a more practical and challenging scenario of FSSL, where data distribution is different not only across clients but also within a client between labeled and unlabeled data. To address this challenge, we propose a novel FSSL framework with dual regulators, FedDure. FedDure lifts the previous assumption with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of the local model by tracking the learning effect on labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate the client model training as bi-level optimization that adaptively optimizes the model in the client with two regulators. Theoretically, we show the convergence guarantee of the dual regulators. Empirically, we demonstrate that FedDure is superior to the existing methods across a wide range of settings, notably by more than 11% on CIFAR-10 and CINIC-10 datasets.



Paperid:1221
Authors:Malyaban Bal, Abhronil Sengupta
The Pennsylvania State University, The Pennsylvania State University
Abstract:
Large language models (LLMs), though growing exceedingly powerful, comprise orders of magnitude fewer neurons and synapses than the human brain, yet they require significantly more power/energy to operate. In this work, we propose a novel bio-inspired spiking language model (LM) which aims to reduce the computational cost of conventional LMs by drawing motivation from the synaptic information flow in the brain. In this paper, we demonstrate a framework that leverages the average spiking rate of neurons at equilibrium to train a neuromorphic spiking LM using an implicit differentiation technique, thereby overcoming the non-differentiability problem of spiking neural network (SNN) based algorithms without using any type of surrogate gradient. The steady-state convergence of the spiking neurons also allows us to design a spiking attention mechanism, which is critical in developing a scalable spiking LM. Moreover, the convergence of the average spiking rate of neurons at equilibrium is utilized to develop a novel ANN-SNN knowledge distillation based technique wherein we use a pre-trained BERT model as “teacher” to train our “student” spiking architecture. While the primary architecture proposed in this paper is motivated by BERT, the technique can be potentially extended to different kinds of LLMs. Our work is the first to demonstrate the performance of an operational spiking LM architecture on multiple different tasks in the GLUE benchmark. Our implementation source code is available at https://github.com/NeuroCompLab-psu/SpikingBERT.



Paperid:1222
Authors:Wei-Xuan Bao, Yong Rui, Min-Ling Zhang
School of Computer Science and Engineering, Southeast University, Nanjing, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China, Lenovo Research, Lenovo Group Ltd., Beijing, China, School of Computer Science and Engineering, Southeast University, Nanjing, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China
Abstract:
Partial label learning (PLL) induces a multi-class classifier from training examples each associated with a set of candidate labels, among which only one is valid. The formation of real-world data typically arises from the heterogeneous entanglement of a series of latent explanatory factors, which are considered intrinsic properties for discriminating between different patterns. Though learning disentangled representation is expected to facilitate label disambiguation for partial-label (PL) examples, few existing works were dedicated to addressing this issue. In this paper, we make the first attempt towards disentangled PLL and propose a novel approach named TERIAL, which makes predictions according to derived disentangled representation of instances and label embeddings. The TERIAL approach formulates the PL examples as an undirected bipartite graph where instances are only connected with their candidate labels, and employs a tailored neighborhood routing mechanism to yield disentangled representation of nodes in the graph. Specifically, the proposed routing mechanism progressively infers the explanatory factors that contribute to the edge between adjacent nodes and augments the representation of the central node with factor-aware embedding information propagated from specific neighbors simultaneously via iteratively analyzing the promising subspace clusters formed by the node and its neighbors. The estimated labeling confidence matrix is also introduced to accommodate unreliable links owing to the inherent ambiguity of PLL. Moreover, we theoretically prove that the neighborhood routing mechanism will converge to the point estimate that maximizes the marginal likelihood of observed PL training examples. Comprehensive experiments over various datasets demonstrate that our approach outperforms the state-of-the-art counterparts.



Paperid:1223
Authors:Yanhao Bao, Tatsukichi Shibuya, Ikuro Sato, Rei Kawakami, Nakamasa Inoue
Tokyo Institute of Technology, Tokyo Institute of Technology, Tokyo Institute of Technology Denso IT Laboratory, Tokyo Institute of Technology, Tokyo Institute of Technology
Abstract:
Exploring biologically plausible algorithms as alternatives to error backpropagation (BP) is a challenging research topic in artificial intelligence. It also provides insights into the brain's learning methods. Recently, when combined with well-designed feedback loss functions such as Local Difference Reconstruction Loss (LDRL) and through hierarchical training of feedback pathway synaptic weights, Target Propagation (TP) has achieved performance comparable to BP in image classification tasks. However, with an increase in the number of network layers, the tuning and training cost of feedback weights escalates. Drawing inspiration from the work of Ernoult et al., we propose a training method that seeks the optimal solution for feedback weights. This method enhances the efficiency of feedback training by analytically minimizing feedback loss, allowing the feedback layer to skip certain local training iterations. More specifically, we introduce the Jacobian matching loss (JML) for feedback training. We also proactively implement layers designed to derive analytical solutions that minimize JML. Through experiments, we have validated the effectiveness of this approach. Using the CIFAR-10 dataset, our method showcases accuracy levels comparable to state-of-the-art TP methods. Furthermore, we have explored its effectiveness in more intricate network architectures.



Paperid:1224
Authors:Samyadeep Basu, Shell Hu, Daniela Massiceti, Soheil Feizi
University of Maryland, Samsung Research, Microsoft Research, University of Maryland
Abstract:
Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pre-training (or meta-training) phase on a set of base classes. Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC. Fine-tuning ViTs, however, is expensive in time, compute and storage. This has motivated the design of parameter efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters. While these methods have shown promise, inconsistencies in experimental conditions make it difficult to disentangle their advantage from other experimental factors including the feature extractor architecture, pre-trained initialization and fine-tuning algorithm, amongst others. In our paper, we conduct a large-scale, experimentally consistent, empirical analysis to study PEFTs for few-shot image classification. Through a battery of over 1.8k controlled experiments on large-scale few-shot benchmarks including Meta-Dataset and ORBIT, we uncover novel insights on PEFTs that cast light on their efficacy in fine-tuning ViTs for few-shot classification. Through our controlled empirical study, we have two main findings: (i) Fine-tuning just the LayerNorm parameters (which we call LN-Tune) during few-shot adaptation is an extremely strong baseline across ViTs pre-trained with both self-supervised and supervised objectives, (ii) For self-supervised ViTs, we find that simply learning a set of scaling parameters for each attention matrix (which we call Attn-Scale) along with a domain-residual adapter (DRA) module leads to state-of-the-art performance (while being ~9x more parameter-efficient) on Meta-Dataset. Our empirical findings set strong baselines and call for rethinking the current design of PEFT methods for FSC.



Paperid:1225
Authors:Maya Bechler-Speicher, Amir Globerson, Ran Gilad-Bachrach
Blavatnik School of Computer Science, Tel-Aviv University, Blavatnik School of Computer Science, Tel-Aviv University, Department of Bio-Medical Engineering and Edmond J. Safra Center for Bioinformatics,Tel-Aviv University
Abstract:
When dealing with tabular data, models based on decision trees are a popular choice due to their high accuracy on these data types, their ease of application, and explainability properties. However, when it comes to graph-structured data, it is not clear how to apply them effectively, in a way that incorporates the topological information with the tabular data available on the vertices of the graph. To address this challenge, we introduce TREE-G. TREE-G modifies standard decision trees, by introducing a novel split function that is specialized for graph data. Not only does this split function incorporate the node features and the topological information, but it also uses a novel pointer mechanism that allows split nodes to use information computed in previous splits. Therefore, the split function adapts to the predictive task and the graph at hand. We analyze the theoretical properties of TREE-G and demonstrate its benefits empirically on multiple graph and vertex prediction benchmarks. In these experiments, TREE-G consistently outperforms other tree-based models and often outperforms other graph-learning algorithms such as Graph Neural Networks (GNNs) and Graph Kernels, sometimes by large margins. Moreover, TREE-G's models and their predictions can be explained and visualized.



Paperid:1226
Authors:Alexis Bellot, Junzhe Zhang, Elias Bareinboim
Google DeepMind, Columbia University, Columbia University
Abstract:
Structural learning is arguably one of the most challenging and pervasive tasks found throughout the data sciences. There exists a growing literature that studies structural learning in nonparametric settings where conditional independence constraints are taken to define the equivalence class. In the presence of unobserved confounders, it is understood that non-conditional independence constraints are imposed over the observational distribution, including certain equalities and inequalities between functionals of the joint distribution. In this paper, we develop structural learning methods that leverage additional constraints beyond conditional independences. Specifically, we first introduce a score for arbitrary graphs combining Watanabe's asymptotic expansion of the marginal likelihood and new bounds over the cardinality of the exogenous variables. Second, we show that the new score has desirable properties in terms of expressiveness and computability. In terms of expressiveness, we prove that the score captures distinct constraints imprinted in the data, including Verma's constraints and inequalities. In terms of computability, we show properties of score equivalence and decomposability, which allow us, in principle, to break the problem of structural learning into smaller and more manageable pieces. Third, we implement this score using an MCMC sampling algorithm and test its properties in several simulation scenarios.



Paperid:1227
Authors:Yakir Berchenko
Ben-Gurion University of the Negev, Department of Industrial Engineering and Management
Abstract:
A thorough theoretical understanding of the surprising generalization ability of deep networks (and other overparameterized models) is still lacking. Here we demonstrate that simplicity bias is a major phenomenon to be reckoned with in overparameterized machine learning. In addition to explaining the outcome of simplicity bias, we also study its source: following concrete rigorous examples, we argue that (i) simplicity bias can explain generalization in overparameterized learning models such as neural networks; (ii) simplicity bias and excellent generalization are optimizer-independent, as our example shows, and although the optimizer affects training, it is not the driving force behind simplicity bias; (iii) simplicity bias in pre-training models, and subsequent posteriors, is universal and stems from the subtle fact that uniformly-at-random constructed priors are not uniformly-at-random sampled; and (iv) in neural network models, the biasing mechanism in wide (and shallow) networks is different from the biasing mechanism in deep (and narrow) networks.



Paperid:1228
Authors:Artem Betlei, Mariia Vladimirova, Mehdi Sebbar, Nicolas Urien, Thibaud Rahier, Benjamin Heymann
Criteo AI Lab, France, Criteo AI Lab, France, Criteo Ad Landscape, France, Criteo Ad Landscape, France, Criteo AI Lab, France, Criteo AI Lab, France
Abstract:
The effectiveness of advertising in e-commerce largely depends on the ability of merchants to bid on and win impressions for their targeted users. The bidding procedure is highly complex due to various factors such as market competition, user behavior, and the diverse objectives of advertisers. In this paper we consider the problem at the level of user timelines instead of individual bid requests, manipulating full policies (i.e. pre-defined bidding strategies) and not bid values. In order to optimally allocate policies to users, typical multiple treatments allocation methods solve knapsack-like problems which aim at maximizing an expected value under constraints. In the specific context of online advertising, we argue that optimizing for the probability of success is a more suited objective than expected value maximization, and we introduce the SuccessProbaMax algorithm that aims at finding the policy allocation which is the most likely to outperform a fixed reference policy. Finally, we conduct comprehensive experiments both on synthetic and real-world data to evaluate its performance. The results demonstrate that our proposed algorithm outperforms conventional expected-value maximization algorithms in terms of success rate.



Paperid:1229
Authors:Aritra Bhowmick, Mert Kosan, Zexi Huang, Ambuj Singh, Sourav Medya
New York University, University of California, Santa Barbara, University of California, Santa Barbara, University of California, Santa Barbara, University of Illinois, Chicago
Abstract:
Graph clustering is a fundamental and challenging task in the field of graph mining where the objective is to group the nodes into clusters taking into consideration the topology of the graph. It has several applications in diverse domains spanning social network analysis, recommender systems, computer vision, and bioinformatics. In this work, we propose a novel method, DGCluster, which primarily optimizes the modularity objective using graph neural networks and scales linearly with the graph size. Our method does not require the number of clusters to be specified as a part of the input and can also leverage the availability of auxiliary node-level information. We extensively test DGCluster on several real-world datasets of varying sizes, across multiple popular cluster quality metrics. Our approach consistently outperforms the state-of-the-art methods, demonstrating significant performance gains in almost all settings.
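As a rough illustration of optimizing modularity with soft cluster assignments produced by a GNN, the sketch below scores a dense adjacency matrix against a soft membership matrix; the actual DGCluster objective, its sparse implementation, and the auxiliary-information terms may differ, and `soft_assign` is a hypothetical row-stochastic GNN output.

    import torch

    def soft_modularity_loss(adj, soft_assign):
        # adj: (n, n) dense adjacency; soft_assign: (n, k) soft cluster memberships
        deg = adj.sum(dim=1, keepdim=True)                # node degrees
        two_m = adj.sum()                                 # total degree (2m)
        expected = deg @ deg.t() / two_m                  # configuration-model expectation
        b = adj - expected                                # modularity matrix B
        q = torch.trace(soft_assign.t() @ b @ soft_assign) / two_m
        return -q                                         # minimize -Q to maximize modularity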



Paperid:1230
Authors:Xiao-Dong Bi, Shao-Qun Zhang, Yuan Jiang
Nanjing University, Nanjing University, Nanjing University
Abstract:
Ensemble pruning that combines a subset of individual learners generated in parallel to make predictions is an important topic in ensemble learning. Past decades have seen the development of many pruning algorithms that focus on the external behavior of learners on samples, which may lead to overfitting. In this paper, we conjecture that the generalization performance of an ensemble is not only related to its external behavior on samples but also dependent on the internal structure of individual learners. We propose the general MEPSI approach based on Kolmogorov complexity and the Minimum Description Length (MDL) principle, which formulates the ensemble pruning task as a two-objective optimization problem that comprises the empirical error and structural information among individual learners. We also provide a concrete implementation of MEPSI on decision trees. The theoretical results provide generalization bounds for both the general MEPSI approach and the tree-based implementation. The comparative experiments conducted on multiple real-world data sets demonstrate the effectiveness of our proposed method.



Paperid:1231
Authors:Cheng Bian, Xiaoyu Li, Qi Bi, Guangpu Zhu, Jiegeng Lyu, Weile Zhang, Yelei Li, Zijing Zeng
OPPO Health Lab, OPPO Health Lab, OPPO Health Lab, OPPO Health Lab, OPPO Health Lab, OPPO Health Lab, OPPO Health Lab, OPPO Health Lab
Abstract:
Arterial blood pressure (ABP) holds substantial promise for proactive cardiovascular health management. Notwithstanding its potential, the invasive nature of ABP measurements confines their utility primarily to clinical environments, limiting their applicability for continuous monitoring beyond medical facilities. The conversion of photoplethysmography (PPG) signals into ABP equivalents has garnered significant attention due to its potential in revolutionizing cardiovascular disease management. Recent strides in PPG-to-ABP prediction encompass the integration of generative and discriminative models. Despite these advances, the efficacy of these models is curtailed by the latent space shift predicament, stemming from alterations in PPG data distribution across disparate hardware and individuals, potentially leading to distorted ABP waveforms. To tackle this problem, we present an innovative solution named the Latent Space Constraint Transformer (LSCT), leveraging a quantized codebook to yield robust latent spaces by employing multiple discretizing bases. To facilitate improved reconstruction, the Correlation-boosted Attention Module (CAM) is introduced to systematically query pertinent bases on a global scale. Furthermore, to enhance expressive capacity, we propose the Multi-Spectrum Enhancement Knowledge (MSEK), which fosters local information flow within the channels of latent code and provides additional embedding for reconstruction. Through comprehensive experimentation on both publicly available datasets and a private downstream task dataset, the proposed approach demonstrates noteworthy performance enhancements compared to existing methods. Extensive ablation studies further substantiate the effectiveness of each introduced module.



Paperid:1232
Authors:Gagan Biradar, Yacine Izza, Elita Lobo, Vignesh Viswanathan, Yair Zick
University of Massachusetts Amherst, National University of Singapore, University of Massachusetts Amherst, University of Massachusetts Amherst, University of Massachusetts Amherst
Abstract:
The recent criticisms of the robustness of post hoc model approximation explanation methods (like LIME and SHAP) have led to the rise of model-precise abductive explanations. For each data point, abductive explanations provide a minimal subset of features that are sufficient to generate the outcome. While theoretically sound and rigorous, abductive explanations suffer from a major issue --- there can be several valid abductive explanations for the same data point. In such cases, providing a single abductive explanation can be insufficient; on the other hand, providing all valid abductive explanations can be incomprehensible due to their size. In this work, we solve this issue by aggregating the many possible abductive explanations into feature importance scores. We propose three aggregation methods: two based on power indices from cooperative game theory and a third based on a well-known measure of causal strength. We characterize these three methods axiomatically, showing that each of them uniquely satisfies a set of desirable properties. We also evaluate them on multiple datasets and show that these explanations are robust to the attacks that fool SHAP and LIME.
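As a purely illustrative baseline for the aggregation step (not the axiomatically characterized power-index or causal-strength aggregators studied in the paper), one can score each feature by how often it appears across the set of valid abductive explanations:

    from collections import Counter

    def frequency_importance(abductive_explanations, n_features):
        # abductive_explanations: list of sets of feature indices, each a minimal
        # sufficient subset for the same prediction on the same data point
        counts = Counter(f for expl in abductive_explanations for f in expl)
        total = len(abductive_explanations)
        return [counts.get(j, 0) / total for j in range(n_features)]

    # e.g. three explanations over four features
    print(frequency_importance([{0, 2}, {0, 1}, {0, 3}], 4))  # [1.0, 0.33.., 0.33.., 0.33..]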



Paperid:1233
Authors:Jacopo Bonato, Francesco Pelosin, Luigi Sabetta, Alessandro Nicolosi
Leonardo Labs, Leonardo Labs Covision Lab, Leonardo Labs, Leonardo Labs
Abstract:
The recent surge of pervasive devices that generate dynamic data streams has underscored the necessity for learning systems to adapt continually to data distributional shifts. To tackle this challenge, the research community has put forth a spectrum of methodologies, including the demanding pursuit of class-incremental learning without replay data. In this study, we present MIND, a parameter isolation method that aims to significantly enhance the performance of replay-free solutions and achieve state-of-the-art results on several widely studied datasets. Our approach introduces two main contributions: two alternative distillation procedures that significantly improve the efficiency of MIND by increasing the accumulated knowledge of each sub-network, and the optimization of the BatchNorm layers across tasks inside the sub-networks. Overall, MIND outperforms all the state-of-the-art methods for rehearsal-free Class-Incremental learning (with an increment in classification accuracy of approx. +6% on CIFAR-100/10 and +10% on TinyImageNet/10) reaching up to approx. +40% accuracy in Domain-Incremental scenarios. Moreover, we ablated each contribution to demonstrate its impact on performance improvement. Our results showcase the superior performance of MIND, indicating its potential for addressing the challenges posed by Class-Incremental and Domain-Incremental learning in resource-constrained environments.



Paperid:1234
Authors:David Bonet, Daniel Mas Montserrat, Xavier Giró-i-Nieto, Alexander G. Ioannidis
Stanford University, Stanford, CA, USA Universitat Politècnica de Catalunya, Barcelona, Spain, Stanford University, Stanford, CA, USA, Amazon, Barcelona, Spain, Stanford University, Stanford, CA, USA
Abstract:
Training deep learning models and performing hyperparameter tuning can be computationally demanding and time-consuming. Meanwhile, traditional machine learning methods like gradient-boosting algorithms remain the preferred choice for most tabular data applications, while neural network alternatives require extensive hyperparameter tuning or work only on toy datasets under limited settings. In this paper, we introduce HyperFast, a meta-trained hypernetwork designed for instant classification of tabular data in a single forward pass. HyperFast generates a task-specific neural network tailored to an unseen dataset that can be directly used for classification inference, removing the need for training a model. We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. HyperFast shows highly competitive results, while being significantly faster. Additionally, our approach demonstrates robust adaptability across a variety of classification tasks with little to no fine-tuning, positioning HyperFast as a strong solution for numerous applications and rapid model deployment. HyperFast introduces a promising paradigm for fast classification, with the potential to substantially decrease the computational burden of deep learning. Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at https://github.com/AI-sandbox/HyperFast.
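The hypernetwork idea can be pictured with the toy sketch below: a meta-trained network maps a crude summary of the support set to the weights of a linear classifier in one forward pass, so the generated classifier needs no gradient steps on the new task. The dataset summarization and architecture here are illustrative assumptions and are far simpler than HyperFast itself.

    import torch
    import torch.nn as nn

    class TinyHyperNet(nn.Module):
        def __init__(self, n_features, n_classes, hidden=256):
            super().__init__()
            self.n_features, self.n_classes = n_features, n_classes
            out_dim = n_features * n_classes + n_classes   # weights + biases of a linear head
            self.net = nn.Sequential(nn.Linear(n_features + n_classes, hidden),
                                     nn.ReLU(),
                                     nn.Linear(hidden, out_dim))

        def generate(self, x_support, y_support):
            # crude dataset summary: feature means concatenated with class frequencies
            onehot = torch.eye(self.n_classes)[y_support]  # y_support: LongTensor of labels
            summary = torch.cat([x_support.mean(dim=0), onehot.mean(dim=0)])
            params = self.net(summary)
            w = params[: self.n_features * self.n_classes].view(self.n_classes, self.n_features)
            b = params[self.n_features * self.n_classes :]
            return w, b

        def predict(self, x_support, y_support, x_query):
            w, b = self.generate(x_support, y_support)     # single forward pass, no task training
            return (x_query @ w.t() + b).argmax(dim=-1)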



Paperid:1235
Authors:Leo Brunswic, Yinchuan Li, Yushun Xu, Yijun Feng, Shangling Jui, Lizhuang Ma
Huawei Shanghai Shanghai Research Center Shanghai Jiaotong University, Huawei Noah’s Ark Lab, Beijing, China, Huawei Shanghai Shanghai Research Center, Huawei Shanghai Shanghai Research Center, Huawei Shanghai Shanghai Research Center, Shanghai Jiaotong University
Abstract:
GFlowNets is a novel flow-based method for learning a stochastic policy to generate objects via a sequence of actions and with probability proportional to a given positive reward. We contribute to relaxing hypotheses limiting the application range of GFlowNets, in particular: acyclicity (or lack thereof). To this end, we extend the theory of GFlowNets on measurable spaces which includes continuous state spaces without cycle restrictions, and provide a generalization of cycles in this generalized context. We show that losses used so far push flows to get stuck into cycles and we define a family of losses solving this issue. Experiments on graphs and continuous tasks validate those principles.



Paperid:1236
Authors:Ruichu Cai, Yuxuan Zhu, Jie Qiao, Zefeng Liang, Furui Liu, Zhifeng Hao
School of Computer Science, Guangdong University of Technology, Guangzhou, China Peng Cheng Laboratory, Shenzhen, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, Zhejiang Lab, Hangzhou, China, College of Science, Shantou University, Shantou, China
Abstract:
Deep neural networks (DNNs) have been demonstrated to be vulnerable to well-crafted adversarial examples, which are generated through either well-conceived L_p-norm restricted or unrestricted attacks. Nevertheless, the majority of those approaches assume that adversaries can modify any features as they wish, and neglect the causal generating process of the data, which is unreasonable and unpractical. For instance, a modification in income would inevitably impact features like the debt-to-income ratio within a banking system. By considering the underappreciated causal generating process, first, we pinpoint the source of the vulnerability of DNNs via the lens of causality, then give theoretical results to answer where to attack. Second, considering the consequences of the attack interventions on the current state of the examples to generate more realistic adversarial examples, we propose CADE, a framework that can generate Counterfactual ADversarial Examples to answer how to attack. The empirical results demonstrate CADE's effectiveness, as evidenced by its competitive performance across diverse attack scenarios, including white-box, transfer-based, and random intervention attacks.



Paperid:1237
Authors:Wanlin Cai, Yuxuan Liang, Xianggen Liu, Jianshuai Feng, Yuankai Wu
Sichuan University, The Hong Kong University of Science and Technology (Guangzhou), Sichuan University, Beijing Institute of Technology, Sichuan University
Abstract:
Multivariate time series forecasting poses an ongoing challenge across various disciplines. Time series data often exhibit diverse intra-series and inter-series correlations, contributing to intricate and interwoven dependencies that have been the focus of numerous studies. Nevertheless, a significant research gap remains in comprehending the varying inter-series correlations across different time scales among multiple time series, an area that has received limited attention in the literature. To bridge this gap, this paper introduces MSGNet, an advanced deep learning model designed to capture the varying inter-series correlations across multiple time scales using frequency domain analysis and adaptive graph convolution. By leveraging frequency domain analysis, MSGNet effectively extracts salient periodic patterns and decomposes the time series into distinct time scales. The model incorporates a self-attention mechanism to capture intra-series dependencies, while introducing an adaptive mixhop graph convolution layer to autonomously learn diverse inter-series correlations within each time scale. Extensive experiments are conducted on several real-world datasets to showcase the effectiveness of MSGNet. Furthermore, MSGNet possesses the ability to automatically learn explainable multi-scale inter-series correlations, exhibiting strong generalization capabilities even when applied to out-of-distribution samples.



Paperid:1238
Authors:Xu Cai, Jonathan Scarlett
National University of Singapore, National University of Singapore
Abstract:
In this paper, we study the problem of estimating the normalizing constant through queries to the black-box function f, which is the integration of the exponential function of f scaled by a problem parameter lambda. We assume f belongs to a reproducing kernel Hilbert space (RKHS), and show that to estimate the normalizing constant within a small relative error, the level of difficulty depends on the value of lambda: When lambda approaches zero, the problem is similar to Bayesian quadrature (BQ), while when lambda approaches infinity, the problem is similar to Bayesian optimization (BO). More generally, the problem varies between BQ and BO. We find that this pattern holds true even when the function evaluations are noisy, bringing new aspects to this topic. Our findings are supported by both algorithm-independent lower bounds and algorithmic upper bounds, as well as simulation studies conducted on a variety of benchmark functions.
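Written out under the setup described above (a hedged reading, since the exact domain and base measure are specified in the paper), the target quantity is

    Z(\lambda) = \int \exp\big(\lambda\, f(x)\big)\, \mathrm{d}x ,

so that as \lambda \to 0 the problem reduces to integrating f (the Bayesian quadrature regime), while as \lambda \to \infty the integral is dominated by the maximizer of f (the Bayesian optimization regime).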



Paperid:1239
Authors:Zicheng Cai, Lei Chen, Peng Liu, Tongtao Ling, Yutao Lai
Guangdong University of Technology Ping An Technology (Shenzhen) Co., Ltd., Guangdong University of Technology, Ping An Technology (Shenzhen) Co., Ltd. The Hong Kong Polytechnic University, Guangdong University of Technology, Guangdong University of Technology
Abstract:
Differentiable Architecture Search (DARTS) has achieved a rapid search for excellent architectures by optimizing architecture parameters through gradient descent. However, this efficiency comes with a significant challenge: the risk of premature convergence to local optima, resulting in subpar performance that falls short of expectations. To address this issue, we propose a novel and effective method called Evolutionary Gradient-Based Neural Architecture Search (EG-NAS). Our approach combines the strengths of both gradient descent and evolutionary strategy, allowing for the exploration of various optimization directions during the architecture search process. To begin with, we continue to employ gradient descent for updating network parameters to ensure efficiency. Subsequently, to mitigate the risk of premature convergence, we introduce an evolutionary strategy with global search capabilities to optimize the architecture parameters. By leveraging the best of both worlds, our method strikes a balance between efficient exploration and exploitation of the search space. Moreover, we have redefined the fitness function to not only consider accuracy but also account for individual similarity. This inclusion enhances the diversity and accuracy of the optimized directions identified by the evolutionary strategy. Extensive experiments on various datasets and search spaces demonstrate that EG-NAS achieves highly competitive performance at significantly low search costs compared to state-of-the-art methods. The code is available at https://github.com/caicaicheng/EG-NAS.
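A minimal sketch of the alternating scheme described above: a gradient step on the network weights followed by a simple evolutionary-strategy step on the architecture parameters. The helper callables (train_loss_fn, val_fitness_fn) are hypothetical, and the fitness here is plain validation fitness; EG-NAS's redefined fitness additionally accounts for individual similarity, which is omitted in this illustration.

    import torch

    def search_step(model, alpha, train_loss_fn, val_fitness_fn,
                    w_optimizer, pop_size=8, sigma=0.1):
        # 1) gradient descent on network weights for efficiency
        w_optimizer.zero_grad()
        train_loss_fn(model, alpha).backward()
        w_optimizer.step()

        # 2) evolutionary exploration of the architecture parameters
        candidates = [alpha + sigma * torch.randn_like(alpha) for _ in range(pop_size)]
        fitness = [val_fitness_fn(model, a) for a in candidates]
        best = max(range(pop_size), key=lambda i: fitness[i])
        return candidates[best]   # architecture parameters for the next step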



Paperid:1240
Authors:Meng Cao, Songcan Chen
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence
Abstract:
Domain generalization aims to learn a well-performing classifier on multiple source domains for unseen target domains under domain shift. Domain-invariant representation (DIR) is an intuitive approach and has been of great concern. In practice, since the targets are variant and agnostic, a few sources are not sufficient to reflect the entire domain population, leading to biased DIR. Derived from the PAC-Bayes framework, we provide a novel generalization bound involving the number of domains sampled from the environment (N) and the radius of the Wasserstein ball centred on the target (r), which have rarely been considered before. Herein, we can obtain two natural and significant findings: when N increases, 1) the gap between the source and target sampling environments can be gradually mitigated; 2) the target can be better approximated within the Wasserstein ball. These findings prompt us to collect adequate domains against domain shift. For convenience, we design a novel yet simple Extrapolation Domain strategy induced by the Mixup scheme, namely EDM. Through a reverse Mixup scheme to generate the extrapolated domains, combined with the interpolated domains, we expand the interpolation space spanned by the sources, providing more abundant domains to increase sampling intersections to shorten r. Moreover, EDM is easy to implement and can be used in a plug-and-play manner. In experiments, EDM has been plugged into several methods in both closed and open set settings, achieving up to 5.73% improvement.
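The extrapolation step can be pictured as Mixup with a mixing coefficient taken outside the unit interval, which pushes synthetic domains beyond the convex hull of the sources; the coefficient values below are illustrative and not necessarily those used by EDM.

    import torch

    def mix(x_a, x_b, lam):
        # Mixup-style combination of two source-domain batches
        return lam * x_a + (1.0 - lam) * x_b

    x_a, x_b = torch.randn(16, 3, 32, 32), torch.randn(16, 3, 32, 32)
    x_interp = mix(x_a, x_b, lam=0.4)   # interpolated domain, lam in (0, 1)
    x_extrap = mix(x_a, x_b, lam=1.4)   # extrapolated ("reverse" Mixup) domain, lam > 1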



Paperid:1241
Authors:Abdulkadir Çelikkanat, Nikolaos Nakis, Morten Mørup
Technical University of Denmark, Technical University of Denmark, Technical University of Denmark
Abstract:
Over the past two decades, there has been a tremendous increase in the growth of representation learning methods for graphs, with numerous applications across various fields, including bioinformatics, chemistry, and the social sciences. However, current dynamic network approaches focus on discrete-time networks or treat links in continuous-time networks as instantaneous events. Therefore, these approaches have limitations in capturing the persistence or absence of links that continuously emerge and disappear over time for particular durations. To address this, we propose a novel stochastic process relying on survival functions to model the durations of links and their absences over time. This forms a generic new likelihood specification explicitly accounting for intermittent edge-persistent networks, namely GraSSP: Graph Representation with Sequential Survival Process. We apply the developed framework to a recent continuous time dynamic latent distance model characterizing network dynamics in terms of a sequence of piecewise linear movements of nodes in latent space. We quantitatively assess the developed framework in various downstream tasks, such as link prediction and network completion, demonstrating that the developed modeling framework accounting for link persistence and absence well tracks the intrinsic trajectories of nodes in a latent space and captures the underlying characteristics of evolving network structure.
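For readers less familiar with survival analysis, the modeling ingredient referred to above is the survival function of a link (or non-link) duration T, which in its generic hazard-rate form reads

    S(t) = \Pr(T > t) = \exp\!\left(-\int_0^{t} \lambda(u)\, \mathrm{d}u\right),

where \lambda(u) is the instantaneous rate at which an existing link dissolves (or an absent link forms); the specific parameterization tied to the latent distance dynamics is developed in the paper.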



Paperid:1242
Authors:Sungmin Cha, Sungjun Cho, Dasol Hwang, Honglak Lee, Taesup Moon, Moontae Lee
New York University, LG AI Research, LG AI Research, LG AI Research, ASRI / INMC / Seoul National University, LG AI Research University of Illinois at Chicago
Abstract:
Since the recent advent of regulations for data protection (e.g., the General Data Protection Regulation), there has been increasing demand in deleting information learned from sensitive data in pretrained models without retraining from scratch. The inherent vulnerability of neural networks towards adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining the predictive performance across remaining data. To this end, we consider instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods only require the pre-trained model and data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experimentation on various image classification benchmarks, we show that our approach effectively preserves knowledge of remaining data while unlearning given instances in both single-task and continual unlearning scenarios.



Paperid:1243
Authors:Ahmad Chamma, Bertrand Thirion, Denis Engemann
Inria-Saclay, Palaiseau, France Université Paris-Saclay CEA Saclay, Inria-Saclay, Palaiseau, France Université Paris-Saclay CEA Saclay, Roche Pharma Research and Early Development, Neuroscience and Rare Diseases, Roche Innovation Center Basel, F. Hoffmann–La Roche Ltd., Basel, Switzerland
Abstract:
Explaining the decision process of machine learning algorithms is nowadays crucial for both model performance enhancement and human comprehension. This can be achieved by assessing the variable importance of single variables, even for high-capacity non-linear methods, e.g. Deep Neural Networks (DNNs). While only removal-based approaches, such as Permutation Importance (PI), can bring statistical validity, they return misleading results when variables are correlated. Conditional Permutation Importance (CPI) bypasses PI’s limitations in such cases. However, in high-dimensional settings, where high correlations between the variables cancel their conditional importance, the use of CPI as well as other methods leads to unreliable results, besides prohibitive computation costs. Grouping variables statistically via clustering or some prior knowledge gains some power back and leads to better interpretations. In this work, we introduce BCPI (Block-Based Conditional Permutation Importance), a new generic framework for variable importance computation with statistical guarantees handling both single and group cases. Furthermore, as handling groups with high cardinality (such as a set of observations of a given modality) is both time-consuming and resource-intensive, we also introduce a new stacking approach extending the DNN architecture with sub-linear layers adapted to the group structure. We show that the ensuing approach extended with stacking controls the type-I error even with highly-correlated groups and shows top accuracy across benchmarks. Furthermore, we perform a real-world data analysis in a large-scale medical dataset where we aim to show the consistency between our results and the literature for a biomarker prediction.
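A minimal sketch of conditional permutation importance for a single variable, assuming a regression task and a linear conditional model: only the part of x_j not explained by the remaining variables is permuted, so correlated covariates do not spuriously deflate the score. Here `predict` stands for the fitted learner's prediction function; estimator choice, cross-fitting, and the statistical guarantees are handled in the paper, not in this toy version.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def conditional_permutation_importance(predict, X, y, j, n_repeats=20, seed=0):
        rng = np.random.default_rng(seed)
        others = np.delete(X, j, axis=1)
        cond = LinearRegression().fit(others, X[:, j])     # x_j explained by the other variables
        fitted = cond.predict(others)
        resid = X[:, j] - fitted
        base_loss = mean_squared_error(y, predict(X))
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = fitted + rng.permutation(resid) # permute only the conditional residual
            increases.append(mean_squared_error(y, predict(X_perm)) - base_loss)
        return float(np.mean(increases))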



Paperid:1244
Authors:T-H. Hubert Chan, Hao Xie, Mengshi Zhao
The University of Hong Kong, The University of Hong Kong, The University of Hong Kong
Abstract:
We examine a private ADMM variant for (strongly) convex objectives which is a primal-dual iterative method. Each iteration has a user with a private function used to update the primal variable, masked by Gaussian noise for local privacy, without directly adding noise to the dual variable. Privacy amplification by iteration explores if noises from later iterations can enhance the privacy guarantee when releasing final variables after the last iteration. Cyffers et al. explored privacy amplification by iteration for the proximal ADMM variant, where a user's entire private function is accessed and noise is added to the primal variable. In contrast, we examine a private ADMM variant requiring just one gradient access to a user's function, but both primal and dual variables must be passed between successive iterations. To apply Balle et al.'s coupling framework to the gradient ADMM variant, we tackle technical challenges with novel ideas. First, we address the non-expansive mapping issue in ADMM iterations by using a customized norm. Second, because the dual variables are not masked with any noise directly, their privacy guarantees are achieved by treating two consecutive noisy ADMM iterations as a Markov operator. Our main result is that the privacy guarantee for the gradient ADMM variant can be amplified proportionally to the number of iterations. For strongly convex objective functions, this amplification exponentially increases with the number of iterations. These amplification results align with the previously studied special case of stochastic gradient descent.



Paperid:1245
Authors:Wonjoon Chang, Dahee Kwon, Jaesik Choi
KAIST, KAIST, KAIST INEEJI
Abstract:
Understanding intermediate representations of the concepts learned by deep learning classifiers is indispensable for interpreting general model behaviors. Existing approaches to reveal learned concepts often rely on human supervision, such as predefined concept sets or segmentation processes. In this paper, we propose a novel unsupervised method for discovering distributed representations of concepts by selecting a principal subset of neurons. Our empirical findings demonstrate that instances with similar neuron activation states tend to share coherent concepts. Based on the observations, the proposed method selects principal neurons that construct an interpretable region, namely a Relaxed Decision Region (RDR), encompassing instances with coherent concepts in the feature space. It can be utilized to identify unlabeled subclasses within data and to detect the causes of misclassifications. Furthermore, the applicability of our method across various layers discloses distinct distributed representations over the layers, which provides deeper insights into the internal mechanisms of the deep learning model.



Paperid:1246
Authors:Guoqing Chao, Yi Jiang, Dianhui Chu
Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology
Abstract:
Incomplete multi-view clustering has become an important research problem, since multi-view data with missing values are ubiquitous in real-world applications. Although great efforts have been made for incomplete multi-view clustering, there are still some challenges: 1) most existing methods did not make full use of multi-view information to deal with missing values; 2) most methods just employ the consistent information within multi-view data but ignore the complementary information; 3) for existing incomplete multi-view clustering methods, incomplete multi-view representation learning and clustering are treated as independent processes, which leads to a performance gap. In this work, we propose a novel Incomplete Contrastive Multi-View Clustering method with high-confidence guiding (ICMVC). Firstly, we propose a multi-view consistency relation transfer plus graph convolutional network to tackle the missing values problem. Secondly, instance-level attention fusion and high-confidence guiding are proposed to exploit the complementary information, while instance-level contrastive learning for latent representation is designed to employ the consistent information. Thirdly, an end-to-end framework is proposed to integrate multi-view missing values handling, multi-view representation learning and clustering assignment for joint optimization. Comparative experiments with state-of-the-art approaches demonstrate the effectiveness and superiority of our method. Our code is publicly available at https://github.com/liunian-Jay/ICMVC. The version with supplementary material can be found at http://arxiv.org/abs/2312.08697.



Paperid:1247
Authors:Yassine Chemingui, Aryan Deshwal, Trong Nghia Hoang, Janardhan Rao Doppa
Washington State University, Washington State University, Washington State University, Washington State University
Abstract:
Offline optimization is an emerging problem in many experimental engineering domains including protein, drug or aircraft design, where online experimentation to collect evaluation data is too expensive or dangerous. To avoid this, one has to optimize an unknown function given only its offline evaluation at a fixed set of inputs. A naive solution to this problem is to learn a surrogate model of the unknown function and optimize this surrogate instead. However, such a naive optimizer is prone to erroneous overestimation of the surrogate (possibly due to overfitting on a biased sample of function evaluations) on inputs outside the offline dataset. Prior approaches addressing this challenge have primarily focused on learning robust surrogate models. However, their search strategies are derived from the surrogate model rather than the actual offline data. To fill this important gap, we introduce a new learning-to-search perspective for offline optimization by reformulating it as an offline reinforcement learning problem. Our proposed policy-guided gradient search approach explicitly learns the best policy for a given surrogate model created from the offline data. Our empirical results on multiple benchmarks demonstrate that the learned optimization policy can be combined with existing offline surrogates to significantly improve the optimization performance.



Paperid:1248
Authors:Chao Chen, Jiacheng Xu, Weijian Liao, Hao Ding, Zongzhang Zhang, Yang Yu, Rui Zhao
National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, Tencent Robotics X, Shenzhen, China
Abstract:
Visual Reinforcement Learning (RL) is a promising approach to achieve human-like intelligence. However, it currently faces challenges in learning efficiently within noisy environments. In contrast, humans can quickly identify task-relevant objects in distraction-filled surroundings by applying previously acquired common knowledge. Recently, foundational models in natural language processing and computer vision have achieved remarkable successes, and the common knowledge within these models can significantly benefit downstream task training. Inspired by these achievements, we aim to incorporate common knowledge from foundational models into visual RL. We propose a novel Focus-Then-Decide (FTD) framework, allowing the agent to make decisions based solely on task-relevant objects. To achieve this, we introduce an attention mechanism to select task-relevant objects from the object set returned by a foundational segmentation model, and only use the task-relevant objects for the subsequent training of the decision module. Additionally, we specifically employ two generic self-supervised objectives to facilitate the rapid learning of this attention mechanism. Experimental results on challenging tasks based on DeepMind Control Suite and Franka Emika Robotics demonstrate that our method can quickly and accurately pinpoint objects of interest in noisy environments. Consequently, it achieves a significant performance improvement over current state-of-the-art algorithms. Project Page: https://www.lamda.nju.edu.cn/chenc/FTD.html Code: https://github.com/LAMDA-RL/FTD



Paperid:1249
Authors:Dong Chen, Yueting Zhuang, Shuo Zhang, Jinfeng Liu, Su Dong, Siliang Tang
Zhejiang University, Zhejiang University, Zhejiang University, Ant Group, Ant Group, Zhejiang University
Abstract:
Pretrained large models, particularly large language models, have garnered increasing attention, as they have demonstrated remarkable abilities through contextual learning. Pretrained large models are increasingly recognized as fundamental tools for solving various tasks. However, the substantial computational demands of large models have dissuaded most product teams and individuals from running them. In such scenarios, to leverage the exceptional performance of large models, one must solely depend on costly APIs, further burdening product teams and individuals. On the other hand, despite the overall inferior performance of small models compared to large models, there are certain distributions where small models can achieve comparable or even superior results. For instance, during training, small models may become trapped in a local optimum that is unique to certain distributions, leading to superior performance. Hence, we propose Data Shunt (DS), a general paradigm for collaboration of small and large models. DS not only substantially reduces the cost associated with deploying large models but also effectively enhances overall performance. Specifically, DS determines the shunting direction by evaluating the confidence level of small models. When the confidence level falls below a specific threshold, the input data is forwarded to large models. To further leverage the advantages of the small and large models, we introduce Prompt Pruning (PP) and 2-Stage Confidence Distillation (2CD), which facilitate mutual collaboration, leading to better results and less cost. The remarkable performance across diverse modalities and tasks demonstrates the superiority of the proposed DS over large models. For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, and DS achieves an accuracy of 95.64%, while the cost has been reduced to only 31.18%. The code for the proposed method is provided for research purposes: https://github.com/Anfeather/Data-Shunt.
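The shunting rule itself is simple enough to sketch in a few lines; the threshold value and the model interfaces below are illustrative assumptions, and the full framework additionally applies Prompt Pruning and 2-Stage Confidence Distillation.

    def data_shunt(small_model, large_model, x, threshold=0.9):
        probs = small_model.predict_proba(x)          # assumed: plain list of class probabilities
        conf = max(probs)
        if conf >= threshold:
            return probs.index(conf), "small"         # confident: keep the cheap prediction
        return large_model.predict(x), "large"        # low confidence: forward to the large model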



Paperid:1250
Authors:Dong Chen, Ning Liu, Yichen Zhu, Zhengping Che, Rui Ma, Fachao Zhang, Xiaofeng Mou, Yi Chang, Jian Tang
Jilin University Midea Group, Midea Group, Midea Group, Midea Group, Jilin University, Midea Group, Midea Group, Jilin University, Midea Group
Abstract:
Neural network compression techniques, such as knowledge distillation (KD) and network pruning, have received increasing attention. Recent work `Prune, then Distill' reveals that a pruned student-friendly teacher network can benefit the performance of KD. However, the conventional teacher-student pipeline, which entails cumbersome pre-training of the teacher and complicated compression steps, makes pruning with KD less efficient. In addition to compressing models, recent compression techniques also emphasize the aspect of efficiency. Early pruning demands significantly less computational cost in comparison to the conventional pruning methods as it does not require a large pre-trained model. Likewise, a special case of KD, known as self-distillation (SD), is more efficient since it requires no pre-training or student-teacher pair selection. This inspires us to combine early pruning with SD for efficient model compression. In this work, we propose the framework named Early Pruning with Self-Distillation (EPSD), which identifies and preserves distillable weights in early pruning for a given SD task. EPSD efficiently combines early pruning and self-distillation in a two-step process, maintaining the pruned network's trainability for compression. Instead of a simple combination of pruning and SD, EPSD enables the pruned network to favor SD by keeping more distillable weights before training to ensure better distillation of the pruned network. We demonstrated that EPSD improves the training of pruned networks, supported by visual and quantitative analyses. Our evaluation covered diverse benchmarks (CIFAR-10/100, Tiny-ImageNet, full ImageNet, CUB-200-2011, and Pascal VOC), with EPSD outperforming advanced pruning and SD techniques.



Paperid:1251
Authors:E Chen, Yang Cao, Yifei Ge
Zhejiang Lab, Hokkaido University, Xi'an Jiaotong-Liverpool University
Abstract:
The shuffle model of local differential privacy is an advanced method of privacy amplification designed to enhance privacy protection with high utility. It achieves this by randomly shuffling sensitive data, making linking individual data points to specific individuals more challenging. However, most existing studies have focused on the shuffle model based on (ε0,0)-Locally Differentially Private (LDP) randomizers, with limited consideration for complex scenarios such as (ε0,δ0)-LDP or personalized LDP (PLDP). This hinders a comprehensive understanding of the shuffle model's potential and limits its application in various settings. To bridge this research gap, we propose a generalized shuffle framework that can be applied to the PLDP setting. This generalization allows for a broader exploration of the privacy-utility trade-off and facilitates the design of privacy-preserving analyses in diverse contexts. We prove that the shuffled PLDP process approximately preserves μ-Gaussian Differential Privacy with μ = O(1/√n). This approach allows us to avoid the limitations and potential inaccuracies associated with inequality estimations. To strengthen the privacy guarantee, we improve the lower bound by utilizing hypothesis testing instead of relying on rough estimations like the Chernoff bound or Hoeffding's inequality. Furthermore, extensive comparative evaluations clearly show that our approach outperforms existing methods in achieving strong central privacy guarantees while preserving the utility of the global model. We have also carefully designed corresponding algorithms for the average function, frequency estimation, and stochastic gradient descent.



Paperid:1252
Authors:Guangyao Chen, Peixi Peng, Yangru Huang, Mengyue Geng, Yonghong Tian
Peking University, Peking University Peng Cheng Laboratory, Peking University, Peking University, Peking University Peng Cheng Laboratory
Abstract:
One important desideratum of lifelong learning is to discover novel classes from unlabelled data in a continuous manner. The central challenge is twofold: discovering and learning novel classes while mitigating the issue of catastrophic forgetting of established knowledge. To this end, we introduce a new paradigm called Adaptive Discovering and Merging (ADM) to discover novel categories adaptively in the incremental stage and integrate novel knowledge into the model without affecting the original knowledge. To discover novel classes adaptively, we decouple representation learning and novel class discovery, and use Triple Comparison (TC) and Probability Regularization (PR) to constrain the probability discrepancy and diversity for adaptive category assignment. To merge the learned novel knowledge adaptively, we propose a hybrid structure with base and novel branches named Adaptive Model Merging (AMM), which reduces the interference of the novel branch on the old classes to preserve the previous knowledge, and merges the novel branch to the base model without performance loss and parameter growth. Extensive experiments on several datasets show that ADM significantly outperforms existing class-incremental Novel Class Discovery (class-iNCD) approaches. Moreover, our AMM also benefits the class-incremental Learning (class-IL) task by alleviating the catastrophic forgetting problem. The source code is included in the supplementary materials.



Paperid:1253
Authors:Haokun Chen, Yao Zhang, Denis Krompass, Jindong Gu, Volker Tresp
LMU Munich Siemens AG, LMU Munich Munich Center for Machine Learning (MCML), Siemens AG, University of Oxford, LMU Munich Munich Center for Machine Learning (MCML)
Abstract:
Recently, foundation models have exhibited remarkable advancements in multimodal learning. These models, equipped with millions (or billions) of parameters, typically require a substantial amount of data for finetuning. However, collecting and centralizing training data from diverse sectors becomes challenging due to distinct privacy regulations. Federated Learning (FL) emerges as a promising solution, enabling multiple clients to collaboratively train neural networks without centralizing their local data. To alleviate client computation burdens and communication overheads, previous works have adapted Parameter-efficient Finetuning (PEFT) methods for FL. In this setting, only a small fraction of the model parameters are optimized and communicated during federated communications. Nevertheless, most previous works have focused on a single modality and neglected one common phenomenon, i.e., the presence of data heterogeneity across the clients. Therefore, in this work, we propose a finetuning framework tailored to heterogeneous multi-modal FL, called Federated Dual-Adapter Teacher (FedDAT). Specifically, our approach leverages a Dual-Adapter Teacher (DAT) to address data heterogeneity by regularizing the client local updates and applying Mutual Knowledge Distillation (MKD) for an efficient knowledge transfer. FedDAT is the first approach that enables an efficient distributed finetuning of foundation models for a variety of heterogeneous Vision-Language tasks. To demonstrate its effectiveness, we conduct extensive experiments on four multi-modality FL benchmarks with different types of data heterogeneity, where FedDAT substantially outperforms the existing centralized PEFT methods adapted for FL.



Paperid:1254
Authors:Hui Chen, Yinxu Jia, Guanghui Wang, Changliang Zou
Jiangsu Normal University, Nankai University, East China Normal University, Nankai University
Abstract:
Accurately detecting multiple change-points is critical for various applications, but determining the optimal number of change-points remains a challenge. Existing approaches based on information criteria attempt to balance goodness-of-fit and model complexity, but their performance varies depending on the model. Recently, data-driven selection criteria based on cross-validation have been proposed, but these methods can be prone to slight overfitting in finite samples. In this paper, we introduce a method that controls the probability of overestimation and provides uncertainty quantification for learning multiple change-points via cross-validation. We frame this problem as a sequence of model comparison problems and leverage high-dimensional inferential procedures. We demonstrate the effectiveness of our approach through experiments on finite-sample data, showing superior uncertainty quantification for overestimation compared to existing methods. Our approach has broad applicability and can be used in diverse change-point models.



Paperid:1255
Authors:Jiaxuan Chen, Yu Qi, Yueming Wang, Gang Pan
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
How our brain encodes complex concepts has been a longstanding mystery in neuroscience. The answer to this problem can lead to new understanding of how the brain retrieves information in large-scale data with high efficiency and robustness. Neuroscience studies suggest the brain represents concepts in a locality-sensitive hashing (LSH) strategy, i.e., similar concepts will be represented by similar responses. This finding has inspired the design of similarity-based algorithms, especially in contrastive learning. Here, we hypothesize that the brain and large neural network models, both using similarity-based learning rules, could contain a similar semantic embedding space. To verify that, this paper proposes a functional Magnetic Resonance Imaging (fMRI) semantic learning network named BrainSem, aimed at seeking a joint semantic latent space that bridges the brain and a Contrastive Language-Image Pre-training (CLIP) model. Given that our perception is inherently cross-modal, we introduce a fuzzy (one-to-many) matching loss function to encourage the models to extract high-level semantic components from neural signals. Our results show that, using only a small set of fMRI recordings for semantic space alignment, we can obtain a shared embedding valid for unseen categories outside the training set, which provides potential evidence for the semantic representation similarity between the brain and large neural networks. In a zero-shot classification task, our BrainSem achieves an 11.6% improvement over the state-of-the-art.



Paperid:1256
Authors:Jiayi Chen, Aidong Zhang
University of Virginia, University of Virginia
Abstract:
There has been growing concern regarding data privacy during the development and deployment of Multimodal Foundation Models for Artificial General Intelligence (AGI), while Federated Learning (FL) allows multiple clients to collaboratively train models in a privacy-preserving manner. This paper formulates and studies Modality-task Agnostic Federated Learning (AFL) to pave the way toward privacy-preserving AGI. A unique property of AFL is the asymmetrical knowledge relationships among clients due to modality gaps, task gaps, and domain shifts between clients. This raises a challenge in learning an optimal inter-client information-sharing scheme that maximizes positive transfer and minimizes negative transfer for AFL. However, prior FL methods, mostly focusing on symmetrical knowledge transfer, tend to exhibit insufficient positive transfer and fail to fully avoid negative transfer during inter-client collaboration. To address this issue, we propose DisentAFL, which leverages a two-stage Knowledge Disentanglement and Gating mechanism to explicitly decompose the original asymmetrical inter-client information-sharing scheme into several independent symmetrical inter-client information-sharing schemes, each of which corresponds to a certain semantic knowledge type learned from the local tasks. Experimental results demonstrate the superiority of our method over baselines on AFL.



Paperid:1257
Authors:Jiayu Chen, Zelai Xu, Yunfei Li, Chao Yu, Jiaming Song, Huazhong Yang, Fei Fang, Yu Wang, Yi Wu
Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University, Luma AI, Tsinghua University, Carnegie Mellon University, Tsinghua University, Tsinghua University Shanghai Qi Zhi Institute
Abstract:
Learning Nash equilibrium (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but an under-explored dimension for generating a curriculum is the difficulty-to-learn of the subgames, i.e., games induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to some previously visited states where they can quickly learn to improve performance. Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values and further adopt a particle-based state sampler for subgame generation. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), which is a realization of the subgame curriculum learning framework. SACL can be combined with any MARL algorithm such as MAPPO. Experiments in the particle-world environment and Google Research Football environment show SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages and uses only half the samples of MAPPO with self-play. The project website is at https://sites.google.com/view/sacl-neurips.



Paperid:1258
Authors:Jinqian Chen, Jihua Zhu, Qinghai Zheng, Zhongyu Li, Zhiqiang Tian
School of Software Engineering, Xi'an Jiaotong University Shaanxi Joint Key Laboratory for Artificial Intelligence, China, School of Software Engineering, Xi'an Jiaotong University Shaanxi Joint Key Laboratory for Artificial Intelligence, China, College of Computer and Data Science, Fuzhou University, School of Software Engineering, Xi'an Jiaotong University Shaanxi Joint Key Laboratory for Artificial Intelligence, China, School of Software Engineering, Xi'an Jiaotong University Shaanxi Joint Key Laboratory for Artificial Intelligence, China
Abstract:
Federated learning encounters substantial challenges with heterogeneous data, leading to performance degradation and convergence issues. While considerable progress has been achieved in mitigating such an impact, the reliability aspect of federated models has been largely disregarded. In this study, we conduct extensive experiments to investigate the reliability of both generic and personalized federated models. Our exploration uncovers a significant finding: federated models exhibit unreliability when faced with heterogeneous data, demonstrating poor calibration on in-distribution test data and low uncertainty levels on out-of-distribution data. This unreliability is primarily attributed to the presence of biased projection heads, which introduce miscalibration into the federated models. Inspired by this observation, we propose the "Assembled Projection Heads" (APH) method for enhancing the reliability of federated models. By treating the existing projection head parameters as priors, APH randomly samples multiple initialized parameters of projection heads from the prior and further performs targeted fine-tuning on locally available data under varying learning rates. Such a head ensemble introduces parameter diversity into the deterministic model, eliminating the bias and producing reliable predictions via head averaging. We evaluate the effectiveness of the proposed APH method across three prominent federated benchmarks. Experimental results validate the efficacy of APH in model calibration and uncertainty estimation. Notably, APH can be seamlessly integrated into various federated approaches but only requires less than 30% additional computation cost for 100x inferences within large models.
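The head-assembling step described above lends itself to a compact sketch. The following is a hedged illustration rather than the authors' implementation: the noise scale, number of heads, optimizer, and learning-rate grid are assumptions, and backbone, head, and local_loader are hypothetical placeholders for a client's feature extractor, projection head, and local data loader.

import copy
import torch

def assemble_projection_heads(backbone, head, local_loader, k=4,
                              lrs=(1e-3, 5e-4, 1e-4, 5e-5),
                              noise_std=0.01, steps=50, device="cpu"):
    """Sample k heads around the existing head (treated as a prior),
    fine-tune each briefly on local data with its own learning rate,
    and average their predictions at inference time."""
    heads = []
    for i in range(k):
        h = copy.deepcopy(head)
        with torch.no_grad():
            for p in h.parameters():
                # Perturb parameters around the prior (noise_std is an assumption).
                p.add_(noise_std * torch.randn_like(p))
        opt = torch.optim.SGD(h.parameters(), lr=lrs[i % len(lrs)])
        for step, (x, y) in enumerate(local_loader):
            if step >= steps:
                break
            logits = h(backbone(x.to(device)))
            loss = torch.nn.functional.cross_entropy(logits, y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
        heads.append(h)

    def predict(x):
        with torch.no_grad():
            feats = backbone(x.to(device))
            probs = torch.stack([h(feats).softmax(-1) for h in heads])
        return probs.mean(0)  # head averaging yields the final prediction

    return predict

The averaging at the end is what introduces the parameter diversity the abstract credits with removing the projection-head bias; how the heads are merged back into the federated rounds is left out here.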



Paperid:1259
Authors:Junjie Chen, Jiahao Li, Chen Song, Bin Li, Qingcai Chen, Hongchang Gao, Wendy Hui Wang, Zenglin Xu, Xinghua Shi
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Temple University, Temple University, Harbin Institute of Technology, Shenzhen, Temple University, Stevens Institute of Technology, Harbin Institute of Technology, Shenzhen, Temple University
Abstract:
Improving the diversity of Artificial Intelligence Generated Content (AIGC) is one of the fundamental problems in the theory of generative models such as generative adversarial networks (GANs). Previous studies have demonstrated that the discriminator in GANs should have high capacity and robustness to achieve the diversity of generated data. However, a discriminator with high capacity tends to overfit and guide the generator toward collapsed equilibrium. In this study, we propose a novel discriminative forest GAN, named ForestGAN, that replaces the discriminator to improve the capacity and robustness for modeling statistics in real-world data distribution. A discriminative forest is composed of multiple independent discriminators built on bootstrapped data. We prove that a discriminative forest has a generalization error bound, which is determined by the strength of individual discriminators and the correlations among them. Hence, a discriminative forest can provide very large capacity without any risk of overfitting, which subsequently improves the generative diversity. With the discriminative forest framework, we significantly improved the performance of AutoGAN with a new record FID of 19.27 from 30.71 on STL10 and improved the performance of StyleGAN2-ADA with a new record FID of 6.87 from 9.22 on LSUN-cat.
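The discriminative forest described above admits a small structural sketch. This is a hedged illustration only: the number of discriminators, the bootstrap scheme, and score averaging are assumptions, and make_discriminator is a hypothetical factory for whatever discriminator architecture is used.

import numpy as np
import torch

class DiscriminativeForest(torch.nn.Module):
    """An ensemble of independent discriminators, each associated with a
    bootstrap resample of the real data; the forest score is taken here as
    the average of the individual scores (aggregation rule assumed)."""
    def __init__(self, make_discriminator, num_trees=5, dataset_size=50_000, seed=0):
        super().__init__()
        rng = np.random.default_rng(seed)
        self.trees = torch.nn.ModuleList(
            [make_discriminator() for _ in range(num_trees)])
        # One bootstrap index set (sampling with replacement) per discriminator;
        # each discriminator is trained only on its own resampled real data.
        self.bootstrap_idx = [rng.choice(dataset_size, dataset_size, replace=True)
                              for _ in range(num_trees)]

    def forward(self, x):
        scores = torch.stack([d(x) for d in self.trees], dim=0)
        return scores.mean(dim=0)

The independence induced by bootstrapping is what the abstract's generalization bound leans on: strong individual discriminators with low correlation keep the ensemble from overfitting.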



Paperid:1260
Authors:Junren Chen, Zhaoqiang Liu
University of Hong Kong, University of Electronic Science and Technology of China
Abstract:
In this work, we focus on highdimensional single index models with non-Gaussian sensing vectors and generative priors. More specifically, our goal is to estimate the underlying signal from i.i.d. realizations of the semi-parameterized single index model, where the underlying signal is contained in (up to a constant scaling) the range of a Lipschitz continuous generative model with bounded low-dimensional inputs, the sensing vector follows a non-Gaussian distribution, the noise is a random variable that is independent of the sensing vector, and the unknown non-linear link function is differentiable. Using the first- and second-order Stein's identity, we introduce efficient algorithms to obtain estimated vectors that achieve the near-optimal statistical rate. Experimental results on image datasets are provided to support our theory.



Paperid:1261
Authors:Liangwei Chen, Xiren Zhou, Huanhuan Chen
School of Data Science, University of Science and Technology of China, School of Computer Science and Technology, University of Science and Technology of China, School of Computer Science and Technology, University of Science and Technology of China
Abstract:
With the rapid growth of audio data, there is a pressing need for automatic audio classification. As a type of time-series data, audio exhibits waveform fluctuations in both the time and frequency domains that evolve over time, with similar instances sharing consistent patterns. This study introduces the Audio Scanning Network (ASNet), designed to leverage abundant information for achieving stable and effective audio classification. ASNet captures real-time changes in audio waveforms across both time and frequency domains through reservoir computing, supported by Reservoir Kernel Canonical Correlation Analysis (RKCCA) to explore correlations between time-domain and frequency-domain waveform fluctuations. This innovative approach empowers ASNet to comprehensively capture the changes and inherent correlations within the audio waveform, without the need for time-consuming iterative training. Instead of converting audio into spectrograms, ASNet directly utilizes audio feature sequences to uncover associations between time and frequency fluctuations. Experiments on environmental sound and music genre classification tasks demonstrate ASNet's comparable performance to state-of-the-art methods.
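Reservoir computing is the training-free recurrent machinery the abstract relies on. Below is a minimal sketch of a textbook echo-state reservoir driven by an audio feature sequence; it is generic background rather than ASNet itself (the RKCCA coupling between time- and frequency-domain reservoirs is omitted), and all sizes and scalings are assumptions.

import numpy as np

def reservoir_states(features, n_reservoir=200, spectral_radius=0.9, seed=0):
    """Run a fixed (untrained) reservoir over a (T, d) feature sequence and
    return the (T, n_reservoir) state trajectory for a simple readout."""
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    W_in = rng.uniform(-0.5, 0.5, size=(n_reservoir, d))
    W = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_reservoir))
    # Rescale the recurrent weights for stability (echo-state property).
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    x = np.zeros(n_reservoir)
    states = []
    for u in features:
        x = np.tanh(W_in @ u + W @ x)  # no iterative training of W_in or W
        states.append(x.copy())
    return np.asarray(states)

Because W_in and W stay fixed, only a lightweight readout (or, in ASNet's case, the RKCCA-based analysis) needs fitting, which is where the claimed avoidance of time-consuming iterative training comes from.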



Paperid:1262
Authors:Mulin Chen, Bocheng Wang, Xuelong Li
Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University
Abstract:
Graph Convolutional Network (GCN) has exhibited remarkable potential in improving graph-based clustering. To handle the general clustering scenario without a prior graph, these models estimate an initial graph beforehand to apply GCN. Throughout the literature, we have witnessed that 1) most models focus on the initial graph while neglecting the original features. Therefore, the discriminability of the learned representation may be corrupted by a low-quality initial graph; 2) the training procedure lacks effective clustering guidance, which may lead to the incorporation of clustering-irrelevant information into the learned graph. To tackle these problems, the Deep Contrastive Graph Learning (DCGL) model is proposed for general data clustering. Specifically, we establish a pseudo-siamese network, which incorporates auto-encoder with GCN to emphasize both the graph structure and the original features. On this basis, feature-level contrastive learning is introduced to enhance the discriminative capacity, and the relationship between samples and centroids is employed as the clustering-oriented guidance. Afterward, a two-branch graph learning mechanism is designed to extract the local and global structural relationships, which are further embedded into a unified graph under the cluster-level contrastive guidance. Experimental results on several benchmark datasets demonstrate the superiority of DCGL against state-of-the-art algorithms.



Paperid:1263
Authors:Shuo Chen, Jiaying Peng, Xiaolong Li, Yao Zhao
Beijing Jiaotong University, Institute of Information Science, Capital Normal University, School of Mathematical Sciences, Beijing Jiaotong University, Institute of Information Science, Beijing Jiaotong University
Abstract:
Traditional gradient descent (GD) has been fully investigated for convex or L-smooth functions, and it is widely utilized in current neural network optimization. The classical descent lemma ensures that for a function with L-smoothness, the GD trajectory converges stably towards the minimum when the learning rate is below 2 / L. This convergence is marked by a consistent reduction in the loss function throughout the iterations. However, recent experimental studies have demonstrated that even when the L-smoothness condition is not met, or if the learning rate is increased leading to oscillations in the loss function during iterations, the GD trajectory still exhibits convergence over the long run. This phenomenon is referred to as the unstable convergence regime of GD. In this paper, we present a theoretical perspective to offer a qualitative analysis of this phenomenon. The unstable convergence is in fact an inherent property of GD for general twice differentiable functions. Specifically, the forward-invariance of GD is established, i.e., it ensures that any point within a local region will always remain within this region under GD iteration. Then, based on the forward-invariance, for an initialization outside an open set containing the local minimum, the loss function will oscillate during the first several iterations and then become monotonically decreasing after the GD trajectory jumps into the open set. This work theoretically clarifies the unstable convergence phenomenon of GD discussed in previous experimental works. The unstable convergence of GD mainly depends on the selection of the initialization, and it is actually inevitable due to the complex nature of the loss function.
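The descent lemma invoked above is standard and worth stating explicitly. For an L-smooth function f and the GD update x_{k+1} = x_k - \eta \nabla f(x_k),

f(x_{k+1}) \;\le\; f(x_k) - \eta\Bigl(1 - \tfrac{\eta L}{2}\Bigr)\,\lVert\nabla f(x_k)\rVert^2,

so the loss decreases monotonically whenever 0 < \eta < 2/L. The regime studied in the paper is precisely the one where this condition, or L-smoothness itself, fails and the loss is allowed to oscillate before the trajectory settles.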



Paperid:1264
Authors:Taicai Chen, Yue Duan, Dong Li, Lei Qi, Yinghuan Shi, Yang Gao
National Key Laboratory for Novel Software Technology, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China, Huawei Noah's Ark Lab, School of Computer Science and Engineering, Southeast University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China
Abstract:
Variational Autoencoder-based Bayesian Optimization (VAE-BO) has demonstrated its excellent performance in addressing high-dimensional structured optimization problems. However, current mainstream methods overlook the potential of utilizing a pool of unlabeled data to construct the latent space, while only concentrating on designing sophisticated models to leverage the labeled data. Despite their effective usage of labeled data, these methods often require extra network structures and additional procedures, resulting in computational inefficiency. To address this issue, we propose a novel method to effectively utilize unlabeled data with the guidance of labeled data. Specifically, we tailor the pseudo-labeling technique from semi-supervised learning to explicitly reveal the relative magnitudes of optimization objective values hidden within the unlabeled data. Based on this technique, we assign appropriate training weights to unlabeled data to enhance the construction of a discriminative latent space. Furthermore, we treat the VAE encoder and the Gaussian Process (GP) in Bayesian optimization as a unified deep kernel learning process, allowing the direct utilization of labeled data, which we term Gaussian Process guidance. This directly and effectively integrates the goal of improving GP accuracy into the VAE training, thereby guiding the construction of the latent space. The extensive experiments demonstrate that our proposed method outperforms existing VAE-BO algorithms in various optimization scenarios. Our code will be published at https://github.com/TaicaiChen/PG-LBO.



Paperid:1265
Authors:Wentse Chen, Shiyu Huang, Yuan Chiang, Tim Pearce, Wei-Wei Tu, Ting Chen, Jun Zhu
Carnegie Mellon University, 4Paradigm Inc., Tsinghua University, Microsoft Research, 4Paradigm Inc., Tsinghua University, Tsinghua University
Abstract:
Most reinforcement learning algorithms seek a single optimal strategy that solves a given task. However, it can often be valuable to learn a diverse set of solutions, for instance, to make an agent's interaction with users more engaging, or to improve the robustness of a policy to an unexpected perturbation. We propose Diversity-Guided Policy Optimization (DGPO), an on-policy algorithm that discovers multiple strategies for solving a given task. Unlike prior work, it achieves this with a shared policy network trained over a single run. Specifically, we design an intrinsic reward based on an information-theoretic diversity objective. Our final objective alternates between constraints on the diversity of the strategies and on the extrinsic reward. We solve the constrained optimization problem by casting it as a probabilistic inference task and use policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards, while discovering more diverse strategies, and often with better sample efficiency.



Paperid:1266
Authors:Xi Chen, Chang Gao, Zuowen Wang, Longbiao Cheng, Sheng Zhou, Shih-Chii Liu, Tobi Delbruck
Institute of Neuroinformatics, UZH and ETH Zurich, Department of Microelectronics, Delft University of Technology, Institute of Neuroinformatics, UZH and ETH Zurich, Institute of Neuroinformatics, UZH and ETH Zurich, Institute of Neuroinformatics, UZH and ETH Zurich, Institute of Neuroinformatics, UZH and ETH Zurich, Institute of Neuroinformatics, UZH and ETH Zurich
Abstract:
Recurrent Neural Networks (RNNs) are useful in temporal sequence tasks. However, training RNNs involves dense matrix multiplications which require hardware that can support a large number of arithmetic operations and memory accesses. Implementing online training of RNNs on the edge calls for optimized algorithms for an efficient deployment on hardware. Inspired by the spiking neuron model, the Delta RNN exploits temporal sparsity during inference by skipping over the update of hidden states from those inactivated neurons whose change of activation across two timesteps is below a defined threshold. This work describes a training algorithm for Delta RNNs that exploits temporal sparsity in the backward propagation phase to reduce computational requirements for training on the edge. Due to the symmetric computation graphs of forward and backward propagation during training, the gradient computation of inactivated neurons can be skipped. Results show a reduction of ∼80% in matrix operations for training a 56k parameter Delta LSTM on the Fluent Speech Commands dataset with negligible accuracy loss. Logic simulations of a hardware accelerator designed for the training algorithm show 210X speedup in matrix computations for an activation sparsity range of 50%-90%. Additionally, we show that the proposed Delta RNN training will be useful for online incremental learning on edge devices with limited computing resources.
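The delta-threshold mechanism described above can be made concrete with a small sketch. This is a hedged illustration of the general delta principle rather than the paper's training algorithm: it computes the full update and then masks it, purely to show which neurons a real accelerator could skip, and the cell type and threshold are assumptions.

import torch

def delta_rnn_step(cell, x_t, h_prev, h_ref, threshold=0.1):
    """One recurrent step with delta skipping. `cell` is e.g. a
    torch.nn.GRUCell; `h_ref` holds the last propagated activation of each
    neuron. Neurons whose activation changed by less than `threshold` keep
    their stale value, so the corresponding columns of the dense matrix
    multiplications could be skipped on suitable hardware."""
    h_new = cell(x_t, h_prev)
    delta = h_new - h_ref
    active = delta.abs() > threshold            # neurons exceeding the threshold
    h_out = torch.where(active, h_new, h_ref)   # inactive neurons keep stale state
    h_ref = torch.where(active, h_new, h_ref)   # reference advances only when active
    return h_out, h_ref, active.float().mean()  # last value = fraction not skipped

The paper's contribution is to exploit the same sparsity pattern in the backward pass as well, which follows from the symmetry of the forward and backward computation graphs.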



Paperid:1267
Authors:Yang Chen, Xiao Lin, Bo Yan, Libo Zhang, Jiamou Liu, Neset Özkan Tan, Michael Witbrock
NAOInstitute, University of Auckland, New Zealand School of Computer Science, University of Auckland, New Zealand, School of Computer Science, Beijing Institute of Technology, Beijing, China, School of Computer Science, Beijing Institute of Technology, Beijing, China, School of Computer Science, University of Auckland, New Zealand, School of Computer Science, University of Auckland, New Zealand, NAOInstitute, University of Auckland, New Zealand School of Computer Science, University of Auckland, New Zealand, NAOInstitute, University of Auckland, New Zealand School of Computer Science, University of Auckland, New Zealand
Abstract:
Designing suitable reward functions for numerous interacting intelligent agents is challenging in real-world applications. Inverse reinforcement learning (IRL) in mean field games (MFGs) offers a practical framework to infer reward functions from expert demonstrations. While promising, the assumption of agent homogeneity limits the capability of existing methods to handle demonstrations with heterogeneous and unknown objectives, which are common in practice. To this end, we propose a deep latent variable MFG model and an associated IRL method. Critically, our method can infer rewards from different yet structurally similar tasks without prior knowledge about underlying contexts or modifying the MFG model itself. Our experiments, conducted on simulated scenarios and a real-world spatial taxi-ride pricing problem, demonstrate the superiority of our approach over state-of-the-art IRL methods in MFGs.



Paperid:1268
Authors:Yiding Chen, Xuezhou Zhang, Qiaomin Xie, Xiaojin Zhu
University of Wisconsin-Madison, Boston University, University of Wisconsin-Madison, University of Wisconsin-Madison
Abstract:
We study offline reinforcement learning (RL) with heavy-tailed reward distribution and data corruption: (i) Moving beyond sub-Gaussian reward distribution, we allow the rewards to have infinite variances; (ii) We allow corruptions where an attacker can arbitrarily modify a small fraction of the rewards and transitions in the dataset. We first derive a sufficient optimality condition for generalized Pessimistic Value Iteration (PEVI), which allows various estimators with proper confidence bounds and can be applied to multiple learning settings. In order to handle the data corruption and heavy-tailed reward setting, we prove that the trimmed-mean estimation achieves the minimax optimal error rate for robust mean estimation under heavy-tailed distributions. In the PEVI algorithm, we plug in the trimmed mean estimation and the confidence bound to solve the robust offline RL problem. Standard analysis reveals that data corruption induces a bias term in the suboptimality gap, which gives the false impression that any data corruption prevents optimal policy learning. By using the optimality condition for the generalized PEVI, we show that as long as the bias term is less than the "action gap", the policy returned by PEVI achieves the optimal value given sufficient data.
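Trimmed-mean estimation, the robust primitive plugged into PEVI above, is simple enough to sketch. The snippet below is generic background; the trimming fraction and the toy example are illustrative assumptions, not the paper's settings.

import numpy as np

def trimmed_mean(samples, trim_fraction=0.1):
    """Discard the smallest and largest trim_fraction of observations and
    average the rest, so a small fraction of corrupted or heavy-tailed
    rewards cannot dominate the estimate."""
    x = np.sort(np.asarray(samples, dtype=float))
    k = int(trim_fraction * len(x))
    return x.mean() if k == 0 else x[k:-k].mean()

# Example: one wild outlier barely moves the trimmed mean.
rewards = np.concatenate([np.random.standard_t(df=2, size=1000), [1e6]])
print(np.mean(rewards), trimmed_mean(rewards, 0.05))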



Paperid:1269
Authors:Yiming Chen, Haiwei Wu, Jiantao Zhou
University of Macau, University of Macau, University of Macau
Abstract:
Deep Neural Networks (DNN) are susceptible to backdoor attacks where malicious attackers manipulate the model's predictions via data poisoning. It is hence imperative to develop a strategy for training a clean model using a potentially poisoned dataset. Previous training-time defense mechanisms typically employ a one-time isolation process, often leading to suboptimal isolation outcomes. In this study, we present a novel and efficacious defense method, termed Progressive Isolation of Poisoned Data (PIPD), that progressively isolates poisoned data to enhance the isolation accuracy and mitigate the risk of benign samples being misclassified as poisoned ones. Once the poisoned portion of the dataset has been identified, we introduce a selective training process to train a clean model. Through the implementation of these techniques, we ensure that the trained model manifests a significantly diminished attack success rate against the poisoned data. Extensive experiments on multiple benchmark datasets and DNN models, assessed against nine state-of-the-art backdoor attacks, demonstrate the superior performance of our PIPD method for backdoor defense. For instance, our PIPD achieves an average True Positive Rate (TPR) of 99.95% and an average False Positive Rate (FPR) of 0.06% for diverse attacks over the CIFAR-10 dataset, markedly surpassing the performance of state-of-the-art methods. The code is available at https://github.com/RorschachChen/PIPD.git.



Paperid:1270
Authors:Ying-Yu Chen, Jun-Wei Hsieh, Xin Li, Ming-Ching Chang
National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, University at Albany - SUNY, University at Albany - SUNY
Abstract:
Due to the scarcity of training samples, Few-Shot Learning (FSL) poses a significant challenge to capture discriminative object features effectively. The combination of transfer learning and meta-learning has recently been explored by pre-training the backbone features using labeled base data and subsequently fine-tuning the model with target data. However, existing meta-learning methods, which use embedding networks, suffer from scaling limitations when dealing with a few labeled samples, resulting in suboptimal performance. Inspired by the latest advances in FSL, we further advance the approach of fine-tuning a pre-trained architecture by a strengthened hierarchical feature representation. The technical contributions of this work include: 1) a hybrid design named Intra-Block Fusion (IBF) to strengthen the extracted features within each convolution block; and 2) a novel Cross-Scale Attention (CSA) module to mitigate the scaling inconsistencies arising from the limited training samples, especially for cross-domain tasks. We conducted comprehensive evaluations on standard benchmarks, including three in-domain tasks (miniImageNet, CIFAR-FS, and FC100), as well as two cross-domain tasks (CDFSL and Meta-Dataset). The results improve significantly over existing state-of-the-art approaches on all benchmark datasets. In particular, the FSL performance on the in-domain FC100 dataset is more than three points better than the latest PMF (Hu et al. 2022).



Paperid:1271
Authors:Yiyue Chen, Haris Vikalo, Chianing Wang
The University of Texas at Austin, The University of Texas at Austin, Toyota InfoTech Lab USA
Abstract:
Motivated by high resource costs of centralized machine learning schemes as well as data privacy concerns, federated learning (FL) emerged as an efficient alternative that relies on aggregating locally trained models rather than collecting clients' potentially private data. In practice, available resources and data distributions vary from one client to another, creating an inherent system heterogeneity that leads to deterioration of the performance of conventional FL algorithms. In this work, we present a federated quantization-based self-supervised learning scheme (Fed-QSSL) designed to address heterogeneity in FL systems. At clients' side, to tackle data heterogeneity we leverage distributed self-supervised learning while utilizing low-bit quantization to satisfy constraints imposed by local infrastructure and limited communication resources. At server's side, Fed-QSSL deploys de-quantization, weighted aggregation and re-quantization, ultimately creating models personalized to both data distribution as well as specific infrastructure of each client's device. We validated the proposed algorithm on real-world datasets, demonstrating its efficacy, and theoretically analyzed the impact of low-bit training on the convergence and robustness of the learned models.
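The server-side pipeline (de-quantization, weighted aggregation, re-quantization) can be sketched as follows. This is a hedged illustration under the assumption of a uniform affine quantizer and data-size aggregation weights; the paper's actual quantization scheme and weighting may differ.

import numpy as np

def dequantize(q, scale, zero_point):
    """Map low-bit integer codes back to floats (uniform affine scheme assumed)."""
    return scale * (q.astype(np.float32) - zero_point)

def quantize(w, num_bits=4):
    """Uniform affine quantization to num_bits; returns codes plus metadata."""
    qmax = 2 ** num_bits - 1
    scale = max((w.max() - w.min()) / qmax, 1e-12)
    zero_point = -w.min() / scale
    q = np.clip(np.round(w / scale + zero_point), 0, qmax).astype(np.int32)
    return q, scale, zero_point

def server_round(client_updates, client_weights, client_bits):
    """De-quantize each client's low-bit update, aggregate with (assumed)
    data-size weights, then re-quantize per client at that client's bit-width."""
    floats = [dequantize(q, s, z) for (q, s, z) in client_updates]
    weights = np.asarray(client_weights, dtype=np.float32)
    weights /= weights.sum()
    aggregated = sum(w * f for w, f in zip(weights, floats))
    return [quantize(aggregated, bits) for bits in client_bits]

Re-quantizing per client at that client's own bit-width is what lets the returned model match each device's infrastructure, as the abstract describes.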



Paperid:1272
Authors:Yuzhou Chen, Jose Frias, Yulia R. Gel
Temple University, UNAM, University of Texas at Dallas National Science Foundation
Abstract:
Graph contrastive learning (GCL) has recently emerged as a new concept which allows for capitalizing on the strengths of graph neural networks (GNNs) to learn rich representations in a wide variety of applications which involve abundant unlabeled information. However, existing GCL approaches largely tend to overlook the important latent information on higher-order graph substructures. We address this limitation by introducing the concepts of topological invariance and extended persistence on graphs to GCL. In particular, we propose a new contrastive mode which targets topological representations of the two augmented views from the same graph, yielded by extracting latent shape properties of the graph at multiple resolutions. Along with the extended topological layer, we introduce a new extended persistence summary, namely, extended persistence landscapes (EPL) and derive its theoretical stability guarantees. Our extensive numerical results on biological, chemical, and social interaction graphs show that the new Topological Graph Contrastive Learning (TopoGCL) model delivers significant performance gains in unsupervised graph classification for 8 out of 12 considered datasets and also exhibits robustness under noisy scenarios.



Paperid:1273
Authors:Zhiqiang Chen, Yang Chen, Xiaolong Zou, Shan Yu
Beijing Academy of Artificial Intelligence, Beijing, China Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation (CASIA), Beijing, China, Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation (CASIA), Beijing, China, Qiyuan Lab, Beijing, China, Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation (CASIA), Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China
Abstract:
Neural population coding can represent continuous information by neurons with a series of discrete preferred stimuli, and we find that the bell-shaped tuning curve plays an important role in this mechanism. Inspired by this, we incorporate a bell-shaped tuning curve into the discrete group convolution to achieve continuous group equivariance. Simply, we modulate group convolution kernels by Gaussian functions to obtain bell-shaped tuning curves. Benefiting from the modulation, kernels also gain smooth gradients on geometric dimensions (e.g., location dimension and orientation dimension). It allows us to generate group convolution kernels from sparse weights with learnable geometric parameters, which can achieve both competitive performances and parameter efficiencies. Furthermore, we quantitatively prove that discrete group convolutions with proper tuning curves (wider than 1x the sampling step) can achieve continuous equivariance. Experimental results show that 1) our approach achieves very competitive performances on MNIST-rot with at least 75% fewer parameters compared with previous SOTA methods, which is parameter-efficient; 2) Especially with small sample sizes, our approach exhibits more pronounced performance improvements (up to 24%); 3) It also has excellent rotation generalization ability on various datasets such as MNIST, CIFAR, and ImageNet with both plain and ResNet architectures.
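Generating a smooth kernel from sparse weights with learnable geometric parameters can be illustrated in a few lines. The sketch below covers only the spatial (location) dimension and treats the kernel size, the number of sparse weights, and a shared sigma as assumptions; the paper's group-equivariant construction over orientations is not reproduced.

import torch

def gaussian_modulated_kernel(weights, centers, sigma=1.0, size=5):
    """Build a dense (size, size) kernel in which each sparse weight
    contributes a Gaussian bump (the bell-shaped tuning curve) centred at
    its learnable location. weights: (n,), centers: (n, 2) in kernel
    coordinates; both may require gradients."""
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32),
                            indexing="ij")
    grid = torch.stack([ys, xs], dim=-1)              # (size, size, 2)
    diff = grid[None] - centers[:, None, None, :]     # (n, size, size, 2)
    bumps = torch.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))
    return (weights[:, None, None] * bumps).sum(0)    # (size, size)

# Example: 3 sparse weights with learnable locations generate a smooth 5x5 kernel.
w = torch.randn(3, requires_grad=True)
c = torch.tensor([[1.0, 1.0], [2.0, 3.0], [3.5, 2.0]], requires_grad=True)
k = gaussian_modulated_kernel(w, c)

Because the bumps are smooth in the centre coordinates, gradients flow into the geometric parameters themselves, which is the property the abstract highlights.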



Paperid:1274
Authors:Ziliang Chen, Yongsen Zheng, Zhao-Rong Lai, Quanlong Guan, Liang Lin
Jinan University Pazhou Lab, Sun Yat-sen University, Jinan University, Jinan University, Sun Yat-sen University
Abstract:
Invariant representation learning (IRL) encourages the prediction from invariant causal features to labels deconfounded from the environments, advancing the technical roadmap of out-of-distribution (OOD) generalization. Despite the spotlight around them, recent theoretical results verified that some causal features recovered by IRLs merely appear domain-invariant in the training environments but fail in unseen domains. The fake invariance severely endangers OOD generalization since a trustworthy objective cannot be diagnosed and existing causal remedies are invalid to rectify it. In this paper, we review an IRL family (InvRat) under the Partially and Fully Informative Invariant Feature Structural Causal Models (PIIF SCM/FIIF SCM), respectively, to certify their weaknesses in representing fake invariant features, and then unify their causal diagrams to propose the ReStructured SCM (RS-SCM). RS-SCM can ideally rebuild the spurious and the fake invariant features simultaneously. Given this, we further develop an approach based on conditional mutual information with respect to RS-SCM, then rigorously rectify the spurious and fake invariant effects. It can be easily implemented by a small feature selection subnet introduced in the IRL family, which is alternatively optimized to achieve our goal. Experiments verify the superiority of our approach in fighting against the fake invariance issue across a variety of OOD generalization benchmarks.



Paperid:1275
Authors:Debo Cheng, Ziqi Xu, Jiuyong Li, Lin Liu, Jixue Liu, Wentao Gao, Thuc Duy Le
University of South Australia, University of South Australia, University of South Australia, University of South Australia, University of South Australia, University of South Australia, University of South Australia
Abstract:
Causal inference from longitudinal observational data is a challenging problem due to the difficulty in correctly identifying the time-dependent confounders, especially in the presence of latent time-dependent confounders. Instrumental variable (IV) is a powerful tool for addressing the latent confounders issue, but the traditional IV technique cannot deal with latent time-dependent confounders in longitudinal studies. In this work, we propose a novel Time-dependent Instrumental Factor Model (TIFM) for time-varying causal effect estimation from data with latent time-dependent confounders. At each time-step, the proposed TIFM method employs the Recurrent Neural Network (RNN) architecture to infer latent IV, and then uses the inferred latent IV factor for addressing the confounding bias caused by the latent time-dependent confounders. We provide a theoretical analysis for the proposed TIFM method regarding causal effect estimation in longitudinal data. Extensive evaluation with synthetic datasets demonstrates the effectiveness of TIFM in addressing causal effect estimation over time. We further apply TIFM to a climate dataset to showcase the potential of the proposed method in tackling real-world problems.



Paperid:1276
Authors:Ji Cheng, Bo Xue, Jiaxiang Yi, Qingfu Zhang
Department of Computer Science, City University of Hong Kong The City University of Hong Kong Shenzhen Research Institute, Department of Computer Science, City University of Hong Kong The City University of Hong Kong Shenzhen Research Institute, Department of Material Engineering, Delft University of Technology, Department of Computer Science, City University of Hong Kong The City University of Hong Kong Shenzhen Research Institute
Abstract:
Multi-objective Stochastic Linear Bandit (MOSLB) plays a critical role in the sequential decision-making paradigm; however, most existing methods focus on the Pareto dominance among different objectives without considering any priority. In this paper, we study bandit algorithms under mixed Pareto-lexicographic orders, which can reflect decision makers' preferences. We adopt the Grossone approach to deal with these orders and develop the notion of Pareto-lexicographic optimality to evaluate the learners' performance. Our work represents a first attempt to address these important and realistic orders in bandit algorithms. To design algorithms under these orders, the upper confidence bound (UCB) policy and the prior-free lexicographic filter are adapted to approximate the optimal arms at each round. Moreover, the framework of the algorithms involves two stages in pursuit of the balance between exploration and exploitation. Theoretical analysis as well as numerical experiments demonstrate the effectiveness of our algorithms.



Paperid:1277
Authors:Shu-Ling Cheng, Chin-Yuan Yeh, Ting-An Chen, Eliana Pastor, Ming-Syan Chen
Graduate Institute of Communication Engineering, National Taiwan University, Taiwan, Graduate Institute of Communication Engineering, National Taiwan University, Taiwan Institute of Information Science, Academia Sinica, Taiwan, Department of Electrical Engineering, National Taiwan University, Taiwan Institute of Information Science, Academia Sinica, Taiwan, Department of Control and Computer Engineering, Politecnico di Torino, Italy, Graduate Institute of Communication Engineering, National Taiwan University, Taiwan Department of Electrical Engineering, National Taiwan University, Taiwan
Abstract:
To achieve better performance and greater fairness in Federated Learning (FL), much of the existing research has centered on individual clients, using domain adaptation techniques and redesigned aggregation schemes to counteract client data heterogeneity. However, an overlooked scenario exists where clients belong to distinctive groups, or client types, in which groups of clients share similar characteristics such as device specifications or data patterns. Despite being common in group collaborations, this scenario has been overlooked in previous research, potentially leading to performance degradation and systemic biases against certain client types. To bridge this gap, we introduce Federated learning with Group Customization and Reweighting (FedGCR). FedGCR enhances both performance and fairness for FL with Distinct Client Types, consisting of a Federated Group Customization (FedGC) model to provide customization via a novel prompt tuning technique to mitigate the data disparity across different client types, and a Federated Group Reweighting (FedGR) aggregation scheme to ensure uniform and unbiased performances between clients and between client types by a novel reweighting approach. Extensive experiment comparisons with prior FL methods in domain adaptation and fairness demonstrate the superiority of FedGCR in all metrics, including the overall accuracy and performance uniformity at both the group and the individual level. FedGCR achieves 82.74% accuracy and 12.26(↓) in performance uniformity on the Digit-Five dataset and 81.88% and 14.88%(↓) on DomainNet with a domain imbalance factor of 10, which significantly outperforms the state-of-the-art. Code is available at https://github.com/celinezheng/fedgcr.



Paperid:1278
Authors:Xu Cheng, Hao Zhang, Yue Xin, Wen Shen, Quanshi Zhang
Nanjing University of Science and Technology, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Adversarial training is usually difficult to optimize. This paper provides conceptual and analytic insights into the difficulty of adversarial training via a simple theoretical study, where we derive an approximate dynamics of a recursive multi-step attack in a simple setting. Despite the simplicity of our theory, it still reveals verifiable predictions about various phenomena in adversarial training under real-world settings. First, compared to vanilla training, adversarial training is more likely to boost the influence of input samples with large gradient norms in an exponential manner. Besides, adversarial training also strengthens the influence of the Hessian matrix of the loss w.r.t. network parameters, which is more likely to make network parameters oscillate and boosts the difficulty of adversarial training.



Paperid:1279
Authors:Yi Cheng, Renjun Hu, Haochao Ying, Xing Shi, Jian Wu, Wei Lin
Zhejiang University, Alibaba Group, Zhejiang University, Alibaba, Zhejiang University, Alibaba Group
Abstract:
Until recently, the question of the effective inductive bias of deep models on tabular data has remained unanswered. This paper investigates the hypothesis that arithmetic feature interaction is necessary for deep tabular learning. To test this point, we create a synthetic tabular dataset with a mild feature interaction assumption and examine a modified transformer architecture enabling arithmetical feature interactions, referred to as AMFormer. Results show that AMFormer outperforms strong counterparts in fine-grained tabular data modeling, data efficiency in training, and generalization. This is attributed to its parallel additive and multiplicative attention operators and prompt-based optimization, which facilitate the separation of tabular samples in an extended space with arithmetically-engineered features. Our extensive experiments on real-world data also validate the consistent effectiveness, efficiency, and rationale of AMFormer, suggesting it has established a strong inductive bias for deep learning on tabular data. Code is available at https://github.com/aigc-apps/AMFormer.



Paperid:1280
Authors:Yuxiao Cheng, Lianglong Li, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, Qionghai Dai
Tsinghua University, Tsinghua University, Tsinghua University, Chinese PLA General Hospital, Tsinghua University, Chinese PLA General Hospital, Tsinghua University
Abstract:
Causal discovery in time-series is a fundamental problem in the machine learning community, enabling causal reasoning and decision-making in complex scenarios. Recently, researchers successfully discover causality by combining neural networks with Granger causality, but their performance degrades substantially when encountering high-dimensional data because of the highly redundant network design and huge causal graphs. Moreover, the missing entries in the observations further hamper causal structure learning. To overcome these limitations, we propose CUTS+, which is built on the Granger-causality-based causal discovery method CUTS and raises the scalability by introducing a technique called Coarse-to-fine-discovery (C2FD) and leveraging a message-passing-based graph neural network (MPGNN). Compared to previous methods on simulated, quasi-real, and real datasets, we show that CUTS+ largely improves the causal discovery performance on high-dimensional data with different types of irregular sampling.



Paperid:1281
Authors:Jinjin Chi, Zhichao Zhang, Zhiyao Yang, Jihong Ouyang, Hongbin Pei
College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, China, MOE KLINNS Lab, School of Cyber Science and Engineering, Xi'an Jiaotong University, China
Abstract:
Variational Inference (VI) has gained popularity as a flexible approximate inference scheme for computing posterior distributions in Bayesian models. Original VI methods use Kullback-Leibler (KL) divergence to construct variational objectives. However, KL divergence has zero-forcing behavior and is completely agnostic to the metric of the underlying data distribution, resulting in bad approximations. To alleviate this issue, we propose a new variational objective by using Optimal Transport (OT) distance, which is a metric-aware divergence, to measure the difference between approximate posteriors and priors. The superior performance of OT distance enables us to learn more accurate approximations. We further enhance the objective by gradually including the OT term using a hyperparameter λ for over-parameterized models. We develop a Variational inference method with OT (VOT) which presents a gradient-based black-box framework for solving Bayesian models, even when the density function of approximate distribution is not available. We provide the consistency analysis of approximate posteriors and demonstrate the practical effectiveness on Bayesian neural networks and variational autoencoders.



Paperid:1282
Authors:Woojin Cho, Seunghyeon Cho, Hyundong Jin, Jinsung Jeon, Kookjin Lee, Sanghyun Hong, Dongeun Lee, Jonghyun Choi, Noseong Park
Yonsei University, Yonsei University, Yonsei University, Yonsei University, Arizona State University, Oregon State University, Texas A&M University-Commerce, Yonsei University, Yonsei University
Abstract:
Neural ordinary differential equations (NODEs), one of the most influential works in differential equation-based deep learning, continuously generalize residual networks and opened a new field. They are currently utilized for various downstream tasks, e.g., image classification, time series classification, image generation, etc. Their key part is how to model the time-derivative of the hidden state, denoted dh(t)/dt. People have habitually used conventional neural network architectures, e.g., fully-connected layers followed by non-linear activations. In this paper, however, we present a neural operator-based method to define the time-derivative term. Neural operators were initially proposed to model the differential operator of partial differential equations (PDEs). Since the time-derivative of NODEs can be understood as a special type of the differential operator, our proposed method, called branched Fourier neural operator (BFNO), makes sense. In our experiments with general downstream tasks, our method significantly outperforms existing methods.
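For readers less familiar with the NODE formulation referenced above (standard background, stated here for completeness), the hidden state evolves as

\frac{dh(t)}{dt} = f_\theta\bigl(h(t), t\bigr), \qquad h(t_1) = h(t_0) + \int_{t_0}^{t_1} f_\theta\bigl(h(t), t\bigr)\, dt,

where f_\theta is conventionally a small fully-connected network evaluated by a numerical ODE solver; the paper's proposal is to replace this parameterization of f_\theta with a branched Fourier neural operator.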



Paperid:1283
Authors:Youngjae Cho, HeeSun Bae, Seungjae Shin, Yeo Dong Youn, Weonyoung Joo, Il-Chul Moon
KAIST, KAIST, KAIST, Seoul National University, EWHA Womans University, KAIST
Abstract:
Recent vision-language pre-trained (VLP) models have become the backbone for many downstream tasks, but they are utilized as frozen models without learning. Prompt learning is a method to improve the pre-trained VLP model by adding a learnable context vector to the inputs of the text encoder. In a few-shot learning scenario of the downstream task, MLE training can lead the context vector to over-fit dominant image features in the training data. This overfitting can potentially harm the generalization ability, especially in the presence of a distribution shift between the training and test dataset. This paper presents a Bayesian-based framework of prompt tuning, which can alleviate the over-fitting issue in few-shot learning applications and increase the adaptability of prompts on unobserved instances. Specifically, modeling data-dependent prior enhances the adaptability of text features for both seen and unseen image features without the trade-off of performance between them. Based on the Bayesian framework, we utilize the Wasserstein gradient flow in the estimation of our target posterior distribution, which enables our prompt to be flexible in capturing the complex modes of image features. We demonstrate the effectiveness of our method on benchmark datasets for several experiments by showing statistically significant improvements in performance compared to existing methods.



Paperid:1284
Authors:Jae Choi, Yuzhou Chen, Huikyo Lee, Hyun Kim, Yulia R. Gel
The University of Texas at Dallas, Temple University, NASA, Jet Propulsion Laboratory, California Institute of Technology, National Oceanic and Atmospheric Administration, Air Resources Laboratory, The University of Texas at Dallas National Science Foundation
Abstract:
Dynamics of many complex systems, from weather and climate to spread of infectious diseases, can be described by partial differential equations (PDEs). Such PDEs involve unknown function(s), partial derivatives, and typically multiple independent variables. The traditional numerical methods for solving PDEs assume that the data are observed on a regular grid. However, in many applications, for example, weather and air pollution monitoring delivered by the arbitrarily located weather stations of the National Weather Services, data records are irregularly spaced. Furthermore, in problems involving prediction analytics such as forecasting wildfire smoke plumes, the primary focus may be on a set of irregular locations associated with urban development. In recent years, deep learning (DL) methods and, in particular, graph neural networks (GNNs) have emerged as a new promising tool that can complement traditional PDE solvers in scenarios with irregularly spaced data, contributing to the newest research trend of physics-informed machine learning (PIML). However, most existing PIML methods tend to be limited in their ability to describe higher-dimensional structural properties exhibited by real-world phenomena, especially ones that live on manifolds. To address this fundamental challenge, we bring the elements of the Hodge theory and, in particular, simplicial convolution defined on the Hodge Laplacian to the emerging nexus of DL and PDEs. In contrast to the conventional Laplacian and the associated convolution operation, the simplicial convolution allows us to rigorously describe diffusion across higher-order structures and to better approximate the complex underlying topology and geometry of the data. The new approach, Simplicial Neural Networks for Partial Differential Equations (SNN-PDE), offers a computationally efficient yet effective solution for time-dependent PDEs. Our studies of a broad range of synthetic data and wildfire processes demonstrate that SNN-PDE improves upon state-of-the-art baselines in handling unstructured grids and irregular time intervals of complex physical systems and offers competitive forecasting capabilities for weather and air quality forecasting.
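The Hodge Laplacian underlying the simplicial convolution mentioned above has a standard matrix form (general background, not a construction specific to this paper). With B_k denoting the k-th boundary (incidence) matrix of the simplicial complex,

L_k = B_k^{\top} B_k + B_{k+1} B_{k+1}^{\top},

so L_0 reduces to the ordinary graph Laplacian, while L_1, L_2, ... act on edge and triangle signals. A common (assumed, not necessarily the paper's exact layer) simplicial convolution then applies a low-order polynomial in L_k to k-cochains, H' = \sigma\bigl(\sum_{j=0}^{J} \theta_j L_k^{\,j} H\bigr), which is what allows diffusion across higher-order structures rather than only across graph edges.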



Paperid:1285
Authors:Jongwook Choi, Sungtae Lee, Xinyu Wang, Sungryull Sohn, Honglak Lee
University of Michigan, Individual Researcher, University of Michigan, LG AI Research, University of Michigan LG AI Research
Abstract:
We present COIL (Counterfactual Object Interaction Learning), a novel way of learning skills of object interactions in entity-centric environments. The goal is to learn primitive behaviors that can induce interactions without external reward or any supervision. Existing skill discovery methods are limited to locomotion, simple navigation tasks, or single-object manipulation tasks, mostly not inducing interaction between objects. Unlike a monolithic representation usually used in prior skill learning methods, we propose to use a structured goal representation that can query and scope which objects to interact with, which can serve as a basis for solving more complex downstream tasks. We design a novel counterfactual intrinsic reward through the use of either a forward model or successor features that can learn an interaction skill between a pair of objects given as a goal. Through experiments on continuous control environments such as Magnetic Block and 2.5-D Stacking Box, we demonstrate that an agent can learn object interaction behaviors (e.g., attaching or stacking one block to another) without any external rewards or domain-specific knowledge.



Paperid:1286
Authors:Won-Seok Choi, Hyundo Lee, Dong-Sig Han, Junseok Park, Heeyeon Koo, Byoung-Tak Zhang
Seoul National University, Seoul National University, Seoul National University, Seoul National University, Yonsei University, Seoul National University AI Institute of Seoul National University (AIIS)
Abstract:
Recent machine learning algorithms have been developed using well-curated datasets, which often require substantial cost and resources. On the other hand, the direct use of raw data often leads to overfitting towards frequently occurring class information. To address class imbalances cost-efficiently, we propose an active data filtering process during self-supervised pre-training in our novel framework, Duplicate Elimination (DUEL). This framework integrates an active memory inspired by human working memory and introduces distinctiveness information, which measures the diversity of the data in the memory, to optimize both the feature extractor and the memory. The DUEL policy, which replaces the most duplicated data with new samples, aims to enhance the distinctiveness information in the memory and thereby mitigate class imbalances. We validate the effectiveness of the DUEL framework in class-imbalanced environments, demonstrating its robustness and providing reliable results in downstream tasks. We also analyze the role of the DUEL policy in the training process through various metrics and visualizations.



Paperid:1287
Authors:Wonjeong Choi, Jungwuk Park, Dong-Jun Han, Younghyun Park, Jaekyun Moon
Korea Advanced Institute of Science and Technology (KAIST), Korea Advanced Institute of Science and Technology (KAIST), Purdue University, Korea Advanced Institute of Science and Technology (KAIST), Korea Advanced Institute of Science and Technology (KAIST)
Abstract:
Research interests in the robustness of deep neural networks against domain shifts have been rapidly increasing in recent years. Most existing works, however, focus on improving the accuracy of the model, not the calibration performance, which is another important requirement for trustworthy AI systems. Temperature scaling (TS), an accuracy-preserving post-hoc calibration method, has been proven to be effective in in-domain settings, but not in out-of-domain (OOD) settings due to the difficulty in obtaining a validation set for the unseen domain beforehand. In this paper, we propose consistency-guided temperature scaling (CTS), a new temperature scaling strategy that can significantly enhance the OOD calibration performance by providing mutual supervision among data samples in the source domains. Motivated by our observation that over-confidence stemming from inconsistent sample predictions is the main obstacle to OOD calibration, we propose to guide the scaling process by taking consistencies into account in terms of two different aspects, style and content, which are the key components that can well-represent data samples in multi-domain settings. Experimental results demonstrate that our proposed strategy outperforms existing works, achieving superior OOD calibration performance on various datasets. This can be accomplished by employing only the source domains without compromising accuracy, making our scheme directly applicable to various trustworthy AI systems.
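For context, the sketch below shows plain post-hoc temperature scaling, the accuracy-preserving baseline that CTS builds on; the consistency-guided supervision described in the abstract is not reproduced here, and the grid search and toy data are illustrative assumptions.

    import numpy as np

    def nll(logits, labels, T):
        # Negative log-likelihood of temperature-scaled softmax probabilities.
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
        # Vanilla post-hoc temperature scaling: pick T minimising validation NLL.
        return min(grid, key=lambda T: nll(logits, labels, T))

    # Toy usage with random "validation" logits.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(256, 10)) * 3.0
    labels = rng.integers(0, 10, size=256)
    print("fitted temperature:", fit_temperature(logits, labels))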



Paperid:1288
Authors:Agniva Chowdhury, Pradeep Ramuhalli
Oak Ridge National Laboratory, TN, USA, Oak Ridge National Laboratory, TN, USA
Abstract:
In statistics and machine learning, logistic regression is a widely used supervised learning technique primarily employed for binary classification tasks. When the number of observations greatly exceeds the number of predictor variables, we present a simple, randomized sampling-based algorithm for the logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of estimated probabilities of logistic regression when leverage scores are used to sample observations, and prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of using randomized sampling approaches to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets.
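A minimal sketch of the general recipe the abstract describes, assuming a full-rank tall design matrix: compute row leverage scores, sample a small set of observations proportionally, reweight, and fit; the exact estimator, sampling distribution, and error guarantees in the paper may differ.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def leverage_scores(X):
        # Row leverage scores from a thin QR decomposition of the design matrix.
        Q, _ = np.linalg.qr(X)
        return (Q ** 2).sum(axis=1)

    def sampled_logistic_fit(X, y, sample_size, rng):
        # Sample rows with probability proportional to leverage, reweight, and fit.
        p = leverage_scores(X)
        p = p / p.sum()
        idx = rng.choice(len(X), size=sample_size, replace=True, p=p)
        w = 1.0 / (sample_size * p[idx])   # importance weights
        return LogisticRegression(max_iter=1000).fit(X[idx], y[idx], sample_weight=w)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20000, 10))
    beta = rng.normal(size=10)
    y = (rng.random(20000) < 1 / (1 + np.exp(-X @ beta))).astype(int)
    model = sampled_logistic_fit(X, y, sample_size=1000, rng=rng)
    print(model.coef_.round(2))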



Paperid:1289
Authors:Jayabrata Chowdhury, Venkataramanan Shivaraman, Suresh Sundaram, PB Sujit
Indian Institute of Science, Bengaluru, Indian Institute of Science Education and Research, Bhopal, Indian Institute of Science, Bengaluru, Indian Institute of Science Education and Research, Bhopal
Abstract:
Recent advancements in motion planning for Autonomous Vehicles (AVs) show great promise in using expert driver behaviors in non-stationary driving environments. However, learning only from expert drivers lacks the generalizability needed to recover from domain shifts and near-failure scenarios caused by the dynamic behavior of traffic participants and weather conditions. A deep Graph-based Prediction and Planning Policy Network (GP3Net) framework is proposed for non-stationary environments that encodes the interactions between traffic participants with contextual information and provides decisions for safe maneuvers of the AV. A spatio-temporal graph models the interactions between traffic participants for predicting the future trajectories of those participants. The predicted trajectories are utilized to generate a future occupancy map around the AV with uncertainties embedded to anticipate the evolving non-stationary driving environments. Then the contextual information and future occupancy maps are input to the policy network of the GP3Net framework and trained using the Proximal Policy Optimization (PPO) algorithm. The performance of the proposed GP3Net is evaluated on standard CARLA benchmarking scenarios with domain shifts of traffic patterns (urban, highway, and mixed). The results show that GP3Net outperforms previous state-of-the-art imitation learning-based planning models for different towns. Further, in unseen new weather conditions, GP3Net completes the desired route with fewer traffic infractions. Finally, the results emphasize the advantage of including the prediction module to enhance safety measures in non-stationary environments.



Paperid:1290
Authors:Haoyu Chu, Shikui Wei, Ting Liu, Yao Zhao, Yuto Miyatake
Institute of Information Science, Beijing Jiaotong University Graduate School of Information Science and Technology, Osaka University Beijing Key Laboratory of Advanced Information Science and Network Technology, Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network Technology, School of Computer Science, Northwestern Polytechnical University, Institute of Information Science, Beijing Jiaotong University Beijing Key Laboratory of Advanced Information Science and Network Technology, Cybermedia Center, Osaka University
Abstract:
Deep equilibrium (DEQ) models have emerged as a promising class of implicit layer models, which abandon traditional depth by solving for the fixed points of a single nonlinear layer. Despite their success, the stability of the fixed points for these models remains poorly understood. By considering DEQ models as nonlinear dynamic systems, we propose a robust DEQ model named LyaDEQ with provable stability guarantees via Lyapunov theory. The crux of our method is ensuring the Lyapunov stability of the DEQ model's fixed points, which enables the proposed model to resist minor initial perturbations. To avoid poor adversarial defense due to Lyapunov-stable fixed points being located near each other, we orthogonalize the layers after the Lyapunov stability module to separate different fixed points. We evaluate LyaDEQ models under well-known adversarial attacks, and experimental results demonstrate significant improvement in robustness. Furthermore, we show that the LyaDEQ model can be combined with other defense methods, such as adversarial training, to achieve even better adversarial robustness.
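A minimal sketch of a deep-equilibrium layer solved by naive fixed-point iteration; for illustration it uses spectral normalisation to keep the update a contraction, which is a simpler stand-in for the Lyapunov stability module and orthogonalised layers described in the abstract.

    import torch
    import torch.nn as nn

    class TinyDEQ(nn.Module):
        # Minimal deep-equilibrium layer: solve z = tanh(W z + U x) by fixed-point iteration.
        def __init__(self, dim):
            super().__init__()
            self.W = nn.utils.spectral_norm(nn.Linear(dim, dim, bias=False))
            self.U = nn.Linear(dim, dim)

        def forward(self, x, iters=50, tol=1e-5):
            z = torch.zeros_like(x)
            for _ in range(iters):
                z_next = torch.tanh(self.W(z) + self.U(x))
                if (z_next - z).norm() < tol:
                    break
                z = z_next
            return z

    layer = TinyDEQ(dim=16)
    x = torch.randn(4, 16)
    print(layer(x).shape)   # torch.Size([4, 16])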



Paperid:1291
Authors:Xiangxiang Chu, Liang Li, Bo Zhang
Meituan, Meituan, Meituan
Abstract:
The trade-off between performance and inference speed is critical for practical applications. Architecture reparameterization achieves better trade-offs and is becoming an increasingly popular ingredient in modern convolutional neural networks. Nonetheless, its quantization performance is usually too poor to deploy (e.g., a top-1 accuracy drop of more than 20% on ImageNet) when INT8 inference is desired. In this paper, we dive into the underlying mechanism of this failure, where the original design inevitably enlarges quantization error. We propose a simple, robust, and effective remedy to have a quantization-friendly structure that also enjoys reparameterization benefits. Our method greatly bridges the gap between INT8 and FP32 accuracy for RepVGG. Without bells and whistles, the top-1 accuracy drop on ImageNet is reduced to within 2% by standard post-training quantization. Extensive experiments on detection and semantic segmentation tasks verify its generalization.
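For background, the sketch below shows the standard structural reparameterization step the abstract refers to: fusing parallel 3x3, 1x1, and identity branches into a single 3x3 kernel (BatchNorm folding omitted). The quantization-friendly redesign proposed in the paper is not shown; shapes and the sanity check are illustrative.

    import numpy as np

    def merge_repvgg_branches(k3, k1, channels):
        # Fuse parallel 3x3, 1x1 and identity branches into one 3x3 kernel.
        fused = k3.copy()
        fused[:, :, 1, 1] += k1[:, :, 0, 0]      # pad the 1x1 kernel to the centre tap
        for c in range(channels):
            fused[c, c, 1, 1] += 1.0             # identity branch
        return fused

    C = 8
    k3 = np.random.randn(C, C, 3, 3)
    k1 = np.random.randn(C, C, 1, 1)
    fused = merge_repvgg_branches(k3, k1, C)

    # Sanity check at the centre tap: branch sum equals the fused kernel response.
    x = np.random.randn(C)
    lhs = k3[:, :, 1, 1] @ x + k1[:, :, 0, 0] @ x + x
    rhs = fused[:, :, 1, 1] @ x
    print(np.allclose(lhs, rhs))                 # True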



Paperid:1292
Authors:Zhendong Chu, Renqin Cai, Hongning Wang
University of Virginia, Meta, University of Virginia
Abstract:
Meta-reinforcement learning (meta-RL) aims to quickly solve new RL tasks by leveraging knowledge from prior tasks. Previous studies often assume a single-mode homogeneous task distribution, ignoring possible structured heterogeneity among tasks. Such an oversight can hamper effective exploration and adaptation, especially with limited samples. In this work, we harness the structured heterogeneity among tasks via clustering to improve meta-RL, which facilitates knowledge sharing at the cluster level. To facilitate exploration, we also develop a dedicated cluster-level exploratory policy to discover task clusters via divide-and-conquer. The knowledge from the discovered clusters helps to narrow the search space of task-specific policy learning, leading to more sample-efficient policy adaptation. We evaluate the proposed method on environments with parametric clusters (e.g., rewards and state dynamics in the MuJoCo suite) and non-parametric clusters (e.g., control skills in the Meta-World suite). The results demonstrate strong advantages of our solution against a set of representative meta-RL methods.



Paperid:1293
Authors:Zhixuan Chu, Mengxuan Hu, Qing Cui, Longfei Li, Sheng Li
Ant Group, University of Virginia, Ant Group, Ant Group, University of Virginia
Abstract:
Artificial intelligence has seen tremendous recent successes in many areas, sparking great interest in its potential for trustworthy and interpretable risk prediction. However, most models lack causal reasoning and struggle with class imbalance, leading to poor precision and recall. To address this, we propose a Task-Driven Causal Feature Distillation model (TDCFD) to transform original feature values into causal feature attributions for the specific risk prediction task. The causal feature attribution describes how much the value of each feature contributes to the risk prediction result. After the causal feature distillation, a deep neural network is applied to produce trustworthy prediction results with causal interpretability and high precision/recall. We evaluate the performance of our TDCFD method on several synthetic and real datasets, and the results demonstrate its superiority over the state-of-the-art methods regarding precision, recall, interpretability, and causality.



Paperid:1294
Authors:Joseph Clements, Yingjie Lao
Clemson University, Clemson, South Carolina, 29634 Applied Research Associates, Albuquerque, New Mexico, 87110, Clemson University, Clemson, South Carolina, 29634 Tufts University, Medford, Massachusetts, 02155
Abstract:
Deep learning intellectual properties (IPs) are high-value assets that are frequently susceptible to theft. This vulnerability has led to significant interest in defending the field's intellectual properties from theft. Recently, watermarking techniques have been extended to protect deep learning hardware from piracy. These techniques embed modifications that change the hardware's behavior when activated. In this work, we propose the first method for embedding watermarks in deep learning hardware that incorporates the owner's key samples into the embedding methodology. This improves our watermarks' reliability and efficiency in identifying the hardware over those generated using randomly selected key samples. Our experimental results demonstrate that by considering the target key samples when generating the hardware modifications, we can significantly increase the embedding success rate while targeting fewer functional blocks, decreasing the required hardware overhead needed to defend it.



Paperid:1295
Authors:Daniel Coelho, Miguel Oliveira, Vitor Santos
Department of Mechanical Engineering, University of Aveiro, 3810-193 Aveiro, Portugal; Intelligent System Associate Laboratory (LASI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, 3810-193 Aveiro, Portugal, Department of Mechanical Engineering, University of Aveiro, 3810-193 Aveiro, Portugal; Intelligent System Associate Laboratory (LASI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, 3810-193 Aveiro, Portugal, Department of Mechanical Engineering, University of Aveiro, 3810-193 Aveiro, Portugal; Intelligent System Associate Laboratory (LASI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, 3810-193 Aveiro, Portugal
Abstract:
Reinforcement Learning from Demonstrations (RLfD) has emerged as an effective method by fusing expert demonstrations into Reinforcement Learning (RL) training, harnessing the strengths of both Imitation Learning (IL) and RL. However, existing algorithms rely on offline demonstrations, which can introduce a distribution gap between the demonstrations and the actual training environment, limiting their performance. In this paper, we propose a novel approach, Reinforcement Learning from Online Demonstrations (RLfOLD), that leverages online demonstrations to address this limitation, ensuring the agent learns from relevant and up-to-date scenarios, thus effectively bridging the distribution gap. Unlike conventional policy networks used in typical actor-critic algorithms, RLfOLD introduces a policy network that outputs two standard deviations: one for exploration and the other for IL training. This novel design allows the agent to adapt to varying levels of uncertainty inherent in both RL and IL. Furthermore, we introduce an exploration process guided by an online expert, incorporating an uncertainty-based technique. Our experiments on the CARLA NoCrash benchmark demonstrate the effectiveness and efficiency of RLfOLD. Notably, even with a significantly smaller encoder and a single camera setup, RLfOLD surpasses state-of-the-art methods in this evaluation. These results, achieved with limited resources, highlight RLfOLD as a highly promising solution for real-world applications.



Paperid:1296
Authors:Yulai Cong, Sijia Li
Sun Yat-sen University, Sun Yat-sen University
Abstract:
Mixture models serve as a fundamental tool with versatile applications. However, their training techniques, like the popular Expectation Maximization (EM) algorithm, are notoriously sensitive to parameter initialization and often suffer from bad local optima that could be arbitrarily worse than the optimum. To address the long-lasting bad-local-optima challenge, we draw inspiration from the recent ground-breaking foundation models and propose to leverage their underlying big learning principle to upgrade the EM. Specifically, we present the Big Learning EM (BigLearn-EM), an EM upgrade that simultaneously performs joint, marginal, and orthogonally transformed marginal matchings between data and model distributions. Through simulated experiments, we empirically show that the BigLearn-EM is capable of delivering the optimum with high probability; comparisons on benchmark clustering datasets further demonstrate its effectiveness and advantages over existing techniques. The code is available at https://github.com/YulaiCong/Big-Learning-Expectation-Maximization.



Paperid:1297
Authors:Baris Coskunuzer, Ignacio Segovia-Dominguez, Yuzhou Chen, Yulia R. Gel
University of Texas at Dallas, Department of Mathematical Sciences, West Virginia University, School of Mathematical & Data Sciences, Temple University, Department of Computer and Information Sciences, University of Texas at Dallas, Department of Mathematical Sciences National Science Foundation
Abstract:
Learning time-evolving objects such as multivariate time series and dynamic networks requires the development of novel knowledge representation mechanisms and neural network architectures, which allow for capturing implicit time-dependent information contained in the data. Such information is typically not directly observed but plays a key role in the learning task performance. In turn, the lack of a time dimension in knowledge encoding mechanisms for time-dependent data leads to frequent model updates, poor learning performance, and, as a result, subpar decision-making. Here we propose a new approach to a time-aware knowledge representation mechanism that notably focuses on implicit time-dependent topological information along multiple geometric dimensions. In particular, we propose a new approach, named Temporal MultiPersistence (TMP), which produces multidimensional topological fingerprints of the data by using the existing single-parameter topological summaries. The main idea behind TMP is to merge the two newest directions in topological representation learning, that is, multi-persistence, which simultaneously describes data shape evolution along multiple key parameters, and zigzag persistence, which enables us to extract the most salient data shape information over time. We derive theoretical guarantees of TMP vectorizations and show their utility in application to forecasting on benchmark traffic flow, Ethereum blockchain, and electrocardiogram datasets, demonstrating competitive performance, especially in scenarios of limited data records. In addition, our TMP method improves the computational efficiency of state-of-the-art multipersistence summaries by up to 59.5 times.



Paperid:1298
Authors:Jing Cui, Yufei Han, Yuzhe Ma, Jianbin Jiao, Junge Zhang
University of Chinese Academy of Sciences, INRIA, Microsoft Azure AI, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences
Abstract:
Backdoor attacks in reinforcement learning (RL) have previously employed intense attack strategies to ensure attack success. However, these methods suffer from high attack costs and increased detectability. In this work, we propose a novel approach, BadRL, which focuses on conducting highly sparse backdoor poisoning efforts during training and testing while maintaining successful attacks. Our algorithm, BadRL, strategically chooses state observations with high attack values to inject triggers during training and testing, thereby reducing the chances of detection. In contrast to the previous methods that utilize sample-agnostic trigger patterns, BadRL dynamically generates distinct trigger patterns based on targeted state observations, thereby enhancing its effectiveness. Theoretical analysis shows that the targeted backdoor attack is always viable and remains stealthy under specific assumptions. Empirical results on various classic RL tasks illustrate that BadRL can substantially degrade the performance of a victim agent with minimal poisoning efforts (0.003% of total training steps) during training and infrequent attacks during testing. Code is available at: https://github.com/7777777cc/code.



Paperid:1299
Authors:Shuang Cui, Kai Han, He Huang
School of Computer Science and Technology / Suzhou Institute for Advanced Research, University of Science and Technology of China, School of Computer Science and Technology, Soochow University, School of Computer Science and Technology, Soochow University
Abstract:
Submodular maximization algorithms have found wide applications in various fields such as data summarization, recommendation systems, and active learning. In recent years, deletion-robust submodular maximization algorithms have garnered attention due to their significant implications in scenarios where some data points may be removed due to user preferences or privacy concerns, such as in recommendation systems and influence maximization. In this paper, we study the fundamental problem of submodular maximization with knapsack constraints and propose a robust streaming algorithm for it. To the best of our knowledge, our algorithm is the first to solve this problem for non-monotone submodular functions and can achieve an approximation ratio of 1/(6.82+2.63d)-ϵ under a near-optimal summary size of O(k+r), where k denotes the maximum cardinality of any feasible solution, d denotes the number of knapsack constraints, and r is the robustness parameter. For monotone submodular functions, our algorithm can achieve an approximation ratio of 1/(2+2d)-ϵ under a near-optimal summary size of O(k+r), significantly improving upon the best-known ratio of Ω((1/d-ϵ)^2). The empirical performance of our algorithm is extensively evaluated in several applications including influence maximization and recommendation systems, and the experimental results demonstrate the effectiveness of our algorithm.



Paperid:1300
Authors:Zhenyu Cui, Yuxin Peng, Xun Wang, Manyu Zhu, Jiahuan Zhou
Peking University, Peking University, ByteDance Inc, ByteDance Inc, Peking University
Abstract:
Recent large-scale pre-trained models like CLIP have attracted great attention in vision-language tasks. However, when required to match image-text data collected in a streaming manner, namely Continual Vision-Language Retrieval (CVRL), their performance is still limited due to catastrophic forgetting of the learned old knowledge. To handle this issue, advanced methods have been proposed to distill the affinity knowledge between images and texts from the old model to the new one for anti-forgetting. Unfortunately, existing approaches neglect the impact of incorrect affinity, which prevents the balance between the anti-forgetting of old knowledge and the acquisition of new knowledge. Therefore, we propose a novel framework called Dynamic Knowledge Rectification (DKR) that simultaneously achieves incorrect knowledge filtering and rectification. Specifically, we first filter the incorrect affinity knowledge calculated by the old model on the new data. Then, a knowledge rectification method is designed to rectify the incorrect affinities while preserving the correct ones. In particular, for the new data that can only be correctly retrieved by the new model, we rectify them with the corresponding new affinity to protect them from negative transfer. Additionally, for those that cannot be retrieved by either the old or the new model, we introduce paired ground-truth labels to promote the acquisition of both old and new knowledge. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our DKR and its superiority against state-of-the-art methods.



Paperid:1301
Authors:Wenqi Dang, Zhou Yang, Weisheng Dong, Xin Li, Guangming Shi
Xidian University, Xidian University, Xidian University, West Virginia University, Xidian University Peng Cheng Laboratory
Abstract:
The performance of deep learning models often degrades rapidly when faced with imbalanced data characterized by a long-tailed distribution. Researchers have found that the fully connected layer trained by cross-entropy loss has large weight-norms for classes with many samples, but not for classes with few samples. How to address the data imbalance problem in both the encoder and the classifier remains under-researched. In this paper, we propose an inverse weight-balancing (IWB) approach to guide model training and alleviate the data imbalance problem in two stages. In the first stage, an encoder and classifier (the fully connected layer) are trained using conventional cross-entropy loss. In the second stage, with a fixed encoder, the classifier is fine-tuned through an adaptive distribution for IWB in the decision space. Unlike existing inverse image-frequency methods that implement a multiplicative margin adjustment transformation in the classification layer, our approach can be interpreted as an adaptive distribution alignment strategy using not only the class-wise number distribution but also the sample-wise difficulty distribution in both encoder and classifier. Experiments show that our method can greatly improve performance on imbalanced datasets such as CIFAR100-LT with different imbalance factors, ImageNet-LT, and iNaturalist 2018.
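A minimal PyTorch sketch of the generic two-stage recipe described above: stage one trains encoder and classifier with standard cross-entropy (not shown), and stage two freezes the encoder and fine-tunes only the classifier with class-frequency-informed weights. The inverse-frequency weighting here is a simplified stand-in for the adaptive distribution alignment used by IWB; the toy data and hyperparameters are assumptions.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def stage_two_finetune(encoder, classifier, loader, class_counts, epochs=5, lr=1e-3):
        # Stage 2: freeze the encoder, fine-tune only the classifier with
        # inverse-frequency class weights (a simplified stand-in for IWB's
        # adaptive distribution in the decision space).
        for p in encoder.parameters():
            p.requires_grad_(False)
        weights = class_counts.sum() / (len(class_counts) * class_counts.float())
        criterion = nn.CrossEntropyLoss(weight=weights)
        optim = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for x, y in loader:
                loss = criterion(classifier(encoder(x)), y)
                optim.zero_grad()
                loss.backward()
                optim.step()
        return classifier

    # Toy usage with a linear encoder/classifier and an imbalanced 3-class dataset.
    torch.manual_seed(0)
    X = torch.randn(600, 20)
    y = torch.cat([torch.zeros(500), torch.ones(80), 2 * torch.ones(20)]).long()
    loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)
    encoder, classifier = nn.Linear(20, 16), nn.Linear(16, 3)
    stage_two_finetune(encoder, classifier, loader, torch.bincount(y))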



Paperid:1302
Authors:Aram Davtyan, Paolo Favaro
University of Bern, University of Bern
Abstract:
We propose a novel unsupervised method to autoregressively generate videos from a single frame and a sparse motion input. Our trained model can generate unseen realistic object-to-object interactions. Although our model has never been given the explicit segmentation and motion of each object in the scene during training, it is able to implicitly separate their dynamics and extents. Key components in our method are the randomized conditioning scheme, the encoding of the input motion control, and the randomized and sparse sampling to enable generalization to out-of-distribution but realistic correlations. Our model, which we call YODA, therefore has the ability to move objects without physically touching them. Through extensive qualitative and quantitative evaluations on several datasets, we show that YODA is on par with or better than prior state-of-the-art video generation work in terms of both controllability and video quality.



Paperid:1303
Authors:William de Vazelhes, Bhaskar Mukhoty, Xiao-Tong Yuan, Bin Gu
MBZUAI, Abu Dhabi, UAE, MBZUAI, Abu Dhabi, UAE, Nanjing University, Suzhou, China, MBZUAI, Abu Dhabi, UAE Jilin University, Changchun, China
Abstract:
Sparse recovery is ubiquitous in machine learning and signal processing. Due to the NP-hard nature of sparse recovery, existing methods are known to suffer either from restrictive (or even unknown) applicability conditions, or high computational cost. Recently, iterative regularization methods have emerged as a promising fast approach because they can achieve sparse recovery in one pass through early stopping, rather than the tedious grid-search used in the traditional methods. However, most of those iterative methods are based on the l1 norm which requires restrictive applicability conditions and could fail in many cases. Therefore, achieving sparse recovery with iterative regularization methods under a wider range of conditions has yet to be further explored. To address this issue, we propose a novel iterative regularization algorithm, IRKSN, based on the k-support norm regularizer rather than the l1 norm. We provide conditions for sparse recovery with IRKSN, and compare them with traditional conditions for recovery with l1 norm regularizers. Additionally, we give an early stopping bound on the model error of IRKSN with explicit constants, achieving the standard linear rate for sparse recovery. Finally, we illustrate the applicability of our algorithm on several experiments, including a support recovery experiment with a correlated design matrix.



Paperid:1304
Authors:Swakshar Deb, Sejuti Rahman, Shafin Rahman
University of Dhaka, University of Dhaka, North South University
Abstract:
The utilization of wavelet-based techniques in graph neural networks (GNNs) has gained considerable attention, particularly in the context of node classification. Although existing wavelet-based approaches have shown promise, they are constrained by their reliance on pre-defined wavelet filters, rendering them incapable of effectively adapting to signals that reside on graphs for the task at hand. Recent research endeavors address this issue through the introduction of a wavelet lifting transform. However, this technique necessitates the use of bipartite graphs, causing a transformation of the original graph structure into a bipartite configuration. This alteration of graph topology results in the generation of undesirable wavelet filters, thereby undermining the effectiveness of the method. In response to these challenges, we propose a novel, simple, and effective adaptive graph wavelet neural network (SEA-GWNN) class that employs the lifting scheme on arbitrary graph structures while upholding the original graph topology by leveraging multi-hop computation trees. A noteworthy aspect of the approach is the focus on local substructures represented as acyclic trees, wherein the lifting strategy is applied in a localized manner. This locally defined lifting scheme effectively combines high-pass and low-pass frequency information to enhance node representations. Furthermore, to reduce computing costs, we propose to decouple the higher-order lifting operators and induce them from the lower-order structures. Finally, we benchmark our model on several real-world datasets spanning four distinct categories, including citation networks, webpages, the film industry, and large-scale graphs, and the experimental results showcase the efficacy of the proposed SEA-GWNN.



Paperid:1305
Authors:Jiale Deng, Yanyan Shen
Department of Computer Science and Engineering Shanghai Jiao Tong University, Department of Computer Science and Engineering Shanghai Jiao Tong University
Abstract:
Self-interpretable graph learning methods provide insights to unveil the black-box nature of GNNs by providing predictions with built-in explanations. However, current works suffer from performance degradation compared to GNNs trained without built-in explanations. We argue the main reason is that they fail to generate explanations satisfying both sufficiency and necessity, and the biased explanations further hurt GNNs' performance. In this work, we propose a novel framework for generating SUfficient aNd NecessarY explanations (SUNNY-GNN for short) that benefit GNNs' predictions. The key idea is to conduct augmentations by structurally perturbing given explanations and employ a contrastive loss to guide the learning of explanations toward sufficiency and necessity directions. SUNNY-GNN introduces two coefficients to generate hard and reliable contrastive samples. We further extend SUNNY-GNN to heterogeneous graphs. Empirical results on various GNNs and real-world graphs show that SUNNY-GNN yields accurate predictions and faithful explanations, outperforming state-of-the-art methods by improving prediction accuracy by 3.5% and explainability fidelity by 13.1% on average. Our code and data are available at https://github.com/SJTU-Quant/SUNNY-GNN.



Paperid:1306
Authors:Sen Deng, Yidan Feng, Haoneng Lin, Yiting Fan, Alex Pui-Wai Lee, Xiaowei Hu, Jing Qin
Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Department of cardiology, Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200030, China, Division of Cardiology, Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory, Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University
Abstract:
Semi-supervised learning (SSL) is a powerful tool to address the challenge of insufficient annotated data in medical segmentation problems. However, existing semi-supervised methods mainly rely on internal knowledge for pseudo labeling, which is biased due to the distribution mismatch between the highly imbalanced labeled and unlabeled data. Segmenting the left atrial appendage (LAA) from transesophageal echocardiogram (TEE) images is a typical medical image segmentation task featured by the scarcity of professional annotations and diverse data distributions, for which existing SSL models cannot achieve satisfactory performance. In this paper, we propose a novel strategy to mitigate the inherent challenge of distribution mismatch in SSL by, for the first time, incorporating a large foundation model (i.e., SAM in our implementation) into an SSL model to improve the quality of pseudo labels. We further propose a new self-reconstruction mechanism to generate both noise-resilient prompts, to improve SAM's generalization capability over TEE images, and self-perturbations, to stabilize the training process and reduce the impact of noisy labels. We conduct extensive experiments on an in-house TEE dataset; experimental results demonstrate that our method achieves better performance than state-of-the-art SSL models.



Paperid:1307
Authors:Cedric Derstroff, Mattia Cerrato, Jannis Brugger, Jan Peters, Stefan Kramer
Technische Universität Darmstadt Hessian Center for Artificial Intelligence (hessian.AI), Johannes Gutenberg-Universität Mainz, Technische Universität Darmstadt Hessian Center for Artificial Intelligence (hessian.AI), Technische Universität Darmstadt Hessian Center for Artificial Intelligence (hessian.AI) German Research Center for AI (DFKI) Centre for Cognitive Science, Johannes Gutenberg-Universität Mainz
Abstract:
Peer learning is a novel high-level reinforcement learning framework for agents learning in groups. While standard reinforcement learning trains an individual agent in trial-and-error fashion, all on its own, peer learning addresses a related setting in which a group of agents, i.e., peers, learns to master a task simultaneously together from scratch. Peers are allowed to communicate only about their own states and actions recommended by others: "What would you do in my situation?". Our motivation is to study the learning behavior of these agents. We formalize the teacher selection process in the action advice setting as a multi-armed bandit problem and therefore highlight the need for exploration. Eventually, we analyze the learning behavior of the peers and observe their ability to rank the agents' performance within the study group and understand which agents give reliable advice. Further, we compare peer learning with single-agent learning and a state-of-the-art action advice baseline. We show that peer learning is able to outperform single-agent learning and the baseline in several challenging discrete and continuous OpenAI Gym domains. In doing so, we also show that, within such a framework, complex policies can evolve from action recommendations, even beyond discrete action spaces.
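The teacher-selection view mentioned above can be illustrated with a standard UCB1 bandit over peers; this is a generic sketch under the assumption that the usefulness of a peer's advice can be scored as a bandit reward, not the paper's exact formulation.

    import math, random

    class UCB1AdvisorSelector:
        # UCB1 over peers: favour the advisor whose advice has paid off, but keep exploring.
        def __init__(self, n_peers):
            self.counts = [0] * n_peers
            self.values = [0.0] * n_peers

        def select(self):
            for i, c in enumerate(self.counts):   # play every arm once first
                if c == 0:
                    return i
            t = sum(self.counts) + 1
            return max(range(len(self.counts)),
                       key=lambda i: self.values[i] + math.sqrt(2 * math.log(t) / self.counts[i]))

        def update(self, peer, reward):
            self.counts[peer] += 1
            self.values[peer] += (reward - self.values[peer]) / self.counts[peer]

    # Toy run: peer 2 gives the most useful advice on average.
    random.seed(0)
    quality = [0.2, 0.5, 0.8]
    selector = UCB1AdvisorSelector(3)
    for _ in range(500):
        peer = selector.select()
        selector.update(peer, 1.0 if random.random() < quality[peer] else 0.0)
    print(selector.counts)   # most pulls should go to peer 2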



Paperid:1308
Authors:Nicolas Deutschmann, Marvin Alberts, María Rodríguez Martínez
IBM Research, IBM Research Zurich, IBM Research
Abstract:
We introduce two new extensions to the beam search algorithm based on conformal prediction (CP) to produce sets of sequences with theoretical coverage guarantees. The first method is very simple and proposes dynamically sized subsets of beam search results but, unlike typical CP procedures, has an upper bound on the achievable guarantee depending on a post-hoc calibration measure. Our second algorithm introduces the conformal set prediction procedure as part of the decoding process, producing a variable beam width which adapts to the current uncertainty. While more complex, this procedure can achieve coverage guarantees selected a priori. We provide marginal coverage bounds as well as calibration-conditional guarantees for each method, and evaluate them empirically on a selection of tasks drawing from natural language processing and chemistry.
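A minimal sketch of the first, simpler idea: calibrate a split-conformal threshold on held-out nonconformity scores and keep every beam hypothesis below it, giving a dynamically sized prediction set. The scores, miscoverage level, and candidate format are illustrative assumptions; the adaptive-beam-width variant is not shown.

    import numpy as np

    def conformal_threshold(cal_scores, alpha=0.1):
        # Split-conformal quantile: nonconformity level below which candidates are kept.
        n = len(cal_scores)
        q = np.ceil((n + 1) * (1 - alpha)) / n
        return np.quantile(cal_scores, min(q, 1.0))

    def conformal_beam_subset(candidates, threshold):
        # Keep every beam hypothesis whose nonconformity (e.g. negative log-probability)
        # is at or below the calibrated threshold, yielding a variable-size prediction set.
        return [seq for seq, score in candidates if score <= threshold]

    # Calibration: nonconformity of the reference sequence on held-out examples.
    cal_scores = np.array([1.2, 0.4, 2.3, 0.9, 1.7, 0.6, 3.1, 1.1, 0.8, 1.5])
    thr = conformal_threshold(cal_scores, alpha=0.2)
    beam = [("seq_a", 0.5), ("seq_b", 1.0), ("seq_c", 2.8)]   # (hypothesis, nonconformity)
    print(conformal_beam_subset(beam, thr))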



Paperid:1309
Authors:Yiqun Diao, Qinbin Li, Bingsheng He
National University of Singapore, UC Berkeley, National University of Singapore
Abstract:
Federated Learning (FL) has emerged as a promising solution to perform deep learning on different data owners without exchanging raw data. However, non-IID data has been a key challenge in FL, which could significantly degrade the accuracy of the final model. Among different non-IID types, label skews have been challenging and common in image classification and other tasks. Instead of averaging the local models in most previous studies, we propose FedConcat, a simple and effective approach that concatenates these local models as the base of the global model to effectively aggregate the local knowledge. To reduce the size of the global model, we adopt the clustering technique to group the clients by their label distributions and collaboratively train a model inside each cluster. We theoretically analyze the advantage of concatenation over averaging by analyzing the information bottleneck of deep neural networks. Experimental results demonstrate that FedConcat achieves significantly higher accuracy than previous state-of-the-art FL methods in various heterogeneous label skew distribution settings and meanwhile has lower communication costs. Our code is publicly available at https://github.com/sjtudyq/FedConcat.
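A rough sketch of the two ideas in the abstract, under simplifying assumptions: clients are grouped by their local label histograms with k-means, and the global representation is the concatenation of per-cluster feature extractors (stand-in linear maps here), on top of which a classifier would then be trained.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_clients(label_distributions, n_clusters):
        # Group clients whose local label histograms look alike.
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(label_distributions)

    def concat_features(extractors, x):
        # FedConcat-style aggregation: the global representation is the concatenation
        # of each cluster model's features.
        return np.concatenate([f(x) for f in extractors], axis=-1)

    # Toy setup: 6 clients, 5 classes, two clear label-skew groups.
    dists = np.array([[.8, .1, .05, .03, .02], [.7, .2, .05, .03, .02], [.75, .15, .05, .03, .02],
                      [.02, .03, .05, .2, .7], [.02, .03, .05, .1, .8], [.02, .03, .1, .15, .7]])
    print(cluster_clients(dists, n_clusters=2))   # e.g. [0 0 0 1 1 1]

    # Two (stand-in) per-cluster extractors producing 4-dim features each.
    W0, W1 = np.random.randn(10, 4), np.random.randn(10, 4)
    extractors = [lambda x, W=W0: np.tanh(x @ W), lambda x, W=W1: np.tanh(x @ W)]
    x = np.random.randn(32, 10)
    print(concat_features(extractors, x).shape)   # (32, 8)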



Paperid:1310
Authors:Xiaojian Ding, Fan Yang
Nanjing University of Finance and Economics, Nanjing University of Finance and Economics
Abstract:
Multi-kernel learning (MKL) is a representative supervised multi-view learning method widely applied in multi-modal and multi-view applications. MKL aims to classify data by integrating complementary information from predefined kernels. Although existing MKL methods achieve promising performance, they fail to consider the tradeoff between the diversity and classification accuracy of kernels, preventing further improvement of classification performance. In this paper, we tackle this problem by generating a number of high-quality base learning kernels and selecting a kernel subset with maximum pairwise diversity and minimum generalization errors. We first formulate this idea as a nonconvex quadratic integer programming problem. Then we transform this nonconvex problem into a convex optimization problem and prove it is equivalent to a semidefinite relaxation problem, which a semidefinite-based branch-and-bound algorithm can quickly solve. Experimental results on real-world datasets demonstrate the superiority of the proposed method. The results also show that our method works for the support vector machine (SVM) classifier and other state-of-the-art kernel classifiers.



Paperid:1311
Authors:Xin Ding, Yongwei Wang, Zuheng Xu
Nanjing University of Information Science & Technology, Shanghai Institute for Advanced Study, Zhejiang University, University of British Columbia
Abstract:
Continuous Conditional Generative Adversarial Networks (CcGANs) enable generative modeling conditional on continuous scalar variables (termed regression labels). However, they can produce subpar fake images due to limited training data. Although Negative Data Augmentation (NDA) effectively enhances unconditional and class-conditional GANs by introducing anomalies into real training images, guiding the GANs away from low-quality outputs, its impact on CcGANs is limited, as it fails to replicate negative samples that may occur during the CcGAN sampling. We present a novel NDA approach called Dual-NDA specifically tailored for CcGANs to address this problem. Dual-NDA employs two types of negative samples: visually unrealistic images generated from a pre-trained CcGAN and label-inconsistent images created by manipulating real images' labels. Leveraging these negative samples, we introduce a novel discriminator objective alongside a modified CcGAN training algorithm. Empirical analysis on UTKFace and Steering Angle reveals that Dual-NDA consistently enhances the visual fidelity and label consistency of fake images generated by CcGANs, exhibiting a substantial performance gain over the vanilla NDA. Moreover, by applying Dual-NDA, CcGANs demonstrate a remarkable advancement beyond the capabilities of state-of-the-art conditional GANs and diffusion models, establishing a new pinnacle of performance. Our codes can be found at https://github.com/UBCDingXin/Dual-NDA.



Paperid:1312
Authors:Yongqi Ding, Lin Zuo, Mengmeng Jing, Pei He, Yongjun Xiao
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Neuromorphic object recognition with spiking neural networks (SNNs) is the cornerstone of low-power neuromorphic computing. However, existing SNNs suffer from significant latency, utilizing 10 to 40 or more timesteps to recognize neuromorphic objects. At low latencies, the performance of existing SNNs is drastically degraded. In this work, we propose the Shrinking SNN (SSNN) to achieve low-latency neuromorphic object recognition without reducing performance. Concretely, we alleviate the temporal redundancy in SNNs by dividing SNNs into multiple stages with progressively shrinking timesteps, which significantly reduces the inference latency. During timestep shrinkage, the temporal transformer smoothly transforms the temporal scale and preserves the information maximally. Moreover, we add multiple early classifiers to the SNN during training to mitigate the mismatch between the surrogate gradient and the true gradient, as well as the gradient vanishing/exploding, thus eliminating the performance degradation at low latency. Extensive experiments on neuromorphic datasets, CIFAR10-DVS, N-Caltech101, and DVS-Gesture have revealed that SSNN is able to improve the baseline accuracy by 6.55% ~ 21.41%. With only 5 average timesteps and without any data augmentation, SSNN is able to achieve an accuracy of 73.63% on CIFAR10-DVS. This work presents a heterogeneous temporal scale SNN and provides valuable insights into the development of high-performance, low-latency SNNs.



Paperid:1313
Authors:Mingjiang Duan, Tongya Zheng, Yang Gao, Gang Wang, Zunlei Feng, Xinyu Wang
Zhejiang University, Hangzhou City University Zhejiang University, Zhejiang University, Bangsheng Technology Co,Ltd. ZJU-Bangsun Joint Research Center, Zhejiang University Shanghai Institute for Advanced Study of Zhejiang University, Zhejiang University ZJU-Bangsun Joint Research Center
Abstract:
Fraud detection has increasingly become a prominent research field due to dramatically increased incidents of fraud. The complex connections involving thousands or even millions of nodes present challenges for fraud detection tasks. Many researchers have developed various graph-based methods to detect fraud from these intricate graphs. However, those methods neglect two distinct characteristics of the fraud graph: the non-additivity of certain attributes and the distinguishability of grouped messages from neighbor nodes. This paper introduces the Dynamic Grouping Aggregation Graph Neural Network (DGA-GNN) for fraud detection, which addresses these two characteristics by dynamically grouping attribute value ranges and neighbor nodes. In DGA-GNN, we initially propose the decision tree binning encoding to transform non-additive node attributes into bin vectors. This approach aligns well with the GNN’s aggregation operation and avoids nonsensical feature generation. Furthermore, we devise a feedback dynamic grouping strategy to classify graph nodes into two distinct groups and then employ a hierarchical aggregation. This method extracts more discriminative features for fraud detection tasks. Extensive experiments on five datasets suggest that our proposed method achieves a 3% ~ 16% improvement over existing SOTA methods. Code is available at https://github.com/AtwoodDuan/DGA-GNN.
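A minimal sketch of the decision-tree binning idea for one non-additive attribute: a shallow tree fit against the fraud label supplies split thresholds, and each attribute value is replaced by a one-hot bin vector that sums cleanly under GNN aggregation. The attribute, tree depth, and toy data are assumptions, not the paper's exact encoding.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def tree_bin_edges(values, labels, max_bins=4):
        # Fit a shallow tree on a single attribute and use its split thresholds as bin edges.
        tree = DecisionTreeClassifier(max_leaf_nodes=max_bins, random_state=0)
        tree.fit(values.reshape(-1, 1), labels)
        thresholds = tree.tree_.threshold[tree.tree_.feature == 0]   # internal splits only
        return np.sort(thresholds)

    def to_bin_vector(values, edges):
        # One-hot "bin vector" per node; plays nicely with sum/mean GNN aggregation.
        idx = np.searchsorted(edges, values)
        out = np.zeros((len(values), len(edges) + 1))
        out[np.arange(len(values)), idx] = 1.0
        return out

    rng = np.random.default_rng(0)
    amounts = rng.lognormal(mean=3.0, sigma=1.0, size=1000)          # e.g. transaction amounts
    is_fraud = (amounts > np.quantile(amounts, 0.9)).astype(int)
    edges = tree_bin_edges(amounts, is_fraud)
    print(edges)
    print(to_bin_vector(amounts[:5], edges))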



Paperid:1314
Authors:Yue Duan, Zhen Zhao, Lei Qi, Luping Zhou, Lei Wang, Yinghuan Shi
Nanjing University, The University of Sydney, Southeast University, The University of Sydney, University of Wollongong, Nanjing University
Abstract:
While semi-supervised learning (SSL) has yielded promising results, a more realistic SSL scenario remains to be explored, in which the unlabeled data exhibits extremely high recognition difficulty, e.g., fine-grained visual classification in the context of SSL (SS-FGVC). The increased recognition difficulty on fine-grained unlabeled data severely degrades pseudo-labeling accuracy, resulting in poor performance of the SSL model. To tackle this challenge, we propose Soft Label Selection with Confidence-Aware Clustering based on Class Transition Tracking (SoC), which reconstructs the pseudo-label selection process by jointly optimizing an Expansion Objective and a Shrinkage Objective in a soft-label manner. The former objective encourages soft labels to absorb more candidate classes to ensure the attendance of the ground-truth class, while the latter encourages soft labels to reject more noisy classes, which is theoretically proved to be equivalent to entropy minimization. In comparisons with various state-of-the-art methods, our approach demonstrates superior performance in SS-FGVC. Checkpoints and source code are available at https://github.com/NJUyued/SoC4SS-FGVC.



Paperid:1315
Authors:Béni Egressy, Luc von Niederhäusern, Jovan Blanuša, Erik Altman, Roger Wattenhofer, Kubilay Atasu
ETH Zurich, Zurich, Switzerland IBM Research Europe, Zurich, Switzerland, ETH Zurich, Zurich, Switzerland IBM Research Europe, Zurich, Switzerland, IBM Research Europe, Zurich, Switzerland, IBM Watson Research, Yorktown Heights, NY, USA, ETH Zurich, Zurich, Switzerland, IBM Research Europe, Zurich, Switzerland
Abstract:
This paper analyses a set of simple adaptations that transform standard message-passing Graph Neural Networks (GNNs) into provably powerful directed multigraph neural networks. The adaptations include multigraph port numbering, ego IDs, and reverse message passing. We prove that the combination of these adaptations theoretically enables the detection of any directed subgraph pattern. To validate the effectiveness of our proposed adaptations in practice, we conduct experiments on synthetic subgraph detection tasks, which demonstrate outstanding performance with almost perfect results. Moreover, we apply our proposed adaptations to two financial crime analysis tasks. We observe dramatic improvements in detecting money laundering transactions, improving the minority-class F1 score of a standard message-passing GNN by up to 30%, and closely matching or outperforming tree-based and GNN baselines. Similarly impressive results are observed on a real-world phishing detection dataset, boosting three standard GNNs’ F1 scores by around 15% and outperforming all baselines. An extended version with appendices can be found on arXiv: https://arxiv.org/abs/2306.11586.
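Two of the adaptations are easy to sketch on a plain edge list: reverse message passing can be emulated by adding a reversed copy of every edge with a direction flag, and ego IDs by appending an indicator feature for the centre node of a sampled subgraph. This is a generic numpy illustration, not the paper's implementation; port numbering is omitted.

    import numpy as np

    def add_reverse_edges(edge_index, edge_feats):
        # Reverse message passing: duplicate every directed edge in the opposite
        # direction and flag the direction with an extra feature column.
        src, dst = edge_index
        rev = np.stack([dst, src])
        fwd_flag = np.ones((edge_feats.shape[0], 1))
        rev_flag = np.zeros((edge_feats.shape[0], 1))
        return (np.concatenate([edge_index, rev], axis=1),
                np.vstack([np.hstack([edge_feats, fwd_flag]),
                           np.hstack([edge_feats, rev_flag])]))

    def add_ego_id(node_feats, ego):
        # Ego IDs: mark the centre node of the sampled subgraph with an indicator feature.
        flag = np.zeros((node_feats.shape[0], 1))
        flag[ego] = 1.0
        return np.hstack([node_feats, flag])

    edge_index = np.array([[0, 1, 2], [1, 2, 0]])   # 3 directed edges
    edge_feats = np.random.randn(3, 4)              # e.g. amount, currency, timestamp, ...
    ei, ef = add_reverse_edges(edge_index, edge_feats)
    print(ei.shape, ef.shape)                        # (2, 6) (6, 5)
    print(add_ego_id(np.random.randn(3, 2), ego=0).shape)   # (3, 3)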



Paperid:1316
Authors:Ahmad-Reza Ehyaei, Kiarash Mohammadi, Amir-Hossein Karimi, Samira Samadi, Golnoosh Farnadi
Max Planck Institute for Intelligent Systems, Université de Montréal, Montréal, Canada Mila - Québec AI Institute, Montréal, Canada, Max Planck Institute for Intelligent Systems Germany, Max Planck Institute for Intelligent Systems, Université de Montréal, Montréal, Canada Mila - Québec AI Institute, Montréal, Canada McGill University, Montréal, Canada
Abstract:
As responsible AI gains importance in machine learning algorithms, properties like fairness, adversarial robustness, and causality have received considerable attention in recent years. However, despite their individual significance, there remains a critical gap in simultaneously exploring and integrating these properties. In this paper, we propose a novel approach that examines the relationship between individual fairness, adversarial robustness, and structural causal models (SCMs) in heterogeneous data spaces, particularly when dealing with discrete sensitive attributes. We use SCMs and sensitive attributes to create a fair metric and apply it to measure semantic similarity among individuals. By introducing a novel causal adversarial perturbation (CAP) and applying adversarial training, we create a new regularizer that combines individual fairness, causality, and robustness in the classifier. Our method is evaluated on both real-world and synthetic datasets, demonstrating its effectiveness in achieving an accurate classifier that simultaneously exhibits fairness, adversarial robustness, and causal awareness.



Paperid:1317
Authors:Ouns El Harzli, Bernardo Cuenca Grau, Guillermo Valle-Pérez, Ard A. Louis
University of Oxford, University of Oxford, University of Oxford, University of Oxford
Abstract:
Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with an increasing number of parameters, then grows after reaching an optimal number of parameters, which is less than the number of data points, but then descends again in the overparameterized regime. In this paper, we use techniques from random matrix theory to characterize the spectral distribution of the empirical feature covariance matrix as a width-dependent perturbation of the spectrum of the neural network Gaussian process (NNGP) kernel, thus establishing a novel connection between the NNGP literature and the random matrix theory literature in the context of neural networks. Our analytical expressions allow us to explore the generalisation behavior of the corresponding kernel and GP regression. Furthermore, they offer a new interpretation of double-descent in terms of the discrepancy between the width-dependent empirical kernel and the width-independent NNGP kernel.



Paperid:1318
Authors:Karthik Elamvazhuthi, Xuechen Zhang, Matthew Jacobs, Samet Oymak, Fabio Pasqualetti
University of California, Riverside, University of California, Riverside, University of California, Santa Barbara, University of Michigan, University of California, Riverside
Abstract:
Score-matching-based diffusion has been shown to achieve state-of-the-art results in generative modeling. In the original score-matching-based diffusion algorithm, the forward process is a differential equation whose probability density evolves according to a linear partial differential equation, the Fokker-Planck equation. A drawback of this approach is that one needs the data distribution to have a Lipschitz logarithmic gradient. This excludes a large class of data distributions that have compact support. We present a deterministic diffusion process for which the vector fields are always Lipschitz and hence the score does not explode for probability measures with compact support. This deterministic diffusion process can be seen as a regularization of the porous media equation, which enables one to guarantee long-term convergence of the forward process to the noise distribution. Though the porous media equation is itself not always guaranteed to have a Lipschitz vector field, it can be used to understand the closeness of the output of the algorithm to the data distribution as a function of the time horizon and the score-matching error. This analysis enables us to show that the algorithm has better dependence on the score-matching error than approaches based on stochastic diffusions. Using numerical experiments, we verify our theoretical results on example one- and two-dimensional data distributions that are compactly supported. Additionally, we validate the approach on a modified MNIST data set for which the distribution is concentrated on a compact set. In each of the experiments, the approach using deterministic diffusion performs better than the diffusion algorithm with a stochastic forward process when considering the FID scores of the generated samples.



Paperid:1319
Authors:Moshe Eliasof, Eldad Haber, Eran Treister
University of Cambridge Ben-Gurion University of the Negev, University of British Columbia, Ben-Gurion University of the Negev
Abstract:
Graph neural networks (GNNs) have shown remarkable success in learning representations for graph-structured data. However, GNNs still face challenges in modeling complex phenomena that involve feature transportation. In this paper, we propose a novel GNN architecture inspired by Advection-Diffusion-Reaction systems, called ADR-GNN. Advection models feature transportation, while diffusion captures the local smoothing of features, and reaction represents the non-linear transformation between feature channels. We provide an analysis of the qualitative behavior of ADR-GNN that shows the benefit of combining advection, diffusion, and reaction. To demonstrate its efficacy, we evaluate ADR-GNN on real-world node classification and spatio-temporal datasets, and show that it improves or offers competitive performance compared to state-of-the-art networks.
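A rough numpy sketch of one explicit advection-diffusion-reaction update on node features, to make the three terms concrete; the transport operator, the explicit Euler step, and the reaction stand-in are illustrative assumptions rather than the ADR-GNN layers themselves.

    import numpy as np

    def adr_step(X, A, W_r, dt=0.1):
        # One explicit ADR update on node features X with adjacency A (illustrative only):
        # diffusion smooths along edges, advection transports features along the graph,
        # and reaction mixes channels non-linearly.
        deg = A.sum(axis=1)
        L = np.diag(deg) - A                                          # graph Laplacian
        diffusion = -L @ X
        T = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-9)       # column-stochastic transport
        advection = T @ X - X                                         # inflow minus current mass
        reaction = np.tanh(X @ W_r)
        return X + dt * (diffusion + advection + reaction)

    rng = np.random.default_rng(0)
    A = (rng.random((5, 5)) < 0.4).astype(float)
    np.fill_diagonal(A, 0)
    X = rng.normal(size=(5, 8))
    W_r = rng.normal(size=(8, 8)) * 0.1
    print(adr_step(X, A, W_r).shape)   # (5, 8)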



Paperid:1320
Authors:Katharina Ensinger, Nicholas Tagliapietra, Sebastian Ziesche, Sebastian Trimpe
Bosch Center for Artificial Intelligence, Renningen, Germany Institute for Data Science in Mechanical Engineering, RWTH Aachen University, Bosch Center for Artificial Intelligence, Renningen, Germany, Bosch Center for Artificial Intelligence, Renningen, Germany, Institute for Data Science in Mechanical Engineering, RWTH Aachen University
Abstract:
Many physical systems can be described as continuous-time dynamical systems. In practice, the true system is often unknown and has to be learned from measurement data. Since data is typically collected in discrete time, e.g. by sensors, most methods in Gaussian process (GP) dynamics model learning are trained on one-step-ahead predictions. While this scheme is mathematically tempting, it can become problematic in several scenarios, e.g. if measurements are provided at irregularly-sampled time steps or physical system properties have to be conserved. Thus, we aim for a GP model of the true continuous-time dynamics. We tackle this task by leveraging higher-order numerical integrators. These integrators provide the necessary tools to discretize dynamical systems with arbitrary accuracy. However, most higher-order integrators require dynamics evaluations at intermediate time steps, making exact GP inference intractable. In previous work, this problem is often addressed by approximate inference techniques. However, exact GP inference is preferable in many scenarios, e.g. due to its mathematical guarantees. In order to enable direct inference, we propose to leverage multistep and Taylor integrators. We demonstrate how exact inference schemes can be derived for these types of integrators. Further, we derive tailored sampling schemes that allow one to draw consistent dynamics functions from the posterior. The learned model can thus be integrated with arbitrary integrators, just like a standard dynamical system. We show empirically and theoretically that our approach yields an accurate representation of the continuous-time system.



Paperid:1321
Authors:Katharina Ensinger, Sebastian Ziesche, Sebastian Trimpe
Bosch Center for Artificial Intelligence, Renningen, Germany Institute for Data Science in Mechanical Engineering, RWTH Aachen University, Bosch Center for Artificial Intelligence, Renningen, Germany, Institute for Data Science in Mechanical Engineering, RWTH Aachen University
Abstract:
Dynamics model learning deals with the task of inferring unknown dynamics from measurement data and predicting the future behavior of the system. A typical approach to address this problem is to train recurrent models. However, predictions with these models are often not physically meaningful. Further, they suffer from deteriorated behavior over time due to accumulating errors. Often, simulators built on first principles are available, which are physically meaningful by design. However, modeling simplifications typically cause inaccuracies in these models. Consequently, hybrid modeling is an emerging trend that aims to combine the best of both worlds. In this paper, we propose a new approach to hybrid modeling, where we inform the latent states of a learned model via a black-box simulator. This allows us to control the predictions via the simulator, preventing them from accumulating errors. This is especially challenging since, in contrast to previous approaches, access to the simulator's latent states is not available. We tackle the task by leveraging observers, a well-known concept from control theory, inferring unknown latent states from observations and dynamics over time. In our learning-based setting, we jointly learn the dynamics and an observer that infers the latent states via the simulator. Thus, the simulator constantly corrects the latent states, compensating for modeling mismatch caused by learning. To maintain flexibility, we train an RNN-based residuum for the latent states that cannot be informed by the simulator.



Paperid:1322
Authors:Deividas Eringis, John Leth, Zheng-Hua Tan, Rafael Wisniewski, Mihály Petreczky
Department of Electronic Systems, Aalborg University, Department of Electronic Systems, Aalborg University, Department of Electronic Systems, Aalborg University, Department of Electronic Systems, Aalborg University, Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL
Abstract:
In this paper, we derive a PAC-Bayes bound on the generalisation gap, in a supervised time-series setting, for a special class of discrete-time non-linear dynamical systems. This class includes stable recurrent neural networks (RNN), and the motivation for this work was its application to RNNs. In order to achieve the results, we impose some stability constraints on the allowed models. Here, stability is understood in the sense of dynamical systems. For RNNs, these stability conditions can be expressed in terms of conditions on the weights. We assume the processes involved are essentially bounded and the loss functions are Lipschitz. The proposed bound on the generalisation gap depends on the mixing coefficient of the data distribution, and the essential supremum of the data. Furthermore, the bound converges to zero as the dataset size increases. In this paper, we 1) formalize the learning problem, 2) derive a PAC-Bayesian error bound for such systems, 3) discuss various consequences of this error bound, and 4) show an illustrative example, with discussions on computing the proposed bound. Unlike other available bounds, the derived bound holds for non i.i.d. data (time-series) and it does not grow with the number of steps of the RNN.



Paperid:1323
Authors:Pascal Esser, Maximilian Fleissner, Debarghya Ghoshdastidar
Technical University of Munich, Technical University of Munich, Technical University of Munich
Abstract:
Unsupervised and self-supervised representation learning has become popular in recent years for learning useful features from unlabelled data. Representation learning has been mostly developed in the neural network literature, and other models for representation learning are surprisingly unexplored. In this work, we introduce and analyze several kernel-based representation learning approaches: Firstly, we define two kernel Self-Supervised Learning (SSL) models using contrastive loss functions and secondly, a Kernel Autoencoder (AE) model based on the idea of embedding and reconstructing data. We argue that the classical representer theorems for supervised kernel machines are not always applicable for (self-supervised) representation learning, and present new representer theorems, which show that the representations learned by our kernel models can be expressed in terms of kernel matrices. We further derive generalisation error bounds for representation learning with kernel SSL and AE, and empirically evaluate the performance of these methods in both small data regimes as well as in comparison with neural network based models.



Paperid:1324
Authors:Xiaolong Fan, Maoguo Gong, Yue Wu, Zedong Tang, Jieyi Liu
Xidian University, Xidian University, Xidian University, Xidian University, Xidian University
Abstract:
Graph Structure Learning (GSL) has demonstrated considerable potential in the analysis of graph-unknown non-Euclidean data across a wide range of domains. However, constructing an end-to-end graph structure learning model poses a challenge due to the impediment of gradient flow caused by the nearest neighbor sampling strategy. In this paper, we construct a differential graph structure learning model by replacing the non-differentiable nearest neighbor sampling with a differentiable sampling using the reparameterization trick. Under this framework, we argue that the act of sampling nearest neighbors may not invariably be essential, particularly in instances where node features exhibit a significant degree of similarity. To alleviate this issue, the bell-shaped Gaussian Similarity (GauSim) modeling is proposed to sample non-nearest neighbors. To adaptively model the similarity, we further propose Neural Gaussian Similarity (NeuralGauSim) with learnable parameters featuring flexible sampling behaviors. In addition, we develop a scalable method by transferring the large-scale graph to the transition graph to significantly reduce the complexity. Experimental results demonstrate the effectiveness of the proposed methods.
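A minimal sketch of the bell-shaped similarity idea (illustrative only; the parameter names mu/sigma and the Gumbel-softmax reparameterization below are our assumptions, not necessarily the paper's exact construction):

import torch
import torch.nn.functional as F

class NeuralGaussianSampler(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.tensor(0.5))      # preferred similarity level
        self.log_sigma = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, features, tau=0.5):
        sim = F.normalize(features, dim=-1) @ F.normalize(features, dim=-1).t()
        sigma = self.log_sigma.exp()
        # Bell-shaped weighting: nodes need not be nearest neighbors to be sampled.
        logits = -((sim - self.mu) ** 2) / (2 * sigma ** 2)
        # Differentiable (reparameterized) neighbor sampling.
        return F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)

sampler = NeuralGaussianSampler()
adj = sampler(torch.randn(8, 16))  # soft, differentiable adjacency
print(adj.shape)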



Paperid:1325
Authors:Yan Fan, Yu Wang, Pengfei Zhu, Qinghua Hu
Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University Haihe Laboratory of Information Technology Application Innovation, Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University Haihe Laboratory of Information Technology Application Innovation, Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University Haihe Laboratory of Information Technology Application Innovation, Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University Haihe Laboratory of Information Technology Application Innovation
Abstract:
Continual learning (CL) has shown promising results and comparable performance to learning at once in a fully supervised manner. However, CL strategies typically require a large number of labeled samples, making their real-life deployment challenging. In this work, we focus on semi-supervised continual learning (SSCL), where the model progressively learns from partially labeled data with unknown categories. We provide a comprehensive analysis of SSCL and demonstrate that unreliable distributions of unlabeled data lead to unstable training and refinement of the progressing stages. This problem severely impacts the performance of SSCL. To address the limitations, we propose a novel approach called Dynamic Sub-Graph Distillation (DSGD) for semi-supervised continual learning, which leverages both semantic and structural information to achieve more stable knowledge distillation on unlabeled data and exhibit robustness against distribution bias. Firstly, we formalize a general model of structural distillation and design a dynamic graph construction for the continual learning progress. Next, we define a structure distillation vector and design a dynamic sub-graph distillation algorithm, which enables end-to-end training and adaptability to scale up tasks. The entire proposed method is adaptable to various CL methods and supervision settings. Finally, experiments conducted on three datasets, CIFAR10, CIFAR100, and ImageNet-100, with varying supervision ratios, demonstrate the effectiveness of our proposed approach in mitigating the catastrophic forgetting problem in semi-supervised continual learning scenarios. Our code is available at: https://github.com/fanyan0411/DSGD.



Paperid:1326
Authors:Yunqian Fan, Xiuying Wei, Ruihao Gong, Yuqing Ma, Xiangguo Zhang, Qi Zhang, Xianglong Liu
School of Information Science and Technology, ShanghaiTech University SenseTime Research, SenseTime Research, State Key Laboratory of Complex & Critical Software Environment, Beihang University, Beijing, China SenseTime Research, Institute of Artificial Intelligence, Beihang University, Beijing, China State Key Laboratory of Complex & Critical Software Environment, Beihang University, Beijing, China, SenseTime Research, SenseTime Research, State Key Laboratory of Complex & Critical Software Environment, Beihang University, Beijing, China
Abstract:
Lane detection (LD) plays a crucial role in enhancing the L2+ capabilities of autonomous driving, capturing widespread attention. Post-Processing Quantization (PTQ) could facilitate the practical application of LD models, enabling fast inference with limited memory and without labeled data. However, prior PTQ methods do not consider the complex LD outputs that contain physical semantics, such as offsets, locations, etc., and thus cannot be directly applied to LD models. In this paper, we pioneeringly investigate semantic sensitivity to post-processing for lane detection with a novel Lane Distortion Score. Moreover, we identify two main factors impacting the LD performance after quantization, namely intra-head sensitivity and inter-head sensitivity, where a small quantization error in specific semantics can cause significant lane distortion. Thus, we propose a Selective Focus framework deployed with Semantic Guided Focus and Sensitivity Aware Selection modules, to incorporate post-processing information into PTQ reconstruction. Based on the observed intra-head sensitivity, Semantic Guided Focus is introduced to prioritize foreground-related semantics using a practical proxy. For inter-head sensitivity, we present Sensitivity Aware Selection, efficiently recognizing influential prediction heads and refining the optimization objectives at runtime. Extensive experiments have been done on a wide variety of models including keypoint-, anchor-, curve-, and segmentation-based ones. Our method produces quantized models in minutes on a single GPU and can achieve a 6.4% F1 score improvement on the CULane dataset. Code and supplementary statement can be found at https://github.com/PannenetsF/SelectiveFocus.



Paperid:1327
Authors:Junpeng Fang, Gongduo Zhang, Qing Cui, Caizhi Tang, Lihong Gu, Longfei Li, Jinjie Gu, Jun Zhou
Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group
Abstract:
Accurate prediction of coupon usage is crucial for promoting user consumption through targeted coupon recommendations. However, in real-world coupon recommendations, the coupon allocation process is not solely determined by the model trained with the historical interaction data but is also interfered with by marketing tactics intended to fulfill specific commercial goals. This interference creates an imbalance in the interactions, which causes the data to deviate from the user's natural preferences. We refer to this deviation as the matching bias. Such biased interaction data affects the efficacy of the model, and thus it is necessary to employ debiasing techniques to prevent any negative impact. We investigate the mitigation of matching bias in coupon recommendations from a causal-effect perspective. By treating the attributes of users and coupons associated with marketing tactics as confounders, we find the confounders open the backdoor path between user-coupon matching and the conversion, which introduces spurious correlation. To remove this harmful effect, we propose a novel training paradigm named Backdoor Adjustment via Group Adaptation (BAGA) for debiased coupon recommendations, which performs intervened training and inference, i.e., separately modeling each user-coupon group pair. However, modeling all possible group pairs greatly increases the computational complexity and cost. To address the efficiency challenge, we further present a simple but effective dual-tower multi-task framework and leverage the Customized Gate Control (CGC) model architecture, which separately models each user and coupon group with a separate expert module. We instantiate BAGA on five representative models: FM, DNN, NCF, MASKNET, and DEEPFM, and conduct comprehensive offline and online experiments to demonstrate the efficacy of our proposed paradigm.
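For reference, the textbook backdoor-adjustment identity underlying this kind of intervened training, written in our own notation with g denoting the confounding user/coupon group variable, X the user-coupon match, and Y the conversion:

\[
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{g} P\bigl(Y \mid X, g\bigr)\, P(g),
\]

which is why separately modeling each group pair and then aggregating over groups yields a debiased estimate.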



Paperid:1328
Authors:Yujie Fang, Xin Li, Qianyu Chen, Mingzhong Wang
Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology, The University of the Sunshine Coast
Abstract:
The widespread adoption of Graph Neural Networks (GNNs) has led to an increasing focus on their reliability. To address the issue of under-confidence in GNNs, various calibration methods have been developed to gain notable reductions in calibration error. However, we observe that existing approaches generally fail to enhance consistently, and in some cases even deteriorate, GNNs' ability to discriminate between correct and incorrect predictions. In this study, we advocate the significance of discriminative ability and the inclusion of relevant evaluation metrics. Our rationale is twofold: 1) Overlooking discriminative ability can inadvertently compromise the overall quality of the model; 2) Leveraging discriminative ability can significantly inform and improve calibration outcomes. Therefore, we thoroughly explore why existing calibration methods are ineffective for, and can even degrade, the discriminative ability of GNNs. Building upon these insights, we conduct GNN calibration experiments across multiple datasets using a straightforward example model, denoted as DC(GNN). Its excellent performance confirms the potential of integrating discriminative ability as a key consideration in the calibration of GNNs, thereby establishing a pathway toward more effective and reliable network calibration.



Paperid:1329
Authors:Jiaheng Feng, Mingxiao Feng, Haolin Song, Wengang Zhou, Houqiang Li
EEIS Department, University of Science and Technology of China, EEIS Department, University of Science and Technology of China, EEIS Department, University of Science and Technology of China, EEIS Department, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, EEIS Department, University of Science and Technology of China Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Abstract:
Offline-to-online reinforcement learning (RL) provides a promising solution to improving suboptimal offline pre-trained policies through online fine-tuning. However, one efficient method, unconstrained fine-tuning, often suffers from severe policy collapse due to excessive distribution shift. To ensure stability, existing methods retain offline constraints and employ additional techniques during fine-tuning, which hurts efficiency. In this work, we introduce a novel perspective: eliminating the policy collapse without imposing constraints. We observe that such policy collapse arises from the mismatch between unconstrained fine-tuning and the conventional RL training framework. To this end, we propose Stabilized Unconstrained Fine-tuning (SUF), a streamlined framework that benefits from the efficiency of unconstrained fine-tuning while ensuring stability by modifying the Update-To-Data ratio. With just a few lines of code adjustments, SUF demonstrates remarkable adaptability to diverse backbones and superior performance over state-of-the-art baselines.
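A schematic sketch of online fine-tuning with an explicit Update-To-Data (UTD) ratio, the quantity SUF adjusts; the agent, env, and buffer objects are placeholders, not the authors' API:

def finetune(agent, env, buffer, total_steps, utd_ratio=1):
    obs = env.reset()
    for step in range(total_steps):
        action = agent.act(obs)
        next_obs, reward, done = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
        # UTD ratio = gradient updates per collected transition; the point above
        # is that choosing this ratio appropriately can stabilise unconstrained
        # fine-tuning without re-imposing offline constraints.
        for _ in range(utd_ratio):
            agent.update(buffer.sample(batch_size=256))
    return agent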



Paperid:1330
Authors:Qianhan Feng, Lujing Xie, Shijie Fang, Tong Lin
National Key Laboratory of General Artificial Intelligence, China School of Intelligence Science and Technology, Peking University, Yuanpei College, Peking University, School of Intelligence Science and Technology, Peking University Google, Shanghai, China, National Key Laboratory of General Artificial Intelligence, China School of Intelligence Science and Technology, Peking University
Abstract:
Semi-supervised Learning (SSL) reduces the need for extensive annotations in deep learning, but the more realistic challenge of imbalanced data distribution in SSL remains largely unexplored. In Class Imbalanced Semi-supervised Learning (CISSL), the bias introduced by unreliable pseudo-labels can be exacerbated by imbalanced data distributions. Most existing methods address this issue at the instance level through reweighting or resampling, but the performance is heavily limited by their reliance on biased backbone representation. Some other methods do perform feature-level adjustments like feature blending but might introduce unfavorable noise. In this paper, we discuss the bonus of a more balanced feature distribution for the CISSL problem, and further propose a Balanced Feature-Level Contrastive Learning method (BaCon). Our method directly regularizes the distribution of instances' representations in a well-designed contrastive manner. Specifically, class-wise feature centers are computed as the positive anchors, while negative anchors are selected by a straightforward yet effective mechanism. A distribution-related temperature adjustment is leveraged to control the class-wise contrastive degrees dynamically. Our method demonstrates its effectiveness through comprehensive experiments on the CIFAR10-LT, CIFAR100-LT, STL10-LT, and SVHN-LT datasets across various settings. For example, BaCon surpasses the instance-level method FixMatch-based ABC on CIFAR10-LT with a 1.21% accuracy improvement, and outperforms the state-of-the-art feature-level method CoSSL on CIFAR100-LT with a 0.63% accuracy improvement. When encountering more extreme imbalance degrees, BaCon also shows better robustness than other methods.
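A rough sketch of a feature-level contrastive term with class-wise centers as positive anchors, in the spirit described above (BaCon's actual negative-anchor selection and temperature schedule differ; this sketch assumes every class appears in the batch):

import torch
import torch.nn.functional as F

def center_contrastive_loss(features, labels, num_classes, tau=0.1):
    feats = F.normalize(features, dim=-1)
    # Class-wise feature centers serve as the positive anchors.
    centers = torch.stack([
        F.normalize(feats[labels == c].mean(dim=0), dim=-1)
        for c in range(num_classes)
    ])
    logits = feats @ centers.t() / tau   # similarity of each feature to every center
    # Pull each feature towards its own class center, push it from the others.
    return F.cross_entropy(logits, labels)

loss = center_contrastive_loss(torch.randn(32, 64), torch.randint(0, 4, (32,)), 4)
print(loss.item())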



Paperid:1331
Authors:Shibo Feng, Chunyan Miao, Zhong Zhang, Peilin Zhao
School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore; Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore; Webank-NTU Joint Research Institute on Fintech, NTU, Singapore, School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore; Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore; Webank-NTU Joint Research Institute on Fintech, NTU, Singapore, Tencent AI Lab, Shenzhen, China, Tencent AI Lab, Shenzhen, China
Abstract:
The probability prediction of multivariate time series is a notoriously challenging but practical task. This research proposes to condense high-dimensional multivariate time series forecasting into a problem of latent space time series generation, to improve the expressiveness of each timestamp and make forecasting more manageable. To address the difficulty of extending existing work to high-dimensional multivariate time series, we present a latent multivariate time series diffusion framework called Latent Diffusion Transformer (LDT), which consists of a symmetric statistics-aware autoencoder and a diffusion-based conditional generator, to implement this idea. Through careful design, the time series autoencoder can compress multivariate timestamp patterns into a concise latent representation by considering dynamic statistics. Then, the diffusion-based conditional generator is able to efficiently generate realistic multivariate timestamp values on a continuous latent space under a novel self-conditioning guidance which is modeled in a non-autoregressive way. Extensive experiments demonstrate that our model achieves state-of-the-art performance on many popular high-dimensional multivariate time series datasets.



Paperid:1332
Authors:Wei Feng, Guoshuai Sheng, Qianqian Wang, Quanxue Gao, Zhiqiang Tao, Bo Dong
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China, 710049, School of Telecommunications Engineering, Xidian University, Xi'an, Shaanxi, China, 710071, School of Telecommunications Engineering, Xidian University, Xi'an, Shaanxi, China, 710071, School of Telecommunications Engineering, Xidian University, Xi'an, Shaanxi, China, 710071, Rochester Institute of Technology, School of Continuing Education, Xi’an Jiaotong University, Xi’an, Shaanxi, China, 710049
Abstract:
Partial multi-view clustering is a challenging and practical research problem for data analysis in real-world applications, due to the potential data missing issue in different views. However, most existing methods have not fully explored the correlation information among various incomplete views. In addition, these existing clustering methods always ignore discovering discriminative features inside the data itself in this unsupervised task. To tackle these challenges, we propose Partial Multi-View Clustering via Self-Supervised Network (PVC-SSN) in this paper. Specifically, we employ contrastive learning to obtain a more discriminative and consistent subspace representation, which is guided by a self-supervised module. Self-supervised learning can exploit effective cluster information through the data itself to guide the learning process of clustering tasks. Thus, it can pull together embedding features from the same cluster and push apart those from different clusters. Extensive experiments on several benchmark datasets show that the proposed PVC-SSN method outperforms several state-of-the-art clustering methods.



Paperid:1333
Authors:Jan Finkbeiner, Thomas Gmeinder, Mark Pupilli, Alexander Titterton, Emre Neftci
Research Center Juelich RWTH Aachen, Graphcore, Graphcore, Graphcore, Research Center Juelich RWTH Aachen
Abstract:
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), that excel at accelerating parallel workloads and dense vector matrix multiplications. Potentially more efficient neural network models utilizing sparsity and recurrence cannot leverage the full power of SIMD processors and are thus at a severe disadvantage compared to today's prominent parallel architectures like Transformers and CNNs, thereby hindering the path towards more sustainable AI. To overcome this limitation, we explore sparse and recurrent model training on a massively parallel multiple instruction multiple data (MIMD) architecture with distributed local memory. We implement a training routine based on backpropagation through time (BPTT) for the brain-inspired class of Spiking Neural Networks (SNNs) that feature binary sparse activations. We observe a massive advantage in using sparse activation tensors with a MIMD processor, the Intelligence Processing Unit (IPU), compared to GPUs. On training workloads, our results demonstrate 5-10x throughput gains compared to A100 GPUs and up to 38x gains for higher levels of activation sparsity, without a significant slowdown in training convergence or reduction in final model performance. Furthermore, our results show highly promising trends for both single and multi IPU configurations as we scale up to larger model sizes. Our work paves the way towards more efficient, non-standard models via AI training hardware beyond GPUs, and competitive large scale SNN models.



Paperid:1334
Authors:Stefano Fiorini, Stefano Coniglio, Michele Ciavotta, Enza Messina
Italian Institute of Technology, University of Bergamo, University of Milano-Bicocca, University of Milano-Bicocca
Abstract:
We introduce QuaterGCN, a spectral Graph Convolutional Network (GCN) with quaternion-valued weights, at whose core lies the Quaternionic Laplacian, a quaternion-valued Laplacian matrix with which we generalize two widely-used Laplacian matrices: the classical Laplacian (defined for undirected graphs) and the complex-valued Sign-Magnetic Laplacian (proposed within the spectral GCN SigMaNet to handle digraphs with weights of arbitrary sign). In addition to its generality, QuaterGCN is, to the best of our knowledge, the only approach that completely preserves the (di)graph topology, as it can handle graphs and digraphs containing antiparallel pairs of edges (digons) of different weight without reducing them to a single (directed or undirected) edge as done by other Laplacians. Experimental results show the superior performance of QuaterGCN compared to other state-of-the-art GCNs, particularly in scenarios where the information the digons carry is crucial to successfully address the task at hand.



Paperid:1335
Authors:Dana Fisman, Noa Izsak, Swen Jacobs
Ben-Gurion University, Ben-Gurion University, CISPA Helmholtz Center for Information Security
Abstract:
The problem of learning a computational model from examples has been receiving growing attention. For the particularly challenging problem of learning models of distributed systems, existing results are restricted to models with a fixed number of interacting processes. In this work we look for the first time (to the best of our knowledge) at the problem of learning a distributed system with an arbitrary number of processes, assuming only that there exists a cutoff, i.e., a number of processes that is sufficient to produce all observable behaviors. Specifically, we consider fine broadcast protocols, i.e., broadcast protocols (BPs) with a finite cutoff and no hidden states. We provide a learning algorithm that can infer a correct BP from a sample that is consistent with a fine BP, and a minimal equivalent BP if the sample is sufficiently complete. On the negative side we show that (a) characteristic sets of exponential size are unavoidable, (b) the consistency problem for fine BPs is NP-hard, and (c) fine BPs are not polynomially predictable.



Paperid:1336
Authors:Manon Flageat, Bryan Lim, Antoine Cully
Imperial College London, Imperial College London, Imperial College London
Abstract:
Many applications of Reinforcement Learning (RL) involve noise or stochasticity in the environment. Beyond their impact on learning, these uncertainties lead the exact same policy to perform differently, i.e., yield different returns, from one rollout to another. Common evaluation procedures in RL summarise the consequent return distributions using solely the expected return, which does not account for the spread of the distribution. Our work defines this spread as the policy reproducibility: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications. We highlight that existing procedures that only use the expected return are limited on two fronts: first, an infinite number of return distributions with a wide range of performance-reproducibility trade-offs can have the same expected return, limiting its effectiveness when used for comparing policies; second, the expected return metric does not leave any room for practitioners to choose the best trade-off value for considered applications. In this work, we address these limitations by recommending the use of Lower Confidence Bound, a metric taken from Bayesian optimisation that provides the user with a preference parameter to choose a desired performance-reproducibility trade-off. We also formalise and quantify policy reproducibility, and demonstrate the benefit of our metrics using extensive experiments of popular RL algorithms on common uncertain RL tasks.
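A toy illustration of the recommended metric (our sketch): score a policy by its mean return minus a user-chosen multiple of the spread across rollouts, where alpha encodes the performance-reproducibility preference and alpha = 0 recovers the plain expected return.

import numpy as np

def lcb_score(returns, alpha=1.0):
    returns = np.asarray(returns, dtype=float)
    return returns.mean() - alpha * returns.std()

# Two policies with the same expected return but different reproducibility.
stable  = np.random.default_rng(0).normal(100.0,  2.0, size=1000)
erratic = np.random.default_rng(1).normal(100.0, 30.0, size=1000)
print(lcb_score(stable), lcb_score(erratic))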



Paperid:1337
Authors:Kei Sen Fong, Mehul Motani
National University of Singapore, National University of Singapore
Abstract:
We introduce a conceptually simple yet effective method to create small, compact decision trees by using splits found via Symbolic Regression (SR). Traditional decision tree (DT) algorithms partition a dataset on axis-parallel splits. When the true boundaries are not along the feature axes, DT is likely to have a complicated structure and a dense decision boundary. In this paper, we introduce SR-Enhanced DT (SREDT) - a method which utilizes SR to increase the richness of the class of possible DT splits. We evaluate SREDT on both synthetic and real-world datasets. Despite its simplicity, our method produces surprisingly small trees that outperform both DT and oblique DT (ODT) on supervised classification tasks in terms of accuracy and F-score. We show empirically that SREDTs decrease inference time (compared to DT and ODT) and argue that they allow us to obtain more explainable descriptions of the decision process. SREDT also performs competitively against state-of-the-art tabular classification methods, including tree ensembles and deep models. Finally, we introduce a local search mechanism to improve SREDT and evaluate it on 56 PMLB datasets. This mechanism shows improved performance on 77.2% of the datasets, outperforming DT and ODT. In terms of F-Score, local SREDT outperforms DT and ODT in 82.5% and 73.7% of the datasets respectively and in terms of inference time, local SREDT requires 25.8% and 26.6% less inference time than DT and ODT respectively.
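A minimal illustration of why SR-derived splits help (our sketch; the hand-written expression x0**2 + x1**2 stands in for a feature an SR engine would discover): on a circular class boundary, a single split on the SR feature matches or beats a depth-3 axis-parallel tree.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # non-axis-aligned boundary

plain = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
sr_feature = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
sredt_like = DecisionTreeClassifier(max_depth=1, random_state=0).fit(sr_feature, y)

print("axis-parallel depth-3 accuracy:", plain.score(X, y))
print("SR-feature depth-1 accuracy:  ", sredt_like.score(sr_feature, y))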



Paperid:1338
Authors:Jack Foster, Stefan Schoepf, Alexandra Brintrup
University of Cambridge Alan Turing Institute, University of Cambridge, University of Cambridge Alan Turing Institute
Abstract:
Machine unlearning, the ability for a machine learning model to forget, is becoming increasingly important to comply with data privacy regulations, as well as to remove harmful, manipulated, or outdated information. The key challenge lies in forgetting specific information while protecting model performance on the remaining data. While current state-of-the-art methods perform well, they typically require some level of retraining over the retained data, in order to protect or restore model performance. This adds computational overhead and mandates that the training data remain available and accessible, which may not be feasible. In contrast, other methods employ a retrain-free paradigm; however, these approaches are prohibitively computationally expensive and do not perform on par with their retrain-based counterparts. We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning which is fast, performant, and does not require long-term storage of the training data. First, SSD uses the Fisher information matrix of the training and forgetting data to select parameters that are disproportionately important to the forget set. Second, SSD induces forgetting by dampening these parameters proportional to their relative importance to the forget set with respect to the wider training data. We evaluate our method against several existing unlearning methods in a range of experiments using ResNet18 and Vision Transformer. Results show that the performance of SSD is competitive with retrain-based post hoc methods, demonstrating the viability of retrain-free post hoc unlearning approaches.
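A sketch of the two SSD steps described above, with simplified selection and dampening rules (the thresholds alpha and lam below are illustrative, not the paper's hyperparameters):

import torch

def diagonal_fisher(model, loader, loss_fn):
    # Accumulate a diagonal Fisher approximation from squared gradients.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return fisher

@torch.no_grad()
def selective_synaptic_dampening(model, fisher_full, fisher_forget,
                                 alpha=10.0, lam=1.0):
    for n, p in model.named_parameters():
        # Step 1: select weights disproportionately important to the forget set.
        important_to_forget = fisher_forget[n] > alpha * fisher_full[n]
        # Step 2: dampen them in proportion to their relative importance.
        scale = torch.where(
            important_to_forget,
            (lam * fisher_full[n] / (fisher_forget[n] + 1e-12)).clamp(max=1.0),
            torch.ones_like(p),
        )
        p.mul_(scale)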



Paperid:1339
Authors:Fares Fourati, Christopher John Quinn, Mohamed-Slim Alouini, Vaneet Aggarwal
King Abdullah University of Science and Technology (KAUST), Iowa State University, King Abdullah University of Science and Technology (KAUST), Purdue University King Abdullah University of Science and Technology (KAUST)
Abstract:
We propose a novel combinatorial stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of n arms at each time step t in [T] is observed. SGB adopts an optimized stochastic-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms. Unlike existing methods that explore the entire set of unselected base arms during each selection step, our SGB algorithm samples only an optimized proportion of unselected arms and selects actions from this subset. We prove that our algorithm achieves a (1-1/e)-regret bound of O(n^(1/3) k^(2/3) T^(2/3) log(T)^(2/3)) for monotone stochastic submodular rewards, which outperforms the state-of-the-art in terms of the cardinality constraint k. Furthermore, we empirically evaluate the performance of our algorithm in the context of online constrained social influence maximization. Our results demonstrate that our proposed approach consistently outperforms the other algorithms, increasing the performance gap as k grows.



Paperid:1340
Authors:Feisi Fu, Zhilu Wang, Weichao Zhou, Yixuan Wang, Jiameng Fan, Chao Huang, Qi Zhu, Xin Chen, Wenchao Li
Boston University, Northwestern University, Boston University, Northwestern University, Boston University, Univeristy of Liverpool, Northwestern University, University of New Mexico, Boston University
Abstract:
We present REGLO, a novel methodology for repairing pretrained neural networks to satisfy global robustness and individual fairness properties. A neural network is said to be globally robust with respect to a given input region if and only if all the input points in the region are locally robust. This notion of global robustness also captures the notion of individual fairness as a special case. We prove that any counterexample to a global robustness property must exhibit a corresponding large gradient. For ReLU networks, this result allows us to efficiently identify the linear regions that violate a given global robustness property. By formulating and solving a suitable robust convex optimization problem, REGLO then computes a minimal weight change that will provably repair these violating linear regions.



Paperid:1341
Authors:Minghan Fu, Fang-Xiang Wu
University of Saskatchewan, University of Saskatchewan
Abstract:
The learning rate is a critical hyperparameter for deep learning tasks since it determines the extent to which the model parameters are adjusted during the learning course. However, the choice of learning rates typically depends on empirical judgment, which may not result in satisfactory outcomes without intensive trial-and-error experiments. In this study, we propose a novel learning rate adaptation scheme called QLABGrad. Without any user-specified hyperparameter, QLABGrad automatically determines the learning rate by optimizing the quadratic loss approximation-based (QLAB) function for a given gradient descent direction, where only one extra forward propagation is required. We theoretically prove the convergence of QLABGrad under the smooth Lipschitz condition on the loss function. Experimental results on multiple architectures, including MLP, CNN, and ResNet, on MNIST, CIFAR10, and ImageNet datasets, demonstrate that QLABGrad outperforms widely adopted schemes for deep learning.
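One way to realize a quadratic-approximation rule of this kind (our illustration, not necessarily the exact QLABGrad update): fit a one-dimensional quadratic to the loss along the negative gradient using the current loss, the squared gradient norm, and one extra probe forward pass, then step to the quadratic's minimizer.

import torch

def qlab_learning_rate(loss_fn, params, grads, probe=1e-3):
    loss0 = loss_fn(params)
    g2 = sum((g ** 2).sum() for g in grads)                  # ||g||^2
    probed = [p - probe * g for p, g in zip(params, grads)]
    loss1 = loss_fn(probed)                                   # the one extra forward pass
    # Model phi(eta) = loss0 - eta*||g||^2 + 0.5*c*eta^2 and solve phi'(eta) = 0.
    curvature = 2.0 * (loss1 - loss0 + probe * g2) / probe ** 2
    if curvature <= 0:
        return probe                                          # fall back to the probe step
    return (g2 / curvature).item()

# Toy check on f(w) = (w - 3)^2, where the rule recovers eta close to 1/2,
# the step that jumps straight to the minimum.
w = [torch.tensor([0.0], requires_grad=True)]
f = lambda ps: ((ps[0] - 3.0) ** 2).sum()
loss = f(w); loss.backward()
print(qlab_learning_rate(f, [p.detach() for p in w], [p.grad for p in w]))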



Paperid:1342
Authors:Minghao Fu, Ke Zhu, Jianxin Wu
National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China
Abstract:
As pre-trained models rapidly grow larger, the cost of fine-tuning them on downstream tasks steadily increases, too. To economically fine-tune these models, parameter-efficient transfer learning (PETL) is proposed, which only tunes a tiny subset of trainable parameters to efficiently learn quality representations. However, current PETL methods face the dilemma that, during training, the GPU memory footprint is not reduced as effectively as the number of trainable parameters. PETL is likely to fail, too, if full fine-tuning runs out of GPU memory. This phenomenon happens because trainable parameters from these methods are generally entangled with the backbone, such that a lot of intermediate states have to be stored in GPU memory for gradient propagation. To alleviate this problem, we introduce Disentangled Transfer Learning (DTL), which disentangles the trainable parameters from the backbone using a lightweight Compact Side Network (CSN). By progressively extracting task-specific information with a few low-rank linear mappings and appropriately adding the information back to the backbone, CSN effectively realizes knowledge transfer in various downstream tasks. We conducted extensive experiments to validate the effectiveness of our method. The proposed method not only reduces a large amount of GPU memory usage and trainable parameters, but also outperforms existing PETL methods by a significant margin in accuracy, achieving new state-of-the-art on several standard benchmarks.



Paperid:1343
Authors:Zhongtian Fu, Kefei Song, Luping Zhou, Yang Yang
Nanjing University of Science and Technology, Nanjing 210094, China, Nanjing University of Science and Technology, Nanjing 210094, China, The University of Sydney, Sydney 2052, Australia, Nanjing University of Science and Technology, Nanjing 210094, China
Abstract:
Image captioning aims to automatically generate captions for images by learning a cross-modal generator from vision to language. The large amount of image-text pairs required for training is usually sourced from the internet due to the cost of manual annotation, which introduces noise in the form of mismatched relevance that affects the learning process. Unlike traditional noisy label learning, the key challenge in processing noisy image-text pairs is to finely identify the mismatched words to make the most use of trustworthy information in the text, rather than coarsely weighting entire examples. To tackle this challenge, we propose a Noise-aware Image Captioning method (NIC) to adaptively mitigate the erroneous guidance from noise by progressively exploring mismatched words. Specifically, NIC first identifies mismatched words by quantifying word-label reliability from two aspects: 1) inter-modal representativeness, which measures the significance of the current word by assessing cross-modal correlation via prediction certainty; 2) intra-modal informativeness, which amplifies the effect of current prediction by combining the quality of subsequent word generation. During optimization, NIC constructs the pseudo-word-labels considering the reliability of the original word-labels and model convergence to periodically coordinate mismatched words. As a result, NIC can effectively exploit both clean and noisy image-text pairs to learn a more robust mapping function. Extensive experiments conducted on the MS-COCO and Conceptual Caption datasets validate the effectiveness of our method in various noisy scenarios.



Paperid:1344
Authors:Harmender Gahlawat, Meirav Zehavi
Ben-Gurion University, Beersheba, Ben-Gurion University, Beersheba
Abstract:
Decision trees are a fundamental tool in machine learning for representing, classifying, and generalizing data. It is desirable to construct "small" decision trees, by minimizing either the size (s) or the depth (d) of the decision tree (DT). Recently, the parameterized complexity of Decision Tree Learning has attracted a lot of attention. We consider a generalization of Decision Tree Learning where given a classification instance E and an integer t, the task is to find a "small" DT that disagrees with E in at most t examples. We consider two problems: DTSO and DTDO, where the goal is to construct a DT minimizing s and d, respectively. We first establish that both DTSO and DTDO are W[1]-hard when parameterized by s+y and d+y, respectively, where y is the maximum number of features in which two differently labeled examples can differ. We complement this result by showing that these problems become FPT if we include the parameter t. We also consider the kernelization complexity of these problems and establish several positive and negative results for both DTSO and DTDO.



Paperid:1345
Authors:Filippo Galli, Catuscia Palamidessi, Tommaso Cucinotta
Scuola Normale Superiore, Pisa, Italy Scuola Superiore Sant'Anna, Pisa, Italy, INRIA, Palaiseau, France École Polytechnique, Palaiseau, France, Scuola Superiore Sant'Anna, Pisa, Italy
Abstract:
Training differentially private machine learning models requires constraining an individual's contribution to the optimization process. This is achieved by clipping the 2-norm of their gradient at a predetermined threshold prior to averaging and batch sanitization. This selection adversely influences optimization in two opposing ways: it either exacerbates the bias due to excessive clipping at lower values, or augments sanitization noise at higher values. The choice significantly hinges on factors such as the dataset, model architecture, and even varies within the same optimization, demanding meticulous tuning usually accomplished through a grid search. In order to circumvent the privacy expenses incurred in hyperparameter tuning, we present a novel approach to dynamically optimize the clipping threshold. We treat this threshold as an additional learnable parameter, establishing a clean relationship between the threshold and the cost function. This allows us to optimize the former with gradient descent, with minimal repercussions on the overall privacy analysis. Our method is thoroughly assessed against alternative fixed and adaptive strategies across diverse datasets, tasks, model dimensions, and privacy levels. Our results indicate that it performs comparably or better in the evaluated scenarios, given the same privacy requirements.
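For context, a standard DP-SGD per-example step in which the clipping threshold appears (a plain illustration of the mechanism whose threshold the paper learns; this is not the paper's learnable-threshold update):

import torch

def dp_sgd_step(per_example_grads, clip_threshold, noise_multiplier):
    clipped = []
    for g in per_example_grads:                       # one flat gradient per example
        norm = g.norm()
        clipped.append(g * torch.clamp(clip_threshold / (norm + 1e-12), max=1.0))
    mean = torch.stack(clipped).mean(dim=0)
    # Lower thresholds add clipping bias; higher thresholds add more sanitization noise.
    noise = torch.randn_like(mean) * noise_multiplier * clip_threshold / len(clipped)
    return mean + noise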



Paperid:1346
Authors:Alireza Ganjdanesh, Shangqian Gao, Hirad Alipanah, Heng Huang
University of Maryland, College Park, University of Pittsburgh, University of Pittsburgh, University of Maryland, College Park
Abstract:
Generative Adversarial Networks (GANs) have shown remarkable success in modeling complex data distributions for image-to-image translation. Still, their high computational demands prohibit their deployment in practical scenarios like edge devices. Existing GAN compression methods mainly rely on knowledge distillation or pruning techniques developed for convolutional classifiers. Thus, they neglect the critical characteristic of GANs: their local density structure over their learned manifold. Accordingly, we approach GAN compression from a new perspective by explicitly encouraging the pruned model to preserve the density structure of the original parameter-heavy model on its learned manifold. We facilitate this objective for the pruned model by partitioning the learned manifold of the original generator into local neighborhoods around its generated samples. Then, we propose a novel pruning objective to regularize the pruned model to preserve the local density structure over each neighborhood, resembling the kernel density estimation method. Also, we develop a collaborative pruning scheme in which the discriminator and generator are pruned by two pruning agents. We design the agents to capture interactions between the generator and discriminator by exchanging their peer's feedback when determining corresponding models' architectures. Thanks to such a design, our pruning method can efficiently find performant sub-networks and can maintain the balance between the generator and discriminator more effectively compared to baselines during pruning, thereby showing more stable pruning dynamics. Our experiments on image translation GAN models, Pix2Pix and CycleGAN, with various benchmark datasets and architectures demonstrate our method's effectiveness.



Paperid:1347
Authors:Chen-Xiao Gao, Chenyang Wu, Mingjun Cao, Rui Kong, Zongzhang Zhang, Yang Yu
National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China
Abstract:
Decision Transformer (DT), which employs expressive sequence modeling techniques to perform action generation, has emerged as a promising approach to offline policy optimization. However, DT generates actions conditioned on a desired future return, which is known to bear some weaknesses such as the susceptibility to environmental stochasticity. To overcome DT's weaknesses, we propose to empower DT with dynamic programming. Our method comprises three steps. First, we employ in-sample value iteration to obtain approximated value functions, which involves dynamic programming over the MDP structure. Second, we evaluate action quality in context with estimated advantages. We introduce two types of advantage estimators, IAE and GAE, which are suitable for different tasks. Third, we train an Advantage-Conditioned Transformer (ACT) to generate actions conditioned on the estimated advantages. Finally, during testing, ACT generates actions conditioned on a desired advantage. Our evaluation results validate that, by leveraging the power of dynamic programming, ACT demonstrates effective trajectory stitching and robust action generation in spite of the environmental stochasticity, outperforming baseline methods across various benchmarks. Additionally, we conduct an in-depth analysis of ACT's various design choices through ablation studies. Our code is available at https://github.com/LAMDA-RL/ACT.
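For reference, the textbook form of Generalized Advantage Estimation (GAE), one of the two advantage estimators mentioned above:

import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    # values has length len(rewards) + 1 (bootstrap value for the final state).
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

print(gae([1.0, 0.0, 1.0], [0.5, 0.4, 0.6, 0.0]))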



Paperid:1348
Authors:Ge Gao, Xi Yang, Min Chi
North Carolina State University, IBM Research, North Carolina State University
Abstract:
Reinforcement learning (RL) is broadly employed in human-involved systems to enhance human outcomes. Off-policy evaluation (OPE) has been pivotal for RL in those realms since online policy learning and evaluation can be high-stake. Intelligent tutoring has attracted tremendous attention as a highly challenging setting for applying OPE to human-involved systems, because subgroups of students can favor different pedagogical policies and because of the costly procedure in which policies have to be induced fully offline and then directly deployed in the upcoming semester. In this work, we formulate on-demand pedagogical policy selection (ODPS) to tackle the challenges for OPE in intelligent tutoring. We propose a pipeline, EduPlanner, as a concrete solution for ODPS. Our pipeline results in a theoretically unbiased estimator, and enables efficient and customized policy selection by identifying subgroups over both historical data and on-arrival initial logs. We evaluate our approach on the Probability ITS that has been used in real classrooms for over eight years. Our study shows significant improvement on learning outcomes of students with EduPlanner, especially for the ones associated with low-performing subgroups.



Paperid:1349
Authors:Hang Gao, Chengyu Yao, Jiangmeng Li, Lingyu Si, Yifan Jin, Fengge Wu, Changwen Zheng, Huaping Liu
Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences State Key Laboratory of Intelligent Game, Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Tsinghua University
Abstract:
Graph Neural Networks (GNNs) demonstrate their significance by effectively modeling complex interrelationships within graph-structured data. To enhance the credibility and robustness of GNNs, it becomes exceptionally crucial to bolster their ability to capture causal relationships. However, despite recent advancements that have indeed strengthened GNNs from a causal learning perspective, conducting an in-depth analysis specifically targeting the causal modeling prowess of GNNs remains an unresolved issue. In order to comprehensively analyze various GNN models from a causal learning perspective, we constructed an artificially synthesized dataset with known and controllable causal relationships between data and labels. The rationality of the generated data is further ensured through theoretical foundations. Drawing insights from analyses conducted using our dataset, we introduce a lightweight and highly adaptable GNN module designed to strengthen GNNs' causal learning capabilities across a diverse range of tasks. Through a series of experiments conducted on both synthetic datasets and other real-world datasets, we empirically validate the effectiveness of the proposed module. The codes are available at https://github.com/yaoyao-yaoyao-cell/CRCG.



Paperid:1350
Authors:Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu
Nanjing University, Nanjing University, Nanjing University, The Chinese University of Hong Kong, Nanjing University
Abstract:
Audio-visual segmentation (AVS) aims to locate and segment the sounding objects in a given video, which demands audio-driven pixel-level scene understanding. The existing methods cannot fully process the fine-grained correlations between audio and visual cues across various situations dynamically. They also face challenges in adapting to complex scenarios, such as evolving audio, the coexistence of multiple objects, and more. In this paper, we propose AVSegFormer, a novel framework for AVS that leverages the transformer architecture. Specifically, it comprises a dense audio-visual mixer, which can dynamically adjust interested visual features, and a sparse audio-visual decoder, which implicitly separates audio sources and automatically matches optimal visual features. Combining both components provides a more robust bidirectional conditional multi-modal representation, improving the segmentation performance in different scenarios. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.



Paperid:1351
Authors:Anne-Marie George, Christos Dimitrakakis
University of Oslo, Norway, University of Oslo, Norway University of Neuchatel, Switzerland
Abstract:
We formulate the problem of eliciting agents' preferences with the goal of finding a Kemeny ranking as a Dueling Bandits problem. Here the bandits' arms correspond to alternatives that need to be ranked and the feedback corresponds to a pairwise comparison between alternatives by a randomly sampled agent. We consider both sampling with and without replacement, i.e., the possibility to ask the same agent about some comparison multiple times or not. We find approximation bounds for Kemeny rankings dependent on confidence intervals over estimated winning probabilities of arms. Based on these, we state algorithms to find Probably Approximately Correct (PAC) solutions and elaborate on their sample complexity for sampling with or without replacement. Furthermore, if all agents' preferences are strict rankings over the alternatives, we provide means to prune confidence intervals and thereby guide a more efficient elicitation. We formulate several adaptive sampling methods that use lookaheads to estimate how much confidence intervals (and thus approximation guarantees) might be tightened. All described methods are compared on synthetic data.



Paperid:1352
Authors:Joseph Giovanelli, Alexander Tornede, Tanja Tornede, Marius Lindauer
Alma Mater Studiorum - University of Bologna, Institute of Artificial Intelligence L3S Research Center Leibniz University Hannover, Institute of Artificial Intelligence L3S Research Center Leibniz University Hannover, Institute of Artificial Intelligence L3S Research Center Leibniz University Hannover
Abstract:
Hyperparameter optimization (HPO) is important to leverage the full potential of machine learning (ML). In practice, users are often interested in multi-objective (MO) problems, i.e., optimizing potentially conflicting objectives, like accuracy and energy consumption. To tackle this, the vast majority of MO-ML algorithms return a Pareto front of non-dominated machine learning models to the user. Optimizing the hyperparameters of such algorithms is non-trivial as evaluating a hyperparameter configuration entails evaluating the quality of the resulting Pareto front. In literature, there are known indicators that assess the quality of a Pareto front (e.g., hypervolume, R2) by quantifying different properties (e.g., volume, proximity to a reference point). However, choosing the indicator that leads to the desired Pareto front might be a hard task for a user. In this paper, we propose a human-centered interactive HPO approach tailored towards multi-objective ML leveraging preference learning to extract desiderata from users that guide the optimization. Instead of relying on the user guessing the most suitable indicator for their needs, our approach automatically learns an appropriate indicator. Concretely, we leverage pairwise comparisons of distinct Pareto fronts to learn such an appropriate quality indicator. Then, we optimize the hyperparameters of the underlying MO-ML algorithm towards this learned indicator using a state-of-the-art HPO approach. In an experimental study targeting the environmental impact of ML, we demonstrate that our approach leads to substantially better Pareto fronts compared to optimizing based on a wrong indicator pre-selected by the user, and performs comparably in the case of an advanced user who knows which indicator to pick.



Paperid:1353
Authors:Chengyue Gong, Xiaocong Du, Bhargav Bhushanam, Lemeng Wu, Xingchao Liu, Dhruv Choudhary, Arun Kejariwal, Qiang Liu
University of Texas at Austin, Meta, Inc., Meta, Inc., University of Texas at Austin, University of Texas at Austin, Meta, Inc., Meta, Inc., University of Texas at Austin
Abstract:
Very deep neural networks lead to significantly better performance on various real tasks. However, this usually causes slow inference and makes the models hard to deploy on real-world devices. How to reduce the number of layers to save memory and accelerate inference is therefore a topic of great interest. In this work, we introduce an intermediate objective, a continuous-time network, before distilling deep networks into shallow networks. First, we distill a given deep network into a continuous-time neural flow model, which can be discretized with an ODE solver and the inference requires passing through the network multiple times. By forcing the flow transport trajectory to be straight lines, we find that it is easier to compress the infinite step model into a one-step neural flow model, which only requires passing through the flow model once. Secondly, we refine the one-step flow model together with the final head layer with knowledge distillation and finally, we can replace the given deep network with this one-step flow network. Empirically, we demonstrate that our method outperforms direct distillation and other baselines on different model architectures (e.g. ResNet, ViT) on image classification and semantic segmentation tasks. We also manifest that our distilled model naturally serves as an early-exit dynamic inference model.



Paperid:1354
Authors:Ruihao Gong, Yang Yong, Zining Wang, Jinyang Guo, Xiuying Wei, Yuqing Ma, Xianglong Liu
State Key Laboratory of Complex & Critical Software Environment, Beihang University SenseTime Research, SenseTime Research, State Key Laboratory of Complex & Critical Software Environment, Beihang University, Institute of Artificial Intelligence, Beihang University State Key Laboratory of Complex & Critical Software Environment, Beihang University, SenseTime Research, Institute of Artificial Intelligence, Beihang University State Key Laboratory of Complex & Critical Software Environment, Beihang University, State Key Laboratory of Complex & Critical Software Environment, Beihang University
Abstract:
Neural network sparsity has attracted many research interests due to its similarity to biological schemes and high energy efficiency. However, existing methods depend on lengthy training or fine-tuning, which prevents large-scale applications. Recently, some works focusing on post-training sparsity (PTS) have emerged. They get rid of the high training cost but usually suffer from noticeable accuracy degradation because they neglect to choose a reasonable sparsity rate at each layer. Previous methods for finding sparsity rates mainly focus on the training-aware scenario, which usually fails to converge stably under the PTS setting with limited data and much less training cost. In this paper, we propose a fast and controllable post-training sparsity (FCPTS) framework. By incorporating a differentiable bridge function and a controllable optimization objective, our method allows for rapid and accurate sparsity allocation learning in minutes, with the added assurance of convergence to a predetermined global sparsity rate. Equipped with these techniques, we can surpass the state-of-the-art methods by a large margin, e.g., over 30% improvement for ResNet-50 on ImageNet under a sparsity rate of 80%. Our plug-and-play code and supplementary materials are open-sourced at https://github.com/ModelTC/FCPTS.



Paperid:1355
Authors:Kshitij Goyal, Sebastijan Dumancic, Hendrik Blockeel
KU Leuven, Belgium, Delft University of Technology, The Netherlands, KU Leuven, Belgium
Abstract:
As machine learning models, specifically neural networks, are becoming increasingly popular, there are concerns regarding their trustworthiness, especially in safety-critical applications, e.g., the actions of an autonomous vehicle must be safe. There are approaches that can train neural networks where such domain requirements are enforced as constraints, but they either cannot guarantee that the constraint will be satisfied by all possible predictions (even on unseen data) or they are limited in the type of constraints that can be enforced. In this paper, we present an approach to train neural networks which can enforce a wide variety of constraints and guarantee that the constraint is satisfied by all possible predictions. The approach builds on earlier work where learning linear models is formulated as a constraint satisfaction problem (CSP). To make this idea applicable to neural networks, two crucial new elements are added: constraint propagation over the network layers, and weight updates based on a mix of gradient descent and CSP solving. Evaluation on various machine learning tasks demonstrates that our approach is flexible enough to enforce a wide variety of domain constraints and is able to guarantee them in neural networks.



Paperid:1356
Authors:Andreas Grivas, Antonio Vergari, Adam Lopez
University of Edinburgh, University of Edinburgh, University of Edinburgh
Abstract:
Sigmoid output layers are widely used in multilabel classification (MLC) tasks, in which multiple labels can be assigned to any input. In many practical MLC tasks, the number of possible labels is in the thousands, often exceeding the number of input features and resulting in a low-rank output layer. In multi-class classification, it is known that such a low-rank output layer is a bottleneck that can result in unargmaxable classes: classes which cannot be predicted for any input. In this paper, we show that for MLC tasks, the analogous sigmoid bottleneck results in exponentially many unargmaxable label combinations. We explain how to detect these unargmaxable outputs and demonstrate their presence in three widely used MLC datasets. We then show that they can be prevented in practice by introducing a Discrete Fourier Transform (DFT) output layer, which guarantees that all sparse label combinations with up to k active labels are argmaxable. Our DFT layer trains faster and is more parameter efficient, matching the F1@k score of a sigmoid layer while using up to 50% fewer trainable parameters. Our code is publicly available at https://github.com/andreasgrv/sigmoid-bottleneck.
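The low-rank bottleneck can be illustrated with a small feasibility check (an illustration only, not the paper's detection algorithm): a 0/1 label combination is attainable only if some input to the output layer produces the matching sign pattern of the pre-sigmoid scores, which can be tested with a linear program.
```python
# Toy check: is a given 0/1 label combination attainable as the sign pattern of a
# low-rank layer W x + b for some input x? Sizes and thresholds are illustrative.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_labels, n_features = 8, 3                 # more labels than features -> low-rank bottleneck
W, b = rng.normal(size=(n_labels, n_features)), rng.normal(size=n_labels)

def attainable(pattern, eps=1e-6):
    """Feasibility LP: does some x give sign(W x + b) matching the 0/1 pattern?"""
    s = np.where(np.asarray(pattern) > 0, 1.0, -1.0)    # +1 for active labels, -1 otherwise
    A_ub = -(s[:, None] * W)                            # encodes s_i * (w_i . x + b_i) >= eps
    b_ub = s * b - eps
    res = linprog(c=np.zeros(n_features), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_features, method="highs")
    return res.status == 0                              # status 0 means a feasible x exists

patterns = [tuple(rng.integers(0, 2, n_labels)) for _ in range(20)]
print(sum(not attainable(p) for p in patterns), "of", len(patterns), "sampled patterns are unattainable here")
```
With eight labels but only three features, most randomly sampled label combinations are already unattainable, which is the bottleneck effect the abstract describes at scale.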



Paperid:1357
Authors:Jianyang Gu, Kai Wang, Wei Jiang, Yang You
Zhejiang University National University of Singapore, National University of Singapore, Zhejiang University, National University of Singapore
Abstract:
Replay-based methods have proved their effectiveness in online continual learning by rehearsing past samples from an auxiliary memory. However, while many efforts have been made to improve training schemes based on the memory, the information carried by each sample in the memory remains under-investigated. Under circumstances with restricted storage space, the informativeness of the memory becomes critical for effective replay. Although some works design specific strategies to select representative samples, by only employing a small number of original images, the storage space is still not well utilized. To this end, we propose to Summarize the knowledge from the Stream Data (SSD) into more informative samples by distilling the training characteristics of real images. By maintaining the consistency of training gradients and the relationship to past tasks, the summarized samples are more representative of the stream data than the original images. Extensive experiments are conducted on multiple online continual learning benchmarks to show that the proposed SSD method significantly enhances the replay effects. We demonstrate that, with limited extra computational overhead, SSD provides more than a 3% accuracy boost for sequential CIFAR-100 under an extremely restricted memory buffer. Code is available at https://github.com/vimar-gu/SSD.
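A generic gradient-matching step, the kind of consistency objective this summarization builds on, might look like the sketch below; the model, data shapes, and cosine objective are assumptions, and the relationship-preserving term of SSD is omitted.
```python
# Generic gradient-matching sketch (an illustration of the idea, not the full SSD method):
# make the gradient on learnable summarized samples align with the gradient on a real batch.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(32, 10)
syn_x = torch.randn(20, 32, requires_grad=True)           # learnable summarized samples
syn_y = torch.randint(0, 10, (20,))                        # fixed labels for the summary
opt = torch.optim.Adam([syn_x], lr=0.01)

def flat_grad(x, y):
    """Flattened gradient of the classification loss w.r.t. model parameters."""
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, model.parameters(), create_graph=x.requires_grad)
    return torch.cat([g.reshape(-1) for g in grads])

for step in range(100):
    real_x, real_y = torch.randn(64, 32), torch.randint(0, 10, (64,))
    g_real = flat_grad(real_x, real_y).detach()
    g_syn = flat_grad(syn_x, syn_y)                        # keeps the graph back to syn_x
    loss = 1.0 - F.cosine_similarity(g_real, g_syn, dim=0) # match gradient directions
    opt.zero_grad(); loss.backward(); opt.step()
```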



Paperid:1358
Authors:Anchun Gui, Jinqiang Ye, Han Xiao
Xiamen University, Xiamen University, Xiamen University
Abstract:
It has become a popular paradigm to transfer the knowledge of large-scale pre-trained models to various downstream tasks via fine-tuning the entire model parameters. However, with the growth of model scale and the rising number of downstream tasks, this paradigm inevitably meets challenges in terms of computation consumption and memory footprint. Recently, Parameter-Efficient Fine-Tuning (PEFT) (e.g., Adapter, LoRA, BitFit) has shown a promising way to alleviate these concerns by updating only a portion of parameters. Although these PEFTs have demonstrated satisfactory performance in natural language processing, one question remains under-explored: can these techniques be transferred to graph-based tasks with Graph Transformer Networks (GTNs)? In this paper, we fill this gap by providing extensive benchmarks with traditional PEFTs on a range of graph-based downstream tasks. Our empirical study shows that directly transferring existing PEFTs to graph-based tasks is sub-optimal due to the issue of feature distribution shift. To address this issue, we propose a novel structure-aware PEFT approach, named G-Adapter, which leverages a graph convolution operation to introduce graph structure information (e.g., the graph adjacency matrix) as an inductive bias to guide the updating process. Further, we propose Bregman proximal point optimization to alleviate feature distribution shift by preventing the model from aggressive updates. Extensive experiments demonstrate that G-Adapter obtains state-of-the-art performance compared to counterparts on nine graph benchmark datasets based on diverse pre-trained GTNs, and delivers tremendous memory footprint efficiency compared to the conventional paradigm.



Paperid:1359
Authors:Xianjie Guo, Kui Yu, Lin Liu, Jiuyong Li
Hefei University of Technology, Hefei University of Technology, University of South Australia, University of South Australia
Abstract:
As an emerging research direction, federated causal structure learning (CSL) aims to learn causal relationships from decentralized data across multiple clients while preserving data privacy. Existing federated CSL algorithms suffer from scalability and accuracy issues, since they require computationally expensive CSL algorithms to be executed at each client. Furthermore, in real-world scenarios, the number of samples held by each client varies significantly, yet existing methods still assign equal weights to the structural information learned from each client, which severely harms their learning accuracy. To address these two limitations, we propose FedCSL, a scalable and accurate method for federated CSL. Specifically, FedCSL consists of two novel strategies: (1) a federated local-to-global learning strategy that enables FedCSL to scale to high-dimensional data, tackling the scalability issue, and (2) a novel weighted aggregation strategy that does not rely on any complex encryption techniques while preserving data privacy, tackling the accuracy issue. Extensive experiments on benchmark datasets, high-dimensional synthetic datasets, and a real-world dataset verify the efficacy of the proposed FedCSL method. The source code is available at https://github.com/Xianjie-Guo/FedCSL.



Paperid:1360
Authors:Yufei Guo, Yuanpei Chen, Xiaode Liu, Weihang Peng, Yuhan Zhang, Xuhui Huang, Zhe Ma
Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC, Intelligent Science and Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC
Abstract:
The Spiking Neural Network (SNN), as one of the biologically inspired neural network infrastructures, has drawn increasing attention recently. It adopts binary spike activations to transmit information, so the multiplications of activations and weights can be substituted by additions, which brings high energy efficiency. However, in this paper, we theoretically and experimentally show that the binary spike activation map cannot carry enough information, causing information loss and decreased accuracy. To handle the problem, we propose a ternary spike neuron to transmit information. The ternary spike neuron retains the event-driven and multiplication-free operation advantages of the binary spike neuron while boosting the information capacity. Furthermore, we embed a trainable factor in the ternary spike neuron to learn a suitable spike amplitude, so our SNN adopts different spike amplitudes across layers, which better suits the phenomenon that membrane potential distributions differ across layers. To retain the efficiency of the vanilla ternary spike, the trainable ternary spike SNN is converted back to a standard one via a re-parameterization technique at inference time. Extensive experiments with several popular network structures over static and dynamic datasets show that the ternary spike can consistently outperform state-of-the-art methods. Our code is open-sourced at https://github.com/yfguo91/Ternary-Spike.
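A minimal sketch of a ternary spike activation with a learnable amplitude is shown below; the boxcar surrogate gradient, the per-layer amplitude, and the omission of membrane dynamics and the inference-time re-parameterization are all simplifying assumptions.
```python
# Minimal sketch: a ternary spike activation in {-1, 0, +1}, scaled by a trainable
# amplitude, with a straight-through surrogate gradient. Not the paper's exact neuron.
import torch
import torch.nn as nn

class TernarySpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane, threshold):
        ctx.save_for_backward(membrane)
        ctx.threshold = threshold
        spikes = torch.zeros_like(membrane)
        spikes[membrane >= threshold] = 1.0
        spikes[membrane <= -threshold] = -1.0
        return spikes

    @staticmethod
    def backward(ctx, grad_out):
        (membrane,) = ctx.saved_tensors
        # Boxcar surrogate: pass gradients near the thresholds, block them far away.
        surrogate = (membrane.abs() < 2 * ctx.threshold).float()
        return grad_out * surrogate, None

class TernaryNeuron(nn.Module):
    def __init__(self, threshold=1.0):
        super().__init__()
        self.threshold = threshold
        self.alpha = nn.Parameter(torch.ones(1))   # trainable spike amplitude (assumed per layer)

    def forward(self, membrane):
        return self.alpha * TernarySpike.apply(membrane, self.threshold)

neuron = TernaryNeuron()
out = neuron(torch.randn(4, 8, requires_grad=True))
out.sum().backward()   # gradients flow to both the input and the amplitude factor
```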



Paperid:1361
Authors:Dhawal Gupta, Scott M. Jordan, Shreyas Chaudhari, Bo Liu, Philip S. Thomas, Bruno Castro da Silva
University of Massachusetts, University of Alberta, University of Massachusetts, Amazon, University of Massachusetts, University of Massachusetts
Abstract:
In this paper, we introduce a fresh perspective on the challenges of credit assignment and policy evaluation. First, we delve into the nuances of eligibility traces and explore instances where their updates may result in unexpected credit assignment to preceding states. From this investigation emerges the concept of a novel value function, which we refer to as the bidirectional value function. Unlike traditional state value functions, bidirectional value functions account for both future expected returns (rewards anticipated from the current state onward) and past expected returns (cumulative rewards from the episode's start to the present). We derive principled update equations to learn this value function and, through experimentation, demonstrate its efficacy in enhancing the process of policy evaluation. In particular, our results indicate that the proposed learning approach can, in certain challenging contexts, perform policy evaluation more rapidly than TD(λ), a method that learns forward value functions, v^π, directly. Overall, our findings present a new perspective on eligibility traces and the potential advantages associated with the novel value function they inspire, especially for policy evaluation.



Paperid:1362
Authors:Seokhyeon Ha, Sunbeom Jeong, Jungwoo Lee
Seoul National University, Seoul National University, Seoul National University HodooAI LAB
Abstract:
Fine-tuning pre-trained neural network models has become a widely adopted approach across various domains. However, it can distort pre-trained feature extractors that already possess strong generalization capabilities, so mitigating feature distortion during adaptation to new target domains is crucial. Recent studies have shown promising results in handling feature distortion by aligning the head layer on in-distribution datasets before performing fine-tuning. Nonetheless, a significant limitation arises from the treatment of batch normalization layers during fine-tuning, leading to suboptimal performance. In this paper, we propose Domain-Aware Fine-Tuning (DAFT), a novel approach that incorporates batch normalization conversion and the integration of linear probing and fine-tuning. Our batch normalization conversion method effectively mitigates feature distortion by reducing modifications to the neural network during fine-tuning. Additionally, we introduce the integration of linear probing and fine-tuning to optimize the head layer with gradual adaptation of the feature extractor. By leveraging batch normalization layers and integrating linear probing and fine-tuning, DAFT significantly mitigates feature distortion and achieves improved performance on both in-distribution and out-of-distribution datasets. Extensive experiments demonstrate that our method outperforms other baselines, showing its effectiveness in not only improving performance but also mitigating feature distortion.



Paperid:1363
Authors:Qi Han, Li Zhu, Fei Guo
Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
The multi-armed bandit (MAB) is a classical sequential decision problem. Most work requires assumptions about the reward distribution (e.g., boundedness), while practitioners may have difficulty obtaining information about these distributions to design models for their problems, especially in non-stationary MAB problems. This paper aims to design a multi-armed bandit algorithm that can be implemented without using information about the reward distribution while still achieving substantial regret upper bounds. To this end, we propose a novel algorithm that alternates between a greedy rule and forced exploration. Our method can be applied to Gaussian, Bernoulli and other sub-Gaussian distributions, and its implementation does not require additional information. We employ a unified analysis for different forced exploration strategies and provide problem-dependent regret upper bounds for stationary and piecewise-stationary settings. Furthermore, we compare our algorithm with popular bandit algorithms on different reward distributions.
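A toy version of the greedy/forced-exploration alternation could look as follows; the square-number exploration schedule and the Gaussian reward simulation are illustrative assumptions, not the schedules analyzed in the paper.
```python
# Toy sketch: act greedily on empirical means, except on a sparse forced-exploration
# schedule. The algorithm itself never uses knowledge of the reward distribution.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.45])
K, horizon = len(true_means), 5000
counts, sums = np.zeros(K), np.zeros(K)

def forced_arm(t):
    """Force-explore arms round-robin at square-numbered steps (increasingly rare)."""
    root = int(np.sqrt(t))
    return t % K if root * root == t else None

regret = 0.0
for t in range(horizon):
    arm = forced_arm(t)
    if arm is None:                                  # otherwise follow the greedy rule
        means = np.where(counts > 0, sums / np.maximum(counts, 1), np.inf)
        arm = int(np.argmax(means))
    reward = rng.normal(true_means[arm], 1.0)        # environment; unknown to the learner
    counts[arm] += 1; sums[arm] += reward
    regret += true_means.max() - true_means[arm]

print(f"cumulative pseudo-regret after {horizon} steps: {regret:.1f}")
```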



Paperid:1364
Authors:Teemu Hankala, Miika Hannula, Juha Kontinen, Jonni Virtema
University of Helsinki, University of Helsinki, University of Helsinki, University of Sheffield
Abstract:
The training problem of neural networks (NNs) is known to be ∃R-complete with respect to ReLU and linear activation functions. We show that the training problem for NNs equipped with arbitrary activation functions is polynomial-time bireducible to the existential theory of the reals extended with the corresponding activation functions. For effectively continuous activation functions (e.g., the sigmoid function), we obtain an inclusion in low levels of the arithmetical hierarchy. Consequently, the sigmoid activation function leads to the existential theory of the reals with the exponential function, and hence the decidability of training NNs using the sigmoid activation function is equivalent to the decidability of the existential theory of the reals with the exponential function, a long-standing open problem. In contrast, we show that the training problem is undecidable if sinusoidal activation functions are considered.



Paperid:1365
Authors:Guang-Yuan Hao, Hengguan Huang, Haotian Wang, Jie Gao, Hao Wang
Hong Kong University of Science and Technology Mohamed bin Zayed University of Artificial Intelligence, National University of Singapore, JD Logistics, Rutgers University, Rutgers University
Abstract:
Active learning (AL) aims to improve model performance within a fixed labeling budget by choosing the most informative data points to label. Existing AL focuses on the single-domain setting, where all data come from the same domain (e.g., the same dataset). However, many real-world tasks often involve multiple domains. For example, in visual recognition, it is often desirable to train an image classifier that works across different environments (e.g., different backgrounds), where images from each environment constitute one domain. Such a multi-domain AL setting is challenging for prior methods because they (1) ignore the similarity among different domains when assigning the labeling budget and (2) fail to handle the distribution shift of data across different domains. In this paper, we propose the first general method, dubbed composite active learning (CAL), for multi-domain AL. Our approach explicitly considers the domain-level and instance-level information in the problem; CAL first assigns domain-level budgets according to domain-level importance, which is estimated by optimizing an upper error bound that we develop; with the domain-level budgets, CAL then leverages a certain instance-level query strategy to select samples to label from each domain. Our theoretical analysis shows that our method achieves a better error bound compared to current AL methods. Our empirical results demonstrate that our approach significantly outperforms the state-of-the-art AL methods on both synthetic and real-world multi-domain datasets. Code is available at https://github.com/Wang-ML-Lab/multi-domain-active-learning.



Paperid:1366
Authors:Pingting Hao, Kunpeng Liu, Wanfu Gao
Jilin university, Portland State University, Jilin University
Abstract:
Multi-view multi-label feature selection aims to select informative features where the data are collected from multiple sources with multiple interdependent class labels. To fully exploit multi-view information, most prior works mainly focus on the common part under ideal circumstances. However, the inconsistent part hidden in each view, including noise and specific elements, may affect the quality of the mapping between labels and feature representations. Meanwhile, ignoring the specific part might lead to a suboptimal result, as each label is supposed to possess specific characteristics of its own. To deal with these two problems in multi-view multi-label feature selection, we propose a unified loss function that fully splits the observed labels into hybrid labels, that is, common labels, view-to-all specific labels and noisy labels, where the view-to-all specific labels are further split into specific labels for each view. The proposed method simultaneously considers the consistency and complementarity of different views. By exploring the feature weights of hybrid labels, the mapping relationships between labels and features can be established sequentially based on their attributes. Additionally, the interrelatedness among hybrid labels is also investigated and injected into the loss function. For the specific labels of each view, we construct a novel regularization paradigm incorporating logic operations. Finally, the convergence of the result is proved after applying the multiplicative update rules. Experiments on six datasets demonstrate the effectiveness and superiority of our method compared with state-of-the-art methods.



Paperid:1367
Authors:Xiaotian Hao, Jianye Hao, Chenjun Xiao, Kai Li, Dong Li, Yan Zheng
College of Intelligence and Computing, Tianjin University, College of Intelligence and Computing, Tianjin University Noah’s Ark Lab, Huawei, Noah’s Ark Lab, Huawei, Noah’s Ark Lab, Huawei, Noah’s Ark Lab, Huawei, College of Intelligence and Computing, Tianjin University
Abstract:
AlphaZero and MuZero have achieved state-of-the-art (SOTA) performance in a wide range of domains, including board games and robotics, with discrete and continuous action spaces. However, to obtain an improved policy, they often require an excessively large number of simulations, especially for domains with large action spaces. As the simulation budget decreases, their performance drops significantly. In addition, many important real-world applications have combinatorial (or exponential) action spaces, making it infeasible to search directly over all possible actions. In this paper, we extend AlphaZero and MuZero to learn and plan in more complex multiagent (MA) Markov decision processes, where the action spaces increase exponentially with the number of agents. Our new algorithms, MA Gumbel AlphaZero and MA Gumbel MuZero, respectively without and with model learning, achieve superior performance on cooperative multiagent control problems, while reducing the number of environmental interactions by up to an order of magnitude compared to model-free approaches. In particular, we significantly improve over prior performance when planning with much smaller simulation budgets. The code and appendix are available at https://github.com/tjuHaoXiaotian/MA-MuZero.



Paperid:1368
Authors:Mohsin Hasan, Guojun Zhang, Kaiyang Guo, Xi Chen, Pascal Poupart
University of Waterloo Vector Institute, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, University of Waterloo Vector Institute
Abstract:
Federated Learning (FL) involves training a model over a dataset distributed among clients, with the constraint that each client's dataset is localized and possibly heterogeneous. In FL, small and noisy datasets are common, highlighting the need for well-calibrated models that represent the uncertainty of predictions. The closest FL techniques to achieving such goals are the Bayesian FL methods, which collect parameter samples from local posteriors and aggregate them to approximate the global posterior. To improve scalability for larger models, one common Bayesian approach is to approximate the global predictive posterior by multiplying local predictive posteriors. In this work, we demonstrate that this method gives systematically overconfident predictions, and we remedy this by proposing β-Predictive Bayes, a Bayesian FL algorithm that interpolates between a mixture and a product of the predictive posteriors, using a tunable parameter β. This parameter is tuned to improve the global ensemble's calibration before it is distilled into a single model. Our method is evaluated on a variety of regression and classification datasets to demonstrate its superiority in calibration over other baselines, even as data heterogeneity increases. Code is available at https://github.com/hasanmohsin/betaPredBayesFL. The full version of our paper is at https://arxiv.org/abs/2312.09817.
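One plausible reading of the interpolation, stated here purely as an assumption rather than the paper's exact formula, is a geometric blend between the mixture and the renormalized product of client predictive distributions, with β chosen to maximize held-out log-likelihood as a calibration proxy.
```python
# Hedged sketch: blend the mixture and the (renormalized) product of client class
# probabilities with a tunable beta, chosen on held-out data. The exact interpolation
# and tuning used by beta-Predictive Bayes may differ.
import numpy as np

def aggregate(client_probs, beta):
    """client_probs: (n_clients, n_samples, n_classes) per-client class probabilities."""
    mixture = client_probs.mean(axis=0)
    log_product = np.log(client_probs + 1e-12).sum(axis=0)
    product = np.exp(log_product - log_product.max(axis=-1, keepdims=True))
    product /= product.sum(axis=-1, keepdims=True)
    blended = mixture ** (1.0 - beta) * product ** beta          # geometric interpolation (assumption)
    return blended / blended.sum(axis=-1, keepdims=True)

def pick_beta(client_probs, labels, grid=np.linspace(0.0, 1.0, 11)):
    """Choose beta that minimizes held-out negative log-likelihood."""
    def nll(beta):
        p = aggregate(client_probs, beta)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return min(grid, key=nll)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=(3, 100))     # 3 clients, 100 validation samples, 5 classes
labels = rng.integers(0, 5, size=100)
print("selected beta:", pick_beta(probs, labels))
```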



Paperid:1369
Authors:Shreyas Havaldar, Jatin Chauhan, Karthikeyan Shanmugam, Jay Nandy, Aravindan Raghuveer
Google Research India, UCLA, Google Research India, Fujitsu Reseach India, Google Research India
Abstract:
Covariate shift in the test data is a common practical phenomenon that can significantly degrade both the accuracy and the fairness of a model. Ensuring fairness across different sensitive groups under covariate shift is of paramount importance due to societal implications like criminal justice. We operate in the unsupervised regime where only a small set of unlabeled test samples along with a labeled training set is available. Towards improving fairness under this highly challenging yet realistic scenario, we make three contributions. The first is a novel composite weighted entropy based objective for prediction accuracy which is optimized along with a representation matching loss for fairness. We experimentally verify that optimizing with our loss formulation outperforms a number of state-of-the-art baselines in the Pareto sense with respect to the fairness-accuracy tradeoff on several standard datasets. Our second contribution is a new setting we term Asymmetric Covariate Shift that, to the best of our knowledge, has not been studied before. Asymmetric covariate shift occurs when the distribution of covariates of one group shifts significantly compared to the other groups, which happens when a dominant group is over-represented. While this setting is extremely challenging for current baselines, we show that our proposed method significantly outperforms them. Our third contribution is theoretical: we show that our weighted entropy term along with the prediction loss on the training set approximates the test loss under covariate shift. Empirically and through formal sample complexity bounds, we show that this approximation to the unseen test loss does not depend on the importance sampling variance that affects many other baselines.



Paperid:1370
Authors:Dongxiao He, Jitao Zhao, Cuiying Huo, Yongqi Huang, Yuxiao Huang, Zhiyong Feng
Tianjin University, Tianjin University, Tianjin University, Tianjin University, George Washington University, Tianjin University
Abstract:
Graph contrastive learning (GCL) has attracted considerable attention because it can extract low-dimensional representations of graph data in a self-supervised manner. The InfoNCE-based loss function is widely used in graph contrastive learning; it pulls the representations of positive pairs close to each other and pushes the representations of negative pairs away from each other. Recent works mainly focus on designing new augmentation methods or sampling strategies. However, we argue that the widely used InfoNCE-based methods may contain an implicit conflict which seriously confuses models when learning from negative pairs. This conflict is engendered by the encoder's message-passing mechanism and the InfoNCE loss function. As a result, the learned representations of negative samples cannot be far away from each other, compromising model performance. To the best of our knowledge, this is the first work to report and analyze this conflict in GCL. To address this problem, we propose a simple but effective method called Partial ignored Graph Contrastive Learning (PiGCL). Specifically, PiGCL first dynamically captures the conflicts during training by detecting the gradient of representation similarities. It then enables the loss function to ignore the conflict, allowing the encoder to adaptively learn the ignored information without self-supervised samples. Extensive experiments demonstrate the effectiveness of our method.



Paperid:1371
Authors:Dongxiao He, Shuwei Liu, Meng Ge, Zhizhi Yu, Guangquan Xu, Zhiyong Feng
Tianjin University, Tianjin University, National University of Singapore, Tianjin University, Tianjin University, Tianjin University
Abstract:
Graph Neural Networks (GNNs) have received widespread attention and application due to their excellent performance in graph representation learning. Most existing GNNs can only aggregate 1-hop neighbors in a GNN layer, so they usually stack multiple GNN layers to obtain more information from larger neighborhoods. However, many studies have shown that model performance degrades significantly as the number of GNN layers increases. In this paper, we first introduce the concept of distinguishability of class to indirectly evaluate the learned node representations, and verify the positive correlation between distinguishability of class and model performance. Then, we propose a Graph Neural Network guided by Distinguishability of class (Disc-GNN) to monitor representation learning, so as to learn better node representations and improve model performance. Specifically, we first perform inter-layer filtering and initial compensation based on Local Distinguishability of Class (LDC) in each layer, so that the learned node representations have the ability to distinguish different classes. Furthermore, we add a regularization term based on Global Distinguishability of Class (GDC) to achieve global optimization of model performance. Extensive experiments on six real-world datasets show the competitive performance of Disc-GNN compared to state-of-the-art methods on node classification and node clustering tasks.



Paperid:1372
Authors:Hongcai He, Anjie Zhu, Shuang Liang, Feiyu Chen, Jie Shao
University of Electronic Science and Technology of China, Chengdu, China, University of Electronic Science and Technology of China, Chengdu, China, University of Electronic Science and Technology of China, Chengdu, China, University of Electronic Science and Technology of China, Chengdu, China Sichuan Artificial Intelligence Research Institute, Yibin, China, University of Electronic Science and Technology of China, Chengdu, China Sichuan Artificial Intelligence Research Institute, Yibin, China
Abstract:
Offline meta-reinforcement learning (meta-RL) methods, which adapt to unseen target tasks with prior experience, are essential in robot control tasks. Current methods typically utilize task contexts and skills as prior experience, where task contexts are related to the information within each task and skills represent a set of temporally extended actions for solving subtasks. However, these methods still suffer from limited performance when adapting to unseen target tasks, mainly because the learned prior experience lacks generalization, i.e., they are unable to extract effective prior experience from meta-training tasks by exploration and learning of continuous latent spaces. We propose a framework called decoupled meta-reinforcement learning (DCMRL), which (1) contrastively restricts the learning of task contexts by pulling together similar task contexts within the same task and pushing away different task contexts of different tasks, and (2) utilizes a Gaussian quantization variational autoencoder (GQ-VAE) to cluster the Gaussian distributions of the task contexts and skills respectively, decoupling the exploration and learning processes of their spaces. The cluster centers, which serve as representative and discrete distributions of task contexts and skills, are stored in a task context codebook and a skill codebook, respectively. DCMRL can acquire generalizable prior experience and achieve effective adaptation to unseen target tasks during the meta-testing phase. Experiments on navigation and robot manipulation continuous control tasks show that DCMRL is more effective than previous meta-RL methods, with more generalizable prior experience.



Paperid:1373
Authors:Hongyi He, Longjun Liu, Haonan Zhang, Nanning Zheng
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Abstract:
Among existing Neural Architecture Search methods, DARTS is known for its efficiency and simplicity. This approach applies continuous relaxation of the network representation to construct a weight-sharing supernet and enables the identification of excellent subnets in just a few GPU days. However, performance collapse in DARTS yields deteriorating architectures filled with parameter-free operations and remains a great challenge to its robustness. To resolve this problem, we reveal through theoretical and experimental analysis that the fundamental reason is the biased estimation of candidate importance in the search space, and we select operations more precisely via information-based measurements. Furthermore, we demonstrate that excessive concern over the supernet and inefficient utilization of data in bi-level optimization also account for suboptimal results. We adopt a more realistic objective focusing on the performance of subnets and simplify it with the help of the information-based measurements. Finally, we explain theoretically why progressively shrinking the width of the supernet is necessary and reduce the approximation error of optimal weights in DARTS. Our proposed method, named IS-DARTS, comprehensively improves DARTS and resolves the aforementioned problems. Extensive experiments on NAS-Bench-201 and the DARTS-based search space demonstrate the effectiveness of IS-DARTS.



Paperid:1374
Authors:Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Tencent AI Lab, Tencent AI Lab, Tsinghua University, Institute of Automation, Chinese Academy of Sciences School of Future Technology, University of Chinese Academy of Sciences AiRiA
Abstract:
Multi-task reinforcement learning endeavors to accomplish a set of different tasks with a single policy. To enhance data efficiency by sharing parameters across multiple tasks, a common practice segments the network into distinct modules and trains a routing network to recombine these modules into task-specific policies. However, existing routing approaches employ a fixed number of modules for all tasks, neglecting that tasks with varying difficulty commonly require varying amounts of knowledge. This work presents a Dynamic Depth Routing (D2R) framework, which learns strategic skipping of certain intermediate modules, thereby flexibly choosing different numbers of modules for each task. Under this framework, we further introduce a ResRouting method to address the issue of disparate routing paths between behavior and target policies during off-policy training. In addition, we design an automatic route-balancing mechanism to encourage continued routing exploration for unmastered tasks without disturbing the routing of mastered ones. We conduct extensive experiments on various robotic manipulation tasks in the Meta-World benchmark, where D2R achieves state-of-the-art performance with significantly improved learning efficiency.



Paperid:1375
Authors:Jiujun He, Bin Liu, Guosheng Yin
Center of Statistical Research, School of Statistics, Southwestern University of Finance and Economics, Chengdu, China, Center of Statistical Research, School of Statistics, Southwestern University of Finance and Economics, Chengdu, China, Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong, China
Abstract:
Existing semi-supervised domain adaptation (SSDA) models have exhibited impressive performance on the target domain by effectively utilizing a few labeled target samples per class (e.g., 3 samples per class). To guarantee an equal number of labeled target samples for each class, however, they require domain experts to manually recognize a considerable amount of unlabeled target data. Moreover, as target samples are not equally informative for shaping the decision boundaries of the learning models, it is crucial to select the most informative target samples for labeling, which is, however, impossible for human selectors. As a remedy, we propose an EFfective Target Labeling (EFTL) framework that harnesses active learning and pseudo-labeling strategies to automatically select informative target samples to annotate. Concretely, we introduce a novel sample query strategy, called non-maximal degree node suppression (NDNS), that iteratively performs maximal degree node query and non-maximal degree node removal to select representative and diverse target samples for labeling. To learn target-specific characteristics, we propose a novel pseudo-labeling strategy that attempts to label low-confidence target samples accurately via clustering consistency (CC), and then inject information about the model uncertainty into our query process. CC enhances the utilization of the annotation budget and increases the number of “labeled” target samples while requiring no additional manual effort. Our proposed EFTL framework can be easily coupled with existing SSDA models, showing significant improvements on three benchmarks.



Paperid:1376
Authors:Liang He, Yunan Lu, Weiwei Li, Xiuyi Jia
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, China, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, China
Abstract:
Label distribution learning (LDL) is an effective learning paradigm for handling label ambiguity. Applying LDL typically requires datasets annotated with label distributions. However, obtaining supervised data for LDL is a challenging task: due to the randomness of label annotation, annotators can produce inaccurate annotations for instances, affecting the accuracy and generalization ability of the LDL model. To address this problem, we propose a generative approach to calibrate inaccurate annotations for LDL using variational inference techniques. Specifically, we assume that instances with similar features share similar latent label distributions. The feature vectors and label distributions are generated by a Gaussian mixture and a Dirichlet mixture, respectively. The relationship between them is established through a shared categorical variable, which effectively utilizes the label distributions of instances with similar features and achieves a more accurate label distribution through the generative approach. Furthermore, we use a confusion matrix to model the factors that contribute to inaccuracy during the annotation process, which captures the relationship between label distributions and inaccurate label distributions. Finally, the label distribution is used to calibrate the available information in the noisy dataset to obtain the ground-truth label distribution.



Paperid:1377
Authors:Rundong He, Yue Yuan, Zhongyi Han, Fan Wang, Wan Su, Yilong Yin, Tongliang Liu, Yongshun Gong
Shandong University, Shandong University, Mohamed bin Zayed University of Artificial Intelligence, Shandong University, Shandong University, Shandong University, The University of Sydney Mohamed bin Zayed University of Artificial Intelligence, Shandong University
Abstract:
Detecting out-of-distribution (OOD) data is essential to ensure the reliability of machine learning models when deployed in real-world scenarios. Different from most previous test-time OOD detection methods that focus on designing OOD scores, we delve into the challenges of OOD detection from the perspective of typicality and regard a feature's high-probability region as its typical set. However, the existing typical-feature-based OOD detection method implies an assumption: the proportion of the typical feature set is fixed for every channel. According to our experimental analysis, each channel contributes differently to OOD detection. Adopting a fixed proportion for all channels results in several channels losing too many typical features or incorporating too many abnormal features, resulting in low performance. Therefore, exploring channel-aware typical features is crucial to better separating ID and OOD data. Driven by this insight, we propose expLoring channel-Aware tyPical featureS (LAPS). First, LAPS obtains the channel-aware typical set by calibrating the channel-level typical set with the global typical set from the mean and standard deviation. Then, LAPS rectifies the features into the channel-aware typical sets to obtain channel-aware typical features. Finally, LAPS leverages the channel-aware typical features to calculate the energy score for OOD detection. Theoretical and visual analyses verify that LAPS achieves a better bias-variance trade-off. Experiments verify the effectiveness and generalization of LAPS under different architectures and OOD scores.
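The rectify-then-score pipeline can be sketched as below, with per-channel intervals built from feature means and standard deviations; the interval widths, the stand-in features, and the omission of LAPS's channel/global calibration step are assumptions.
```python
# Sketch under stated assumptions: clamp penultimate features into per-channel
# "typical" intervals from training statistics, then score with the energy function.
import torch

def channel_typical_bounds(train_features, widths):
    """Per-channel interval [mu - k*sigma, mu + k*sigma]; widths may differ per channel."""
    mu = train_features.mean(dim=0)
    sigma = train_features.std(dim=0)
    return mu - widths * sigma, mu + widths * sigma

def energy_score(features, classifier, low, high, temperature=1.0):
    """Rectify features channel-wise, then compute the energy-based OOD score."""
    rectified = torch.minimum(torch.maximum(features, low), high)
    logits = classifier(rectified)
    return temperature * torch.logsumexp(logits / temperature, dim=-1)  # higher -> more in-distribution

train_feats = torch.randn(1000, 512).abs()          # stand-in for penultimate-layer activations
classifier = torch.nn.Linear(512, 10)
low, high = channel_typical_bounds(train_feats, widths=torch.full((512,), 1.5))
scores = energy_score(torch.randn(8, 512).abs(), classifier, low, high)
print(scores)
```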



Paperid:1378
Authors:Yu-Cheng He, Yao-Xiang Ding, Han-Jia Ye, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, China, State Key Laboratory of CAD & CG, Zhejiang University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China
Abstract:
Most current long-tailed classification approaches assume the cost-agnostic scenario, where the training distribution of classes is long-tailed while the testing distribution of classes is balanced, and the misclassification costs of all instances are the same. On the other hand, in many real-world applications it is more proper to assume that the training and testing distributions of classes are the same, while the misclassification cost of tail-class instances varies. In this work, we model such a scenario as cost-aware long-tailed classification, in which identifying high-cost tail instances and subsequently focusing learning on them is essential. Consequently, we propose the learning strategy of augmenting new instances based on adaptive region partition in the feature space. Our theoretical analysis shows that, under the assumption that the feature-space distance and the misclassification cost are correlated, high-cost tail instances can be identified by building region partitions with a low variance of risk within each region. The resulting AugARP approach significantly outperforms baseline approaches on both benchmark datasets and real-world product sales datasets.



Paperid:1379
Authors:Yuhang He, Zhuangzhuang Dai, Niki Trigoni, Long Chen, Andrew Markham
University of Oxford, UK, Aston University, UK, University of Oxford, UK, Institute of Automation, Chinese Academy of Science, UK, University of Oxford
Abstract:
In this paper, we study an under-explored yet important and challenging problem: counting the number of distinct sounds in raw audio characterized by a high degree of polyphonicity. We do so by systematically proposing a novel end-to-end trainable neural network (which we call DyDecNet, consisting of a dyadic decomposition front-end and a backbone network), and by quantifying the difficulty level of counting as a function of sound polyphonicity. The dyadic decomposition front-end progressively decomposes the raw waveform dyadically along the frequency axis to obtain a time-frequency representation in a multi-stage, coarse-to-fine manner. Each intermediate waveform convolved by a parent filter is further processed by a pair of child filters that evenly split the parent filter's frequency response, with the higher-half child filter encoding the detail and the lower-half child filter encoding the approximation. We further introduce an energy gain normalization to normalize sound loudness variance and spectrum overlap, and apply it to each intermediate parent waveform before feeding it to the two child filters. To better quantify the sound counting difficulty level, we further design three polyphony-aware metrics: polyphony ratio, max polyphony and mean polyphony. We test DyDecNet on various datasets to show its superiority, and we further show that the dyadic decomposition network can be used as a general front-end to tackle other acoustic tasks.



Paperid:1380
Authors:Zhimin He, Maijie Deng, Shenggen Zheng, Lvzhou Li, Haozhen Situ
School of Electronic and Information Engineering, Foshan University, School of Mechatronic Engineering and Automation, Foshan University, Peng Cheng Laboratory, Institute of Quantum Computing and Computer Theory, School of Computer Science and Engineering, Sun Yat-Sen University, College of Mathematics and Informatics, South China Agricultural University
Abstract:
Variational quantum algorithm (VQA) derives advantages from its error resilience and high flexibility in quantum resource requirements, rendering it broadly applicable in the noisy intermediate-scale quantum era. As the performance of a VQA highly relies on the structure of the parameterized quantum circuit, it is worthwhile to propose quantum architecture search (QAS) algorithms to automatically search for high-performance circuits. Nevertheless, existing QAS methods are time-consuming, requiring circuit training to assess circuit performance. This study pioneers training-free QAS by utilizing two training-free proxies to rank quantum circuits, in place of the expensive circuit training employed in conventional QAS. Taking into account the precision and computational overhead of the path-based and expressibility-based proxies, we devise a two-stage progressive training-free QAS (TF-QAS). Initially, directed acyclic graphs (DAGs) are employed for circuit representation, and a zero-cost proxy based on the number of paths in the DAG is designed to filter out a substantial portion of unpromising circuits. Subsequently, an expressibility-based proxy, which finely reflects circuit performance, is employed to identify high-performance circuits from the remaining candidates. These proxies evaluate circuit performance without circuit training, resulting in a remarkable reduction in computational cost compared to current training-based QAS methods. Simulations on three VQE tasks demonstrate that TF-QAS achieves a substantial enhancement of sampling efficiency, ranging from 5 to 57 times, compared to state-of-the-art QAS, while also being 6 to 17 times faster.
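The path-counting proxy reduces to a standard topological-order dynamic program on the circuit DAG; the toy edge list below and the source/sink convention are illustrative assumptions, not the paper's exact DAG construction for quantum circuits.
```python
# Sketch of a path-counting zero-cost proxy on a circuit DAG (Kahn's topological order).
from collections import defaultdict

def count_paths(edges, sources, sinks):
    """Count distinct source-to-sink paths in a DAG given as a list of (u, v) edges."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in edges:
        succ[u].append(v); indeg[v] += 1; nodes.update((u, v))
    paths = {n: (1 if n in sources else 0) for n in nodes}
    frontier = [n for n in nodes if indeg[n] == 0]
    while frontier:
        u = frontier.pop()
        for v in succ[u]:
            paths[v] += paths[u]          # accumulate path counts along topological order
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    return sum(paths[s] for s in sinks)

# Hypothetical gate DAG: input node, two single-qubit gates, one entangling gate, output node.
edges = [("in", "h0"), ("in", "rx0"), ("h0", "cx01"), ("rx0", "cx01"), ("cx01", "out")]
print(count_paths(edges, sources={"in"}, sinks={"out"}))   # proxy value: 2
```
Circuits whose DAGs admit more paths are ranked higher by the filter, and only the survivors are passed to the costlier expressibility-based proxy.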



Paperid:1381
Authors:Huy Hoang, Tien Mai, Pradeep Varakantham
Singapore Management University, Singapore Management University, Singapore Management University
Abstract:
A popular framework for enforcing safe actions in Reinforcement Learning (RL) is Constrained RL, where trajectory-based constraints on expected cost (or other cost measures) are employed to enforce safety while maximizing expected reward. Most recent approaches for solving Constrained RL convert the trajectory-based cost constraint into a surrogate problem that can be solved using minor modifications to RL methods. A key drawback of such approaches is the over- or under-estimation of the cost constraint at each state. Therefore, we provide an approach that does not modify the trajectory-based cost constraint and instead imitates "good" trajectories and avoids "bad" trajectories generated from incrementally improving policies. We employ an oracle that utilizes a reward threshold (which is varied with learning) and the overall cost constraint to label trajectories as "good" or "bad". A key advantage of our approach is that we can start from any policy or set of trajectories and improve on it. In an exhaustive set of experiments, we demonstrate that our approach outperforms top benchmark approaches for solving Constrained RL problems with respect to expected cost, CVaR cost, or even unknown cost constraints.



Paperid:1382
Authors:Minh Hoang, Trong Nghia Hoang
Princeton University, Washington State University
Abstract:
This paper investigates the problem of exploiting existing solution models of previous tasks to address a related target task with limited training data. Existing approaches to this problem often require access to the internal parameterization of the existing solution models and possibly their training data, which is not possible in many practical settings. To relax this requirement, we approach this problem from the new perspective of black-box re-purposing, which augments the target inputs and leverages their corresponding outputs generated by existing black-box APIs into a feature ensemble. We hypothesize that such a feature ensemble can be learned to incorporate and encode relevant black-box knowledge into the feature representation of target data, which compensates for its scarcity. This hypothesis is confirmed via the reported successes of our proposed black-box ensemble in solving multiple few-shot learning tasks derived from various benchmark datasets. All reported results consistently show that the set of heterogeneous black-box solutions of previous tasks can indeed be reused and combined effectively to solve a reasonably related target task without requiring access to a large training dataset. This is a first step towards enabling new possibilities to further supplement existing techniques in transfer or meta learning with black-box knowledge.



Paperid:1383
Authors:Van Thuy Hoang, O-Joun Lee
Department of Artificial Intelligence, The Catholic University of Korea, Department of Artificial Intelligence, The Catholic University of Korea
Abstract:
Graph representation learning (GRL) methods, such as graph neural networks and graph transformer models, have been successfully used to analyze graph-structured data, mainly focusing on node classification and link prediction tasks. However, existing studies mostly consider only local connectivity while ignoring long-range connectivity and the roles of nodes. In this paper, we propose Unified Graph Transformer Networks (UGT) that effectively integrate local and global structural information into fixed-length vector representations. First, UGT learns local structure by identifying local sub-structures and aggregating features of the k-hop neighborhoods of each node. Second, we construct virtual edges, bridging distant nodes with structural similarity to capture long-range dependencies. Third, UGT learns unified representations through self-attention, encoding structural distance and p-step transition probabilities between node pairs. Furthermore, we propose a self-supervised learning task that effectively learns transition probabilities to fuse local and global structural features, which can then be transferred to other downstream tasks. Experimental results on real-world benchmark datasets over various downstream tasks show that UGT significantly outperforms baselines that consist of state-of-the-art models. In addition, UGT reaches the power of the third-order Weisfeiler-Lehman test in distinguishing non-isomorphic graph pairs.



Paperid:1384
Authors:Jakob Hollenstein, Georg Martius, Justus Piater
University of Innsbruck, University of Tübingen Max Planck Institute for Intelligent Systems, University of Innsbruck
Abstract:
Proximal Policy Optimization (PPO), a popular on-policy deep reinforcement learning method, employs a stochastic policy for exploration. In this paper, we propose a colored noise-based stochastic policy variant of PPO. Previous research highlighted the importance of temporal correlation in action noise for effective exploration in off-policy reinforcement learning. Building on this, we investigate whether correlated noise can also enhance exploration in on-policy methods like PPO. We find that correlated noise for action selection improves learning performance and outperforms the currently popular uncorrelated white noise approach in on-policy methods. Unlike off-policy learning, where pink noise was found to be highly effective, we found that a colored noise intermediate between white and pink performs best for on-policy learning in PPO. We examined the impact of varying the amount of data collected for each update by modifying the number of parallel simulation environments for data collection, and observed that with a larger number of parallel environments, more strongly correlated noise is beneficial. Due to the significant impact and ease of implementation, we recommend switching to correlated noise as the default noise source in PPO.
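Colored action noise can be generated by spectral shaping, as in the hedged sketch below; the 1/f^β parameterization and the particular β value are illustrative assumptions (β = 0 gives white noise, β = 1 gives pink, and the abstract reports that an intermediate value works best for PPO).
```python
# Sketch: temporally correlated (colored) Gaussian noise for action exploration,
# produced by shaping the power spectrum of white noise as 1/f^beta.
import numpy as np

def colored_noise(beta, steps, dim, rng):
    """Return a (steps, dim) noise sequence with spectral density ~ 1/f^beta per dimension."""
    freqs = np.fft.rfftfreq(steps)
    scale = np.ones_like(freqs)
    scale[1:] = freqs[1:] ** (-beta / 2.0)               # shape amplitudes, skip the DC term
    spectrum = scale[:, None] * (rng.normal(size=(len(freqs), dim))
                                 + 1j * rng.normal(size=(len(freqs), dim)))
    noise = np.fft.irfft(spectrum, n=steps, axis=0)
    return (noise - noise.mean(axis=0)) / noise.std(axis=0)   # unit variance per action dimension

rng = np.random.default_rng(0)
eps = colored_noise(beta=0.5, steps=256, dim=4, rng=rng)       # intermediate between white and pink
# During a rollout, actions could be sampled as mean + sigma * eps[t] instead of i.i.d. white noise.
print(eps.shape)
```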



Paperid:1385
Authors:Hyeong Gwon Hong, Yooshin Cho, Hanbyel Cho, Jaesung Ahn, Junmo Kim
KAIST, KAIST, KAIST, KAIST, KAIST
Abstract:
Gradient inversion attacks can leak data privacy when clients share weight updates with the server in federated learning (FL). Existing studies mainly use the L2 or cosine distance as the loss function for gradient matching in the attack. Our empirical investigation shows that the vulnerability ranking varies with the loss function used. The gradient norm, which is commonly used as a vulnerability proxy for gradient inversion attacks, cannot explain this, as it remains constant regardless of the loss function used for gradient matching. In this paper, we propose a loss-aware vulnerability proxy (LAVP) for the first time. LAVP refers to either the maximum or minimum eigenvalue of the Hessian of the gradient matching loss at the ground truth. This proposal is based on our theoretical findings regarding the local optimization of gradient inversion in proximity to the ground truth, which corresponds to the worst-case attack scenario. We demonstrate the effectiveness of LAVP on various architectures and datasets, showing its consistent superiority over the gradient norm in capturing sample vulnerabilities. The performance of each proxy is measured in terms of Spearman's rank correlation with respect to several similarity scores. This work will contribute to enhancing FL security against potential loss functions beyond the L2 or cosine distance in the future.



Paperid:1386
Authors:Snir Hordan, Tal Amir, Steven J. Gortler, Nadav Dym
Technion, Technion, Harvard University, Technion
Abstract:
Neural networks for point clouds, which respect their natural invariance to permutation and rigid motion, have enjoyed recent success in modeling geometric phenomena, from molecular dynamics to recommender systems. Yet, to date, no architecture with polynomial complexity is known to be complete, that is, able to distinguish between any pair of non-isomorphic point clouds. We fill this theoretical gap by showing that point clouds can be completely determined, up to permutation and rigid motion, by applying the 3-WL graph isomorphism test to the point cloud's centralized Gram matrix. Moreover, we formulate a Euclidean variant of the 2-WL test and show that it is also sufficient to achieve completeness. We then show how our complete Euclidean WL tests can be simulated by a Euclidean graph neural network of moderate size and demonstrate their separation capability on highly symmetrical point clouds.
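The key object here is easy to compute: the centralized Gram matrix is unchanged by translations and orthogonal transforms, and permuting the points only permutes its rows and columns, which is what lets a WL-style test operate on it. A small numerical check (array shapes and the random transform are illustrative):
```python
# Sketch: the centralized Gram matrix of a point cloud is invariant to translation
# and to orthogonal transforms; permuting points only permutes rows and columns.
import numpy as np

def centralized_gram(points):
    """points: (n, 3) array. Returns the (n, n) Gram matrix of centered coordinates."""
    centered = points - points.mean(axis=0, keepdims=True)
    return centered @ centered.T

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))                  # random orthogonal transform
Y = (X + rng.normal(size=(1, 3))) @ Q.T                       # translate, then transform rigidly
print(np.allclose(centralized_gram(X), centralized_gram(Y)))  # True: the Gram matrix is unchanged
```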



Paperid:1387
Authors:Dou Hu, Lingwei Wei, Yaxin Liu, Wei Zhou, Songlin Hu
Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences
Abstract:
This paper presents a new supervised representation learning framework, namely structured probabilistic coding (SPC), to learn compact and informative representations from input related to the target task. SPC is an encoder-only probabilistic coding technique with a structured regularization from the target space. It can enhance the generalization ability of pre-trained language models for better language understanding. Specifically, our probabilistic coding simultaneously performs information encoding and task prediction in one module to more fully utilize the effective information from the input data. It uses variational inference in the output space to reduce randomness and uncertainty. Besides, to better control the learning process of probabilistic representations, a structured regularization is proposed to promote uniformity across classes in the latent space. With the regularization term, SPC can preserve the Gaussian structure of the latent code and achieve better, class-uniform coverage of the hidden space. Experimental results on 12 natural language understanding tasks demonstrate that SPC effectively improves the performance of pre-trained language models for classification and regression. Extensive experiments show that SPC can enhance the generalization capability, robustness to label noise, and clustering quality of output representations.



Paperid:1388
Authors:Francois Hu, Philipp Ratz, Arthur Charpentier
Université de Montréal, Université du Québec à Montréal, Université du Québec à Montréal
Abstract:
In the standard use case of Algorithmic Fairness, the goal is to eliminate the relationship between a sensitive variable and a corresponding score. Throughout recent years, the scientific community has developed a host of definitions and tools to solve this task, which work well in many practical applications. However, the applicability and effectiveness of these tools and definitions becomes less straightforward in the case of multiple sensitive attributes. To tackle this issue, we propose a sequential framework which progressively achieves fairness across a set of sensitive features. We accomplish this by leveraging multi-marginal Wasserstein barycenters, which extends the standard notion of Strong Demographic Parity to the case with multiple sensitive characteristics. This method also provides a closed-form solution for the optimal, sequentially fair predictor, permitting a clear interpretation of inter-sensitive feature correlations. Our approach seamlessly extends to approximate fairness, enveloping a framework accommodating the trade-off between risk and unfairness. This extension permits a targeted prioritization of fairness improvements for a specific attribute within a set of sensitive attributes, allowing for case-specific adaptation. A data-driven estimation procedure for the derived solution is developed, and comprehensive numerical experiments are conducted on both synthetic and real datasets. Our empirical findings decisively underscore the practical efficacy of our post-processing approach in fostering fair decision-making.



Paperid:1389
Authors:Jian Hu, Jiayi Lin, Shaogang Gong, Weitong Cai
Queen Mary University of London, Queen Mary University of London, Queen Mary University of London, Queen Mary University of London
Abstract:
Camouflaged object detection (COD) approaches heavily rely on pixel-level annotated datasets. Weakly-supervised COD (WSCOD) approaches use sparse annotations like scribbles or points to reduce annotation efforts, but this can lead to decreased accuracy. The Segment Anything Model (SAM) shows remarkable segmentation ability with sparse prompts like points. However, manual prompting is not always feasible, as it may not be accessible in real-world applications. Additionally, it only provides localization information rather than semantic information, which can intrinsically cause ambiguity in interpreting targets. In this work, we aim to eliminate the need for manual prompts. The key idea is to employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts using the semantic information given by a generic text prompt. To that end, we introduce a test-time instance-wise adaptation mechanism called Generalizable SAM (GenSAM) to automatically generate and optimize visual prompts from the generic task prompt for WSCOD. In particular, CCTP maps a single generic text prompt onto image-specific consensus foreground and background heatmaps using vision-language models, acquiring reliable visual prompts. Moreover, to test-time adapt the visual prompts, we further propose Progressive Mask Generation (PMG) to iteratively reweight the input image, guiding the model to focus on the targeted region in a coarse-to-fine manner. Crucially, all network parameters are fixed, avoiding the need for additional training. Experiments on three benchmarks demonstrate that GenSAM outperforms point supervision approaches and achieves comparable results to scribble supervision ones, solely relying on general task descriptions. Our code is available at https://github.com/jyLin8100/GenSAM.



Paperid:1390
Authors:Kun Hu, Wenjing Yang, Wanrong Huang, Xianchen Zhou, Mingyu Cao, Jing Ren, Huibin Tan
College of Computer Science and Technology, National University of Defense Technology, College of Computer Science and Technology, National University of Defense Technology, College of Computer Science and Technology, National University of Defense Technology, College of Sciences, National University of Defense Technology, College of Computer Science and Technology, National University of Defense Technology, College of Computer Science and Technology, National University of Defense Technology, College of Computer Science and Technology, National University of Defense Technology
Abstract:
Regarded as a template-matching task for a long time, visual object tracking has witnessed significant progress in space-wise exploration. However, since tracking is performed on videos with substantial time-wise information, it is important to simultaneously mine the temporal contexts, which have not yet been deeply explored. Previous supervised works mostly consider template reform as the breakthrough point, but they are often limited by additional computational burdens or the quality of chosen templates. To address this issue, we propose a Space-Time Consistent Transformer Tracker (STCFormer), which uses a sequential fusion framework with multi-granularity consistency constraints to learn spatiotemporal context information. We design a sequential fusion framework that recombines template and search images based on tracking results from chronological frames, fusing updated tracking states in training. To further overcome the over-reliance on the fixed template without increasing computational complexity, we design three space-time consistent constraints: Label Consistency Loss (LCL) for label-level consistency, Attention Consistency Loss (ACL) for patch-level ROI consistency, and Semantic Consistency Loss (SCL) for feature-level semantic consistency. Specifically, in ACL and SCL, the label information is used to constrain the attention and feature consistency of the target and the background, respectively, to avoid mutual interference. Extensive experiments have shown that our STCFormer outperforms many of the best-performing trackers on several popular benchmarks.



Paperid:1391
Authors:Ming Hu, Yue Cao, Anran Li, Zhiming Li, Chengwei Liu, Tianlin Li, Mingsong Chen, Yang Liu
Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, East China Normal University, Nanyang Technological University
Abstract:
Although Federated Learning (FL) enables collaborative model training without sharing the raw data of clients, it encounters low-performance problems caused by various heterogeneous scenarios. Due to the limitation of dispatching the same global model to clients for local training, traditional Federated Average (FedAvg)-based FL models easily get stuck in a sharp solution, which results in training a low-performance global model. To address this problem, this paper presents a novel FL approach named FedMut, which mutates the global model according to the gradient change to generate several intermediate models for the next round of training. Each intermediate model will be dispatched to a client for local training. Eventually, the global model converges into a flat area within the range of mutated models and generalizes better than the global model trained by FedAvg. Experimental results on well-known datasets demonstrate the effectiveness of our FedMut approach in various data heterogeneity scenarios.
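
A short Python sketch of one plausible reading of the mutation step described above: perturb the global weights along plus/minus the most recent global update so that each client trains a slightly different intermediate model. The mutation directions, the scaling factor alpha, and the alternating-sign scheme are assumptions for illustration only.

    import copy
    import torch

    def mutate_global_model(global_model, prev_global_model, num_clients, alpha=1.0):
        """Generate mutated intermediate models from the latest global update (sketch)."""
        g_sd, p_sd = global_model.state_dict(), prev_global_model.state_dict()
        delta = {k: g_sd[k] - p_sd[k] for k in g_sd}   # proxy for the gradient change
        mutated = []
        for i in range(num_clients):
            m = copy.deepcopy(global_model)
            sd = m.state_dict()
            sign = 1.0 if i % 2 == 0 else -1.0         # alternate mutation directions
            for k in sd:
                sd[k] = sd[k] + (sign * alpha * delta[k]).to(sd[k].dtype)
            m.load_state_dict(sd)
            mutated.append(m)
        return mutated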



Paperid:1392
Authors:Siyuan Hu, Zheng Wang, Peng Hu, Xi Peng, Jie Wu, Hongyuan Zhu, Yew Soon Ong
Nanyang Technological University, Wuhan University, Sichuan University, Sichuan University, Wuhan University, Institute for Infocomm Research (I2R) & Centre for Frontier AI Research (CFAR), A*STAR, Singapore, Nanyang Technological University Institute for Infocomm Research (I2R) & Centre for Frontier AI Research (CFAR), A*STAR, Singapore
Abstract:
Video-based facial analysis is important for autonomous agents to understand human expressions and sentiments. However, limited labeled data is available to learn effective facial representations. This paper proposes a novel self-supervised face-centric pretraining framework, called PrefAce, which learns transferable video facial representation without labels. The self-supervised learning is performed with an effective landmark-guided global-local tube distillation. Meanwhile, a novel instance-wise updated FaceFeat Cache is built to enforce more discriminative and diverse representations for downstream tasks. Extensive experiments demonstrate that the proposed framework learns universal instance-aware facial representations with fine-grained landmark details from videos. Importantly, it can transfer across various facial analysis tasks, e.g., Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our framework also outperforms the state-of-the-art on various downstream tasks, even in low data regimes. Code is available at https://github.com/siyuan-h/PrefAce.



Paperid:1393
Authors:Tianmeng Hu, Biao Luo
School of Automation, Central South University, Changsha 410083, China, School of Automation, Central South University, Changsha 410083, China
Abstract:
Multi-objective reinforcement learning (MORL) provides an effective solution for decision-making problems involving conflicting objectives. However, achieving high-quality approximations to the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action space. In this paper, we propose the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which constructs an efficient scheme for multi-objective problem decomposition and policy improvement, leading to a superior approximation of the Pareto policy set. The proposed method leverages the Pareto ascent direction to select the scalarization weights and computes the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. Additionally, a Pareto adaptive fine-tuning approach is applied to enhance the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes.



Paperid:1394
Authors:Wenbo Hu, Hongjian Zhan, Xinchen Ma, Yue Lu, Ching Y. Suen
Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University Centre for Pattern Recognition and Machine Intelligence, Concordia University, Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Centre for Pattern Recognition and Machine Intelligence, Concordia University
Abstract:
Humans often require only a few visual archetypes to spot novel objects. Based on this observation, we present a strategy rooted in ``spotting the unseen" by establishing dense correspondences between potential query image regions and a visual archetype, and we propose the Consensus Network (CoNet). Our method leverages relational patterns within and across images via the Auto-Correlation Representation (ACR) and Mutual-Correlation Representation (MCR). Within each image, the ACR module is capable of encoding both local self-similarity and global context simultaneously. Between the query and support images, the MCR module computes the cross-correlation across two image representations and introduces a reciprocal consistency constraint, which helps exclude outliers and enhances model robustness. To overcome the challenges of low-resource training data, particularly in one-shot learning scenarios, we incorporate an adaptive margin strategy to better handle diverse instances. The experimental results indicate the effectiveness of the proposed method across diverse domains such as object detection in natural scenes, and text spotting in both historical manuscripts and natural scenes, which demonstrates its strong generalization ability. Our code is available at: https://github.com/infinite-hwb/conet.



Paperid:1395
Authors:Zexin Hu, Kun Hu, Clinton Mo, Lei Pan, Zhiyong Wang
The University of Sydney, The University of Sydney, The University of Sydney, Civil Aviation Flight University of China, The University of Sydney
Abstract:
Sketch-based terrain generation seeks to create realistic landscapes for virtual environments in various applications such as computer games, animation and virtual reality. Recently, deep learning based terrain generation has emerged, notably the ones based on generative adversarial networks (GAN). However, these methods often struggle to fulfill the requirements of flexible user control and maintain generative diversity for realistic terrain. Therefore, we propose a novel diffusion-based method, namely terrain diffusion network (TDN), which actively incorporates user guidance for enhanced controllability, taking into account terrain features like rivers, ridges, basins, and peaks. Instead of adhering to a conventional monolithic denoising process, which often compromises the fidelity of terrain details or the alignment with user control, a multi-level denoising scheme is proposed to generate more realistic terrains by taking into account fine-grained details, particularly those related to climatic patterns influenced by erosion and tectonic activities. Specifically, three terrain synthesisers are designed for structural, intermediate, and fine-grained level denoising purposes, which allows each synthesiser to concentrate on a distinct terrain aspect. Moreover, to maximise the efficiency of our TDN, we further introduce terrain and sketch latent spaces for the synthesisers with pre-trained terrain autoencoders. Comprehensive experiments on a new dataset constructed from NASA Topology Images clearly demonstrate the effectiveness of our proposed method, achieving state-of-the-art performance. Our code is available at https://github.com/TDNResearch/TDN.



Paperid:1396
Authors:Hao Huang, Tapan Shah, Scott Evans, Shinjae Yoo
GE Vernova Research, GE Vernova Research, GE Vernova Research, Brookhaven National Laboratory
Abstract:
Efficiently processing time series data streams in real-time on resource-constrained devices offers significant advantages in terms of enhanced computational energy efficiency and reduced time-related risks. We introduce an innovative streaming time series classification network that utilizes attentive power iteration, enabling real-time processing on resource-constrained devices. Our model continuously updates a compact representation of the entire time series, enhancing classification accuracy while conserving energy and processing time. Notably, it excels in streaming scenarios without requiring complete time series access, enabling swift decisions. Experimental results show that our approach excels in classification accuracy and energy efficiency, with over 70% less energy consumption and threefold faster task completion than benchmarks. This work advances real-time responsiveness, energy conservation, and operational effectiveness for constrained devices, contributing to optimizing various applications.



Paperid:1397
Authors:Junyu Huang, Qilong Feng, Jiahui Wang, Ziyun Huang, Jinhui Xu, Jianxin Wang
Central South University, Changsha, China Xiangjiang Laboratory, Changsha, China, Central South University, Changsha, China Xiangjiang Laboratory, Changsha, China, Central South University, Changsha, China Xiangjiang Laboratory, Changsha, China, Penn State Erie, The Behrend College, State University of New York at Buffalo, NY, USA, Central South University, Changsha, China Xiangjiang Laboratory, Changsha, China The Hunan Provincial Key Lab of Bioinformatics, Central South University, Changsha, China
Abstract:
As one of the most popular machine learning tools in the field of unsupervised learning, clustering has been widely used in various practical applications. While numerous methods have been proposed for clustering, a commonly encountered issue is that the existing clustering methods rely heavily on local neighborhood information during the optimization process, which leads to suboptimal performance on real-world datasets. Besides, most existing clustering methods use Euclidean distances or densities to measure the similarity between data points. This could constrain the effectiveness of the algorithms for handling datasets with irregular patterns. Thus, a key challenge is how to effectively capture the global structural information in clustering instances to improve the clustering quality. In this paper, we propose a new clustering algorithm, called SEC. This algorithm uses the global structural information extracted from an encoding tree to guide the clustering optimization process. Based on the relation between data points in the instance, a sparse graph of the clustering instance can be constructed. By leveraging the sparse graph constructed, we propose an iterative encoding tree method, where hierarchical abstractions of the encoding tree are iteratively extracted as new clustering features to obtain better clustering results. To avoid the influence of easily misclustered data points located on the boundaries of the clustering partitions, which we call "fringe points", we propose an iterative pre-deletion and reassignment technique such that the algorithm can delete and reassign the "fringe points" to obtain more resilient and precise clustering results. Empirical experiments on both synthetic and real-world datasets demonstrate that our proposed algorithm outperforms state-of-the-art clustering methods and achieves better clustering performance. On average, the clustering accuracy (ACC) is increased by 1.7% and the normalized mutual information (NMI) by 7.9% compared with the current state-of-the-art (SOTA) algorithm on synthetic datasets. On real-world datasets, our method outperforms other clustering methods with an average increase of 12.3% in ACC and 5.2% in NMI, respectively.



Paperid:1398
Authors:Libo Huang, Yan Zeng, Chuanguang Yang, Zhulin An, Boyu Diao, Yongjun Xu
Institute of Computing Technology, Chinese Academy of Sciences, School of Mathematics and Statistics, Beijing Technology and Business University, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences
Abstract:
Class incremental learning (CIL) aims to solve the notorious forgetting problem, which refers to the fact that once the network is updated on a new task, its performance on previously learned tasks degenerates catastrophically. Most successful CIL methods store exemplars (samples of learned tasks) to train a feature extractor incrementally, or store prototypes (features of learned tasks) to estimate the incremental feature distribution. However, the stored exemplars raise data privacy concerns, while the fixed prototypes might not be consistent with the incremental feature distribution, hindering the exploration of real-world CIL applications. In this paper, we propose a data-free CIL method with embedding distillation and Task-oriented generation (eTag), which requires neither exemplar nor prototype. Embedding distillation prevents the feature extractor from forgetting by distilling the outputs from the networks' intermediate blocks. Task-oriented generation enables a lightweight generator to produce dynamic features, fitting the needs of the top incremental classifier. Experimental results confirm that the proposed eTag considerably outperforms state-of-the-art methods on several benchmark datasets.



Paperid:1399
Authors:Nai-Chieh Huang, Ping-Chun Hsieh, Kuo-Hao Ho, I-Chen Wu
National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University
Abstract:
The Proximal Policy Optimization algorithm employing a clipped surrogate objective (PPO-Clip) is a prominent exemplar of the policy optimization methods. However, despite its remarkable empirical success, PPO-Clip lacks theoretical substantiation to date. In this paper, we contribute to the field by establishing the first global convergence results of a PPO-Clip variant in both tabular and neural function approximation settings. Our findings highlight the O(1/√T) min-iterate convergence rate specifically in the context of neural function approximation. We tackle the inherent challenges in analyzing PPO-Clip through three central concepts: (i) We introduce a generalized version of the PPO-Clip objective, illuminated by its connection with the hinge loss. (ii) Employing entropic mirror descent, we establish asymptotic convergence for tabular PPO-Clip with direct policy parameterization. (iii) Inspired by the tabular analysis, we streamline convergence analysis by introducing a two-step policy improvement approach. This decouples policy search from complex neural policy parameterization using a regression-based update scheme. Furthermore, we gain deeper insights into the efficacy of PPO-Clip by interpreting these generalized objectives. Our theoretical findings also mark the first characterization of the influence of the clipping mechanism on PPO-Clip convergence. Importantly, the clipping range affects only the pre-constant of the convergence rate.



Paperid:1400
Authors:Qihe Huang, Lei Shen, Ruixin Zhang, Jiahuan Cheng, Shouhong Ding, Zhengyang Zhou, Yang Wang
University of Science and Technology of China (USTC) Youtu Laboratory, Tencent, Youtu Laboratory, Tencent, Youtu Laboratory, Tencent, Youtu Laboratory, Tencent Johns Hopkins University, Youtu Laboratory, Tencent, University of Science and Technology of China (USTC) Suzhou Institute for Advanced Research of USTC State Key Laboratory of Resources and Environmental Information System, University of Science and Technology of China (USTC) Suzhou Institute for Advanced Research of USTC
Abstract:
Multivariate time series (MTS) prediction has been widely adopted in various scenarios. Recently, some methods have employed patching to enhance local semantics and improve model performance. However, length-fixed patches are prone to losing temporal boundary information, such as complete peaks and periods. Moreover, existing methods mainly focus on modeling long-term dependencies across patches, while paying little attention to other dimensions (e.g., short-term dependencies within patches and complex interactions among cross-variable patches). To address these challenges, we propose a pure MLP-based HDMixer, aiming to acquire patches with richer semantic information and to efficiently model hierarchical interactions. Specifically, we design a Length-Extendable Patcher (LEP) tailored to MTS, which enriches the boundary information of patches and alleviates semantic incoherence in series. Subsequently, we devise a Hierarchical Dependency Explorer (HDE) based on pure MLPs. This explorer effectively models short-term dependencies within patches, long-term dependencies across patches, and complex interactions among variables. Extensive experiments on 9 real-world datasets demonstrate the superiority of our approach. The code is available at https://github.com/hqh0728/HDMixer.



Paperid:1401
Authors:Renhong Huang, Jiarong Xu, Xin Jiang, Chenglu Pan, Zhiming Yang, Chunping Wang, Yang Yang
Zhejiang University Fudan University, Fudan University, Lehigh University, Zhejiang University, Fudan University, FinVolution, Zhejiang University
Abstract:
The paradigm of pre-training and fine-tuning graph neural networks has attracted wide research attention. In previous studies, the pre-trained models are viewed as universally versatile, and applied for a diverse range of downstream tasks. In many situations, however, this practice results in limited or even negative transfer. This paper, for the first time, emphasizes the specific application scope of graph pre-trained models: not all downstream tasks can effectively benefit from a graph pre-trained model. In light of this, we introduce a measure, task consistency, to quantify the similarity between graph pre-training and downstream tasks. This measure assesses the extent to which downstream tasks can benefit from specific pre-training tasks. Moreover, a novel fine-tuning strategy, Bridge-Tune, is proposed to further diminish the impact of the difference between pre-training and downstream tasks. The key innovation in Bridge-Tune is an intermediate step that bridges pre-training and downstream tasks. This step takes into account the task differences and further refines the pre-trained model. The superiority of the presented fine-tuning strategy is validated via numerous experiments with different pre-trained models and downstream tasks.



Paperid:1402
Authors:Rundong Huang, Farhad Shirani, Dongsheng Luo
Technical University of Munich, Munich, Germany, Florida International University, Miami, U.S., Florida International University, Miami, U.S.
Abstract:
Graph Neural Networks (GNNs) have received increasing attention due to their ability to learn from graph-structured data. To open the black-box of these deep learning models, post-hoc instance-level explanation methods have been proposed to understand GNN predictions. These methods seek to discover substructures that explain the prediction behavior of a trained GNN. In this paper, we show analytically that for a large class of explanation tasks, conventional approaches, which are based on the principle of graph information bottleneck (GIB), admit trivial solutions that do not align with the notion of explainability. Instead, we argue that a modified GIB principle may be used to avoid the aforementioned trivial solutions. We further introduce a novel factorized explanation model with theoretical performance guarantees. The modified GIB is used to analyze the structural properties of the proposed factorized explainer. We conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness of our proposed factorized explainer.



Paperid:1403
Authors:Xiaobin Huang, Lei Song, Ke Xue, Chao Qian
Nanjing University, Nanjing University, Nanjing University, Nanjing University
Abstract:
Bayesian optimization (BO) is a sample-efficient method and has been widely used for optimizing expensive black-box functions. Recently, there has been considerable interest in the BO literature in optimizing functions that are affected by a context variable in the environment, which is uncontrollable by decision makers. In this paper, we focus on the optimization of functions' expectations over a continuous context variable that follows an unknown distribution. To address this problem, we propose two algorithms that employ kernel density estimation to learn the probability density function (PDF) of the continuous context variable online. The first algorithm is simpler, which directly optimizes the expectation under the estimated PDF. Considering that the estimated PDF may have high estimation error when the true distribution is complicated, we further propose the second algorithm that optimizes the distributionally robust objective. Theoretical results demonstrate that both algorithms have sub-linear Bayesian cumulative regret on the expectation objective. Furthermore, we conduct numerical experiments to empirically demonstrate the effectiveness of our algorithms.
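
A brief Python sketch of the simpler of the two algorithms described above, under the assumptions that a GP-UCB style acquisition ucb(x, c) is available and the context is one-dimensional: fit a kernel density estimate to the contexts observed so far and optimize the acquisition in expectation over it. Names and sample sizes are illustrative.

    import numpy as np
    from scipy.stats import gaussian_kde

    def expected_acquisition(x, observed_contexts, ucb, n_samples=256):
        """Monte-Carlo estimate of E_c[ucb(x, c)] under the online KDE (sketch)."""
        pdf_hat = gaussian_kde(observed_contexts)        # online density estimate
        contexts = pdf_hat.resample(n_samples).ravel()   # draw contexts from the KDE
        return np.mean([ucb(x, c) for c in contexts])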



Paperid:1404
Authors:Xiaolong Huang, Qiankun Li, Xueran Li, Xuesong Gao
School of Artificial Intelligent, Chongqing University of Technology, Chongqing, China, Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, China Department of Automation, University of Science and Technology of China, Hefei, China, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China Anhui University, Hefei, China, School of Artificial Intelligent, Chongqing University of Technology, Chongqing, China
Abstract:
Visual fine-tuning has garnered significant attention with the rise of pre-trained vision models. The current prevailing method, full fine-tuning, suffers from the issue of knowledge forgetting as it focuses solely on fitting the downstream training set. In this paper, we propose a novel weight rollback-based fine-tuning method called OLOR (One step Learning, One step Review). OLOR combines fine-tuning with optimizers, incorporating a weight rollback term into the weight update term at each step. This ensures consistency in the weight range of upstream and downstream models, effectively mitigating knowledge forgetting and enhancing fine-tuning performance. In addition, a layer-wise penalty is presented, employing penalty decay and diversified decay rates to adjust the weight rollback levels of layers and adapt to varying downstream tasks. Through extensive experiments on various tasks such as image classification, object detection, semantic segmentation, and instance segmentation, we demonstrate the general applicability and state-of-the-art performance of our proposed OLOR. Code is available at https://github.com/rainbow-xiao/OLOR-AAAI-2024.
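
A minimal Python sketch of a weight-rollback update in the spirit of the abstract above: after the usual gradient step, each weight is pulled slightly back toward its pre-trained value. The exact OLOR formulation and its layer-wise penalty decay are not reproduced here; the hyperparameters and function name are assumptions.

    import torch

    @torch.no_grad()
    def sgd_step_with_rollback(params, pretrained_params, lr=1e-3, rollback=1e-4):
        """One step learning (gradient), one step review (rollback toward p0). Sketch only."""
        for p, p0 in zip(params, pretrained_params):
            if p.grad is None:
                continue
            p.add_(p.grad, alpha=-lr)          # one step learning
            p.add_(p - p0, alpha=-rollback)    # one step review: roll back toward p0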



Paperid:1405
Authors:Yiming Huang, Yujie Zeng, Qiang Wu, Linyuan Lü
Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, School of Cyber Science and Technology, University of Science and Technology of China Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
Abstract:
Despite the recent successes of vanilla Graph Neural Networks (GNNs) on various tasks, their foundation on pairwise networks inherently limits their capacity to discern latent higher-order interactions in complex systems. To bridge this capability gap, we propose a novel approach exploiting the rich mathematical theory of simplicial complexes (SCs) - a robust tool for modeling higher-order interactions. Current SC-based GNNs are burdened by high complexity and rigidity, and quantifying higher-order interaction strengths remains challenging. Innovatively, we present a higher-order Flower-Petals (FP) model, incorporating FP Laplacians into SCs. Further, we introduce a Higher-order Graph Convolutional Network (HiGCN) grounded in FP Laplacians, capable of discerning intrinsic features across varying topological scales. By employing learnable graph filters, a parameter group within each FP Laplacian domain, we can identify diverse patterns where the filters' weights serve as a quantifiable measure of higher-order interaction strengths. The theoretical underpinnings of HiGCN's advanced expressiveness are rigorously demonstrated. Additionally, our empirical investigations reveal that the proposed model accomplishes state-of-the-art performance on a range of graph tasks and provides a scalable and flexible solution to explore higher-order interactions in graphs. Codes and datasets are available at https://github.com/Yiminghh/HiGCN.



Paperid:1406
Authors:Yufei Huang, Siyuan Li, Lirong Wu, Jin Su, Haitao Lin, Odin Zhang, Zihan Liu, Zhangyang Gao, Jiangbin Zheng, Stan Z. Li
Zhejiang University, Hangzhou AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University, Hangzhou AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University, Hangzhou AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University, Hangzhou AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University, Hangzhou AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University, Hangzhou, Zhejiang University, Hangzhou AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University, Hangzhou AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University, Hangzhou AI Lab, Research Center for Industries of the Future, Westlake University, AI Lab, Research Center for Industries of the Future, Westlake University
Abstract:
Protein structure-based property prediction has emerged as a promising approach for various biological tasks, such as protein function prediction and sub-cellular location estimation. The existing methods highly rely on experimental protein structure data and fail in scenarios where these data are unavailable. Predicted protein structures from AI tools (e.g., AlphaFold2) were utilized as alternatives. However, we observed that current practices, which simply employ accurately predicted structures during inference, suffer from notable degradation in prediction accuracy. While similar phenomena have been extensively studied in general fields (e.g., Computer Vision) as model robustness, their impact on protein property prediction remains unexplored. In this paper, we first investigate the reason behind the performance decrease when utilizing predicted structures, attributing it to the structure embedding bias from the perspective of structure representation learning. To study this problem, we identify a Protein 3D Graph Structure Learning Problem for Robust Protein Property Prediction (PGSL-RP3), collect benchmark datasets, and present a protein Structure embedding Alignment Optimization framework (SAO) to mitigate the problem of structure embedding bias between the predicted and experimental protein structures. Extensive experiments have shown that our framework is model-agnostic and effective in improving the property prediction of both predicted structures and experimental structures.



Paperid:1407
Authors:Zhilin Huang, Ling Yang, Zaixi Zhang, Xiangxin Zhou, Yu Bao, Xiawu Zheng, Yuwei Yang, Yu Wang, Wenming Yang
Shenzhen International Graduate School, Tsinghua University Peng Cheng Laboratory, Peking University, University of Science and Technology of China, University of Chinese Academy of Sciences, ByteDance, Peng Cheng Laboratory, ByteDance, Peng Cheng Laboratory, Shenzhen International Graduate School, Tsinghua University Peng Cheng Laboratory
Abstract:
Structure-based drug design (SBDD) aims to generate 3D ligand molecules that bind to specific protein targets. Existing 3D deep generative models including diffusion models have shown great promise for SBDD. However, it is complex to capture the essential protein-ligand interactions exactly in 3D space for molecular generation. To address this problem, we propose a novel framework, namely Binding-Adaptive Diffusion Models (BindDM). In BindDM, we adaptively extract the subcomplex, the essential part of the binding site responsible for protein-ligand interactions. Then the selected protein-ligand subcomplex is processed with SE(3)-equivariant neural networks, and transmitted back to each atom of the complex for augmenting the target-aware 3D molecule diffusion generation with binding interaction information. We iterate this hierarchical complex-subcomplex process with a cross-hierarchy interaction node to adequately fuse global binding context between the complex and its corresponding subcomplex. Empirical studies on the CrossDocked2020 dataset show BindDM can generate molecules with more realistic 3D structures and higher binding affinities towards the protein targets, with up to -5.92 Avg. Vina Score, while maintaining proper molecular properties. Our code is available at https://github.com/YangLing0818/BindDM



Paperid:1408
Authors:Tim Huisman, Jacobus G. M. van der Linden, Emir Demirović
Delft University of Technology, Delft University of Technology, Delft University of Technology
Abstract:
Survival analysis studies and predicts the time of death, or other singular unrepeated events, based on historical data, while the true time of death for some instances is unknown. Survival trees enable the discovery of complex nonlinear relations in a compact human comprehensible model, by recursively splitting the population and predicting a distinct survival distribution in each leaf node. We use dynamic programming to provide the first survival tree method with optimality guarantees, enabling the assessment of the optimality gap of heuristics. We improve the scalability of our method through a special algorithm for computing trees up to depth two. The experiments show that our method's run time is even lower than that of some heuristics for realistic cases, while obtaining out-of-sample performance similar to the state-of-the-art.



Paperid:1409
Authors:Fushuo Huo, Wenchao Xu, Song Guo, Jingcai Guo, Haozhao Wang, Ziming Liu, Xiaocheng Lu
Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, China, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR
Abstract:
Open-World Compositional Zero-shot Learning (OW-CZSL) aims to recognize novel compositions of state and object primitives in images with no priors on the compositional space, which induces a tremendously large output space containing all possible state-object compositions. Existing works either learn the joint compositional state-object embedding or predict simple primitives with separate classifiers. However, the former heavily relies on external word embedding methods, while the latter ignores the interactions of interdependent primitives. In this paper, we revisit the primitive prediction approach and propose a novel method, termed Progressive Cross-primitive Compatibility (ProCC), to mimic the human learning process for OW-CZSL tasks. Specifically, the cross-primitive compatibility module explicitly learns to model the interactions of state and object features with the trainable memory units, which efficiently acquires cross-primitive visual attention to reason high-feasibility compositions, without the aid of external knowledge. Moreover, to alleviate the invalid cross-primitive interactions, especially for partial-supervision conditions (pCZSL), we design a progressive training paradigm to optimize the primitive classifiers conditioned on pre-trained features in an easy-to-hard manner. Extensive experiments on three widely used benchmark datasets demonstrate that our method outperforms other representative methods on both OW-CZSL and pCZSL settings by large margins.



Paperid:1410
Authors:Fushuo Huo, Wenchao Xu, Jingcai Guo, Haozhao Wang, Yunfeng Fan
The Hong Kong Polytechnic University, The Hong Kong Polytechnic University, The Hong Kong Polytechnic University The Hong Kong Polytechnic University Shenzhen Research Institute, Huazhong University of Science and Technology, The Hong Kong Polytechnic University
Abstract:
This paper investigates a new, practical, but challenging problem named Non-exemplar Online Class-incremental continual Learning (NO-CL), which aims to preserve the discernibility of base classes without buffering data examples and efficiently learn novel classes continuously in a single-pass (i.e., online) data stream. The challenges of this task are mainly two-fold: (1) Both base and novel classes suffer from severe catastrophic forgetting as no previous samples are available for replay. (2) As the online data can only be observed once, there is no way to fully re-train the whole model, e.g., re-calibrate the decision boundaries via prototype alignment or feature distillation. In this paper, we propose a novel Dual-prototype Self-augment and Refinement method (DSR) for the NO-CL problem, which consists of two strategies: 1) Dual class prototypes: vanilla and high-dimensional prototypes are exploited to utilize the pre-trained information and obtain robust quasi-orthogonal representations rather than example buffers for both privacy preservation and memory reduction. 2) Self-augment and refinement: Instead of updating the whole network, we optimize high-dimensional prototypes alternatively with the extra projection module based on self-augment vanilla prototypes, through a bi-level optimization problem. Extensive experiments demonstrate the effectiveness and superiority of the proposed DSR in NO-CL.



Paperid:1411
Authors:Koji Ichikawa, Shinji Ito, Daisuke Hatano, Hanna Sumita, Takuro Fukunaga, Naonori Kakimura, Ken-ichi Kawarabayashi
NEC Corporation National Institute of Advanced Industrial Science and Technology, NEC Corporation National Institute of Advanced Industrial Science and Technology RIKEN AIP, RIKEN AIP, Tokyo Institute of Technology, Chuo University, Keio University, National Institute of Informatics The University of Tokyo
Abstract:
We consider the sparse contextual bandit problem where the arm feature affects the reward through the inner product of sparse parameters. Recent studies have developed sparsity-agnostic algorithms based on the greedy arm selection policy. However, the analysis of these algorithms requires strong assumptions on the arm feature distribution to ensure that the greedily selected samples are sufficiently diverse. One of the most common assumptions, relaxed symmetry, imposes approximate origin-symmetry on the distribution, which rules out distributions with origin-asymmetric support. In this paper, we show that the greedy algorithm is applicable to a wider range of arm feature distributions from two aspects. First, we show that a mixture distribution that has a greedy-applicable component is also greedy-applicable. Second, we propose new distribution classes, related to Gaussian mixture, discrete, and radial distributions, for which the sample diversity is guaranteed. The proposed classes can describe distributions with origin-asymmetric support and, in conjunction with the first claim, provide theoretical guarantees of the greedy policy for a very wide range of arm feature distributions.



Paperid:1412
Authors:Rashidul Islam, Huiyuan Chen, Yiwei Cai
Visa Research, Visa Research, Visa Research
Abstract:
Ensuring fairness in machine learning (ML) is crucial, particularly in applications that impact diverse populations. The majority of existing works heavily rely on the availability of protected features like race and gender. However, practical challenges such as privacy concerns and regulatory restrictions often prohibit the use of this data, limiting the scope of traditional fairness research. To address this, we introduce a Shared Latent Space-based Debiasing (SLSD) method that transforms data from both the target domain, which lacks protected features, and a separate source domain, which contains these features, into correlated latent representations. This allows for joint training of a cross-domain protected group estimator on the representations. We then debias the downstream ML model with an adversarial learning technique that leverages the group estimator. We also present a relaxed variant of SLSD, the R-SLSD, that occasionally accesses a small subset of protected features from the target domain during its training phase. Our extensive experiments on benchmark datasets demonstrate that our methods consistently outperform existing state-of-the-art models in standard group fairness metrics.



Paperid:1413
Authors:Andrei Ivanov, Stefan Ailuro
Independent Researcher, Independent Researcher
Abstract:
The paper presents Taylor Map Polynomial Neural Network (TMPNN), a novel form of very high-order polynomial regression, in which the same coefficients for a lower-to-moderate-order polynomial regression are iteratively reapplied so as to achieve a higher-order model without the number of coefficients to fit exploding in the usual curse-of-dimensionality way. This method naturally implements multi-target regression and can capture internal relationships between targets. We also introduce an approach for model interpretation in the form of systems of differential equations. By benchmarking on Feynman regression, UCI, Friedman-1, and real-life industrial datasets, we demonstrate that the proposed method performs comparably to the state-of-the-art regression methods and outperforms them on specific tasks.
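
A toy Python sketch of the Taylor-map idea described above: a single low-order polynomial map with one shared coefficient matrix is applied repeatedly, so the composed map is a very high-order polynomial without any extra coefficients. The feature construction, shapes, and fitting procedure here are illustrative assumptions, not the paper's exact architecture.

    import numpy as np
    from itertools import combinations_with_replacement

    def taylor_map_predict(x, W, order=2, n_iters=5):
        """Iterate one shared polynomial map W: output dim must equal input dim (sketch)."""
        state = np.asarray(x, dtype=float)
        for _ in range(n_iters):
            # build monomials of the current state up to `order`
            feats = [1.0]
            for d in range(1, order + 1):
                for idx in combinations_with_replacement(range(state.size), d):
                    feats.append(np.prod(state[list(idx)]))
            state = W @ np.asarray(feats)      # same coefficients every iteration
        return state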



Paperid:1414
Authors:Dmitry Ivanov, Omer Ben-Porat
Technion, Israel, Technion, Israel
Abstract:
Personalization in machine learning (ML) tailors models' decisions to the individual characteristics of users. While this approach has seen success in areas like recommender systems, its expansion into high-stakes fields such as healthcare and autonomous driving is hindered by the extensive regulatory approval processes involved. To address this challenge, we propose a novel framework termed represented Markov Decision Processes (r-MDPs) that is designed to balance the need for personalization with the regulatory constraints. In an r-MDP, we cater to a diverse user population, each with unique preferences, through interaction with a small set of representative policies. Our objective is twofold: efficiently match each user to an appropriate representative policy and simultaneously optimize these policies to maximize overall social welfare. We develop two deep reinforcement learning algorithms that efficiently solve r-MDPs. These algorithms draw inspiration from the principles of classic K-means clustering and are underpinned by robust theoretical foundations. Our empirical investigations, conducted across a variety of simulated environments, showcase the algorithms' ability to facilitate meaningful personalization even under constrained policy budgets. Furthermore, they demonstrate scalability, efficiently adapting to larger policy budgets.



Paperid:1415
Authors:Yacine Izza, Alexey Ignatiev, Peter J. Stuckey, Joao Marques-Silva
CREATE, National University of Singapore, Monash University, Monash University OPTIMA ARC Industrial Training and Transformation Centre, IRIT, CNRS
Abstract:
In the quest for Explainable Artificial Intelligence (XAI), one of the questions that frequently arises, given a decision made by an AI system, is: ``why was the decision made in this way?'' Formal approaches to explainability build a formal model of the AI system and use this to reason about the properties of the system. Given a set of feature values for an instance to be explained, and a resulting decision, a formal abductive explanation is a set of features such that, if they take the given values, the same decision always results. This explanation is useful: it shows that only some features were used in making the final decision. But it is narrow: it only shows that if the selected features take their given values, the decision is unchanged. It is possible that some features may change values and still lead to the same decision. In this paper we formally define inflated explanations: a set of features and, for each feature, a set of values (always including the value of the instance being explained), such that the decision remains unchanged for any of the allowed values of any of the features in the (inflated) abductive explanation. Inflated formal explanations are more informative than common abductive explanations since, for example, they allow us to see whether the exact value of a feature is important or whether any nearby value would do. Overall, they allow us to better understand the role of each feature in the decision. We show that inflated explanations can be computed at not much greater cost than abductive explanations, and that duality results for abductive explanations extend to inflated explanations.



Paperid:1416
Authors:Yesukhei Jagvaral, Francois Lanusse, Rachel Mandelbaum
Carnegie Mellon University, CNRS, Paris-Saclay University, Carnegie Mellon University
Abstract:
Diffusion-based generative models represent the current state-of-the-art for image generation. However, standard diffusion models are based on Euclidean geometry and do not translate directly to manifold-valued data. In this work, we develop extensions of both score-based generative models (SGMs) and Denoising Diffusion Probabilistic Models (DDPMs) to the Lie group of 3D rotations, SO(3). SO(3) is of particular interest in many disciplines such as robotics, biochemistry, and astronomy/cosmology. Contrary to more general Riemannian manifolds, SO(3) admits a tractable solution to heat diffusion, and allows us to implement efficient training of diffusion models. We apply both SO(3) DDPMs and SGMs to synthetic densities on SO(3) and demonstrate state-of-the-art results. Additionally, we demonstrate the practicality of our model on pose estimation tasks and in predicting correlated galaxy orientations for astrophysics/cosmology.



Paperid:1417
Authors:Abhinav Jain, Vaibhav Unhelkar
Rice University, Rice University
Abstract:
Offline imitation learning (IL) refers to learning expert behavior solely from demonstrations, without any additional interaction with the environment. Despite significant advances in offline IL, existing techniques find it challenging to learn policies for long-horizon tasks and require significant re-training when task specifications change. Towards addressing these limitations, we present GO-DICE, an offline IL technique for goal-conditioned long-horizon sequential tasks. GO-DICE discerns a hierarchy of sub-tasks from demonstrations and uses these to learn separate policies for sub-task transitions and action execution, respectively; this hierarchical policy learning facilitates long-horizon reasoning. Inspired by the expansive DICE family of techniques, policy learning at both levels takes place in the space of stationary distributions. Further, both policies are learnt with goal conditioning to minimize the need for retraining when task goals change. Experimental results substantiate that GO-DICE outperforms recent baselines, as evidenced by a marked improvement in the completion rate of increasingly challenging pick-and-place Mujoco robotic tasks. GO-DICE is also capable of leveraging imperfect demonstrations and partial task segmentation when available, both of which boost task performance relative to learning from expert demonstrations alone.



Paperid:1418
Authors:Nishant Jain, Pradeep Shenoy
Google Research India, Google Research India
Abstract:
Slow concept drift is a ubiquitous, yet understudied problem in practical machine learning systems. In such settings, although recent data is more indicative of future data, naively prioritizing recent instances runs the risk of losing valuable information from the past. We propose an optimization-driven approach towards balancing instance importance over large training windows. First, we model instance relevance using a mixture of multiple timescales of decay, allowing us to capture rich temporal trends. Second, we learn an auxiliary scorer model that recovers the appropriate mixture of timescales as a function of the instance itself. Finally, we propose a nested optimization objective for learning the scorer, by which it maximizes forward transfer for the learned model. Experiments on a large real-world dataset of 39M photos over a 9 year period show up to 15% relative gains in accuracy compared to other robust learning baselines. We replicate our gains on two collections of real-world datasets for non-stationary learning, and extend our work to continual learning settings where we also beat SOTA methods by large margins.
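
A small Python sketch of the instance-relevance model described above: an instance's weight is a convex mixture of exponential decays with different timescales, and the mixture (here given as logits that a scorer network could produce from the instance itself) controls how quickly that instance should be forgotten. The timescales and weighting form are illustrative assumptions.

    import numpy as np

    def instance_weight(age_days, mix_logits, timescales=(7., 30., 180., 720.)):
        """Relevance = softmax(mix_logits) . exp(-age / timescales). Sketch only."""
        mix = np.exp(mix_logits - np.max(mix_logits))
        mix /= mix.sum()                                   # softmax over timescales
        decays = np.exp(-age_days / np.asarray(timescales))
        return float(mix @ decays)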



Paperid:1419
Authors:Ragesh Jaiswal, Amit Kumar
IIT Delhi, IIT Delhi
Abstract:
Coresets for k-means and k-median problems yield a small summary of the data, which preserves the clustering cost with respect to any set of k centers. Recently, coresets have also been constructed for constrained k-means and k-median problems. However, the notion of coresets has the drawback that (i) they can only be applied in settings where the input points are allowed to have weights, and (ii) in general metric spaces, the size of the coresets can depend logarithmically on the number of points. The notion of weak coresets, which has less stringent requirements than coresets, has been studied in the context of classical k-means and k-median problems. A weak coreset is a pair (J,S) of subsets of points, where S acts as a summary of the point set and J as a set of potential centers. This pair satisfies the properties that (i) S is a good summary of the data as long as the k centers are chosen from J only, and (ii) there is a good choice of k centers in J with a cost close to the optimal cost. We develop this framework, which we call universal weak coresets, for constrained clustering settings. In conjunction with recent coreset constructions for constrained settings, our designs give greater data compression, are conceptually simpler, and apply to a wide range of constrained k-median and k-means problems.



Paperid:1420
Authors:Kasra Jalaldoust, Elias Bareinboim
Columbia University, Columbia University
Abstract:
One key assumption in the machine learning literature is that the testing and training data come from the same distribution, which is often violated in practice. The anchors that allow generalization to take place are causal, arising from the stability and modularity of the mechanisms underlying the system of variables. Building on the theory of causal transportability, we define the notion of ``transportable representations", and show that these representations are suitable candidates for the domain generalization task. Specifically, when the graphical assumptions about the underlying system are provided, the transportable representations can be characterized accordingly, and the distribution of the label conditioned on the representation can be computed in terms of the source distributions. Finally, we relax the assumption of having access to the underlying graph by proving a graphical-invariance duality theorem, which delineates certain probabilistic invariances present in the source data as a sound and complete criterion for generalizable classification. Our findings provide a unifying theoretical basis for several existing approaches to the domain generalization problem.



Paperid:1421
Authors:Amit Jena, Dileep Kalathil, Le Xie
Texas A&M University, Texas A&M University, Texas A&M University
Abstract:
This paper addresses the problem of Neural Network (NN) based adaptive stability certification in a dynamical system. The state-of-the-art methods, such as Neural Lyapunov Functions (NLFs), use NN-based formulations to assess the stability of a non-linear dynamical system and compute a Region of Attraction (ROA) in the state space. However, under parametric uncertainty, if the values of system parameters vary over time, the NLF methods fail to adapt to such changes and may lead to conservative stability assessment performance. We circumvent this issue by integrating Model Agnostic Meta-learning (MAML) with NLFs and propose meta-NLFs. In this process, we train a meta-function that adapts to any parametric shifts and updates into an NLF for the system with new test-time parameter values. We demonstrate the stability assessment performance of meta-NLFs on some standard benchmark autonomous dynamical systems.



Paperid:1422
Authors:Qirui Ji, Jiangmeng Li, Jie Hu, Rui Wang, Changwen Zheng, Fanjiang Xu
Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences State Key Laboratory of Intelligent Game, State Key Laboratory of Computer Science, Institute of Software Chinese Academy of Sciences, Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences State Key Laboratory of Intelligent Game, Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences
Abstract:
Graph contrastive learning is a general learning paradigm excelling at capturing invariant information from diverse perturbations in graphs. Recent works focus on exploring the structural rationale from graphs, thereby increasing the discriminability of the invariant information. However, such methods may lead graph models to mislearn with respect to the interpretability of graphs, so that the learned noisy and task-agnostic information interferes with graph prediction. To this end, with the purpose of exploring the intrinsic rationale of graphs, we accordingly propose to capture the dimensional rationale from graphs, which has not received sufficient attention in the literature. The conducted exploratory experiments attest to the feasibility of the aforementioned roadmap. To elucidate the innate mechanism behind the performance improvement arising from the dimensional rationale, we rethink the dimensional rationale in graph contrastive learning from a causal perspective and further formalize the causality among the variables in the pre-training stage to build the corresponding structural causal model. On the basis of the understanding of the structural causal model, we propose the dimensional rationale-aware graph contrastive learning approach, which introduces a learnable dimensional rationale acquiring network and a redundancy reduction constraint. The learnable dimensional rationale acquiring network is updated by leveraging a bi-level meta-learning technique, and the redundancy reduction constraint disentangles the redundant features through a decorrelation process during learning. Empirically, compared with state-of-the-art methods, our method can yield significant performance boosts on various benchmarks with respect to discriminability and transferability. The code implementation of our method is available at https://github.com/ByronJi/DRGCL.



Paperid:1423
Authors:Shulei Ji, Xinyu Yang
Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
Generating music with emotion is an important task in automatic music generation, in which emotion is evoked through a variety of musical elements (such as pitch and duration) that change over time and collaborate with each other. However, prior research on deep learning-based emotional music generation has rarely explored the contribution of different musical elements to emotions, let alone the deliberate manipulation of these elements to alter the emotion of music, which is not conducive to fine-grained element-level control over emotions. To address this gap, we present a novel approach employing musical element-based regularization in the latent space to disentangle distinct elements, investigate their roles in distinguishing emotions, and further manipulate elements to alter musical emotions. Specifically, we propose a novel VQ-VAE-based model named MusER. MusER incorporates a regularization loss to enforce the correspondence between the musical element sequences and the specific dimensions of latent variable sequences, providing a new solution for disentangling discrete sequences. Taking advantage of the disentangled latent vectors, a two-level decoding strategy that includes multiple decoders attending to latent vectors with different semantics is devised to better predict the elements. By visualizing the latent space, we conclude that MusER yields a disentangled and interpretable latent space and gain insights into the contribution of distinct elements to the emotional dimensions (i.e., arousal and valence). Experimental results demonstrate that MusER outperforms the state-of-the-art models for generating emotional music in both objective and subjective evaluation. Besides, we rearrange music through element transfer and attempt to alter the emotion of music by transferring emotion-distinguishable elements.



Paperid:1424
Authors:Xinyuan Ji, Zhaowei Zhu, Wei Xi, Olga Gadyatskaya, Zilong Song, Yong Cai, Yang Liu
Xi'an Jiaotong University Leiden University, Docta.ai, Xi'an Jiaotong University, Leiden University, Xi'an Jiaotong University, IQVIA Inc. & California State University, Monterey Bay, University of California, Santa Cruz
Abstract:
Federated Learning (FL) heavily depends on label quality for its performance. However, the label distribution among individual clients is always both noisy and heterogeneous. The high loss incurred by client-specific samples in heterogeneous label noise poses challenges for distinguishing between client-specific and noisy label samples, impacting the effectiveness of existing label noise learning approaches. To tackle this issue, we propose FedFixer, where the personalized model is introduced to cooperate with the global model to effectively select clean client-specific samples. In the dual models, updating the personalized model solely at a local level can lead to overfitting on noisy data due to limited samples, consequently affecting both the local and global models’ performance. To mitigate overfitting, we address this concern from two perspectives. Firstly, we employ a confidence regularizer to alleviate the impact of unconfident predictions caused by label noise. Secondly, a distance regularizer is implemented to constrain the disparity between the personalized and global models. We validate the effectiveness of FedFixer through extensive experiments on benchmark datasets. The results demonstrate that FedFixer can perform well in filtering noisy label samples on different clients, especially in highly heterogeneous label noise scenarios.
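
A minimal sketch of the two regularizers named in the abstract is given below. The entropy form of the confidence regularizer, the squared parameter-distance form of the distance regularizer, and the weighting coefficients are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fedfixer_local_loss(personal_model, global_model, x, y,
                        lam_conf=0.1, lam_dist=0.01):
    """Illustrative local objective: task loss + confidence regularizer
    + parameter-distance regularizer (forms and coefficients are assumed)."""
    logits = personal_model(x)
    task_loss = F.cross_entropy(logits, y)

    # Confidence regularizer: discourage unconfident (high-entropy) predictions,
    # which the abstract attributes to label noise.
    probs = F.softmax(logits, dim=1)
    conf_reg = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()

    # Distance regularizer: keep the personalized model close to the global model.
    dist_reg = sum((p - g.detach()).pow(2).sum()
                   for p, g in zip(personal_model.parameters(),
                                   global_model.parameters()))
    return task_loss + lam_conf * conf_reg + lam_dist * dist_reg
```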



Paperid:1425
Authors:Yuwen Ji, Lei Shi, Zhimeng Liu, Ge Wang
Beihang University (BUAA), Beihang University (BUAA), University of Science and Technology Beijing (USTB), University of Science and Technology Beijing (USTB)
Abstract:
Explaining the decisions made by Graph Neural Networks (GNNs) is vital for establishing trust and ensuring fairness in critical applications such as medicine and science. The prevalence of hierarchical structure in real-world graphs/networks raises an important question on GNN interpretability: "On each level of the graph structure, which specific fraction imposes the highest influence over the prediction?" Currently, the prevailing two categories of methods are incapable of achieving multi-level GNN explanation due to their flat or motif-centric nature. In this work, we formulate the problem of learning multi-level explanations out of GNN models and introduce a stratified explainer module, namely STFExplainer, that utilizes the concept of sufficient expansion to generate explanations on each stratum. Specifically, we learn a higher-level subgraph generator by leveraging both hierarchical structure and GNN-encoded input features. Experiment results on both synthetic and real-world datasets demonstrate the superiority of our stratified explainer on standard interpretability tasks and metrics such as fidelity and explanation recall, with an average improvement of 11% and 8% over the best alternative on each data type. The case study on material domains also confirms the value of our approach through detected multi-level graph patterns accurately reconstructing the knowledge-based ground truth.



Paperid:1426
Authors:Yongzhe Jia, Xuyun Zhang, Amin Beheshti, Wanchun Dou
Nanjing University, Macquarie University, Macquarie University, Nanjing University
Abstract:
Federated Learning (FL) has emerged as a promising solution in Edge Computing (EC) environments to process the proliferation of data generated by edge devices. By collaboratively optimizing the global machine learning models on distributed edge devices, FL circumvents the need for transmitting raw data and enhances user privacy. Despite practical successes, FL still confronts significant challenges including constrained edge device resources, multi-task deployment, and data heterogeneity. However, existing studies focus on mitigating the FL training costs of each single task while neglecting the resource consumption across multiple tasks in heterogeneous FL scenarios. In this paper, we propose Heterogeneous Federated Learning with Local Parameter Sharing (FedLPS) to fill this gap. FedLPS leverages principles from transfer learning to facilitate the deployment of multiple tasks on a single device by dividing the local model into a shareable encoder and task-specific encoders. To further reduce resource consumption, a channel-wise model pruning algorithm that shrinks the footprint of local models while accounting for both data and system heterogeneity is employed in FedLPS. Additionally, a novel heterogeneous model aggregation algorithm is proposed to aggregate the heterogeneous predictors in FedLPS. We implemented the proposed FedLPS on a real FL platform and compared it with state-of-the-art (SOTA) FL frameworks. The experimental results on five popular datasets and two modern DNN models illustrate that the proposed FedLPS significantly outperforms the SOTA FL frameworks by up to 4.88% and reduces the computational resource consumption by 21.3%. Our code is available at: https://github.com/jyzgh/FedLPS.
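
The split of the local model described above can be sketched as follows. The concrete layers, the task names, and the use of linear heads as the task-specific parts are illustrative assumptions rather than the FedLPS architecture.

```python
import torch.nn as nn

class LocalMultiTaskModel(nn.Module):
    """Illustrative split of a local model into a shareable encoder (aggregated by
    the server) and task-specific parts kept per task (details assumed)."""
    def __init__(self, num_classes_per_task: dict):
        super().__init__()
        self.encoder = nn.Sequential(             # shared across all local tasks
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.heads = nn.ModuleDict({               # one lightweight predictor per task
            name: nn.Linear(16, c) for name, c in num_classes_per_task.items()})

    def forward(self, x, task: str):
        return self.heads[task](self.encoder(x))

model = LocalMultiTaskModel({"task_a": 10, "task_b": 5})   # hypothetical task names
```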



Paperid:1427
Authors:Yuheng Jia, Xiaorui Peng, Ran Wang, Min-Ling Zhang
School of Computer Science and Engineering, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Mathematical Science, Shenzhen University Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen, China, School of Computer Science and Engineering, Southeast University Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China
Abstract:
In partial label learning (PLL), each instance is associated with a set of candidate labels, among which only one is correct. Traditional PLL methods almost all implicitly assume that the distribution of the classes is balanced. However, in real-world applications, the distribution of the classes is imbalanced or long-tailed, leading to the long-tailed partial label learning problem. Previous methods solve this problem mainly by ameliorating the ability to learn in the tail classes, which sacrifices the performance of the head classes, while preserving the performance of the head classes may in turn degrade the tail classes. Therefore, in this paper, we construct two classifiers, i.e., a head classifier for keeping the performance of dominant classes and a tail classifier for improving the performance of the tail classes. Then, we propose a classifier weight estimation module to automatically estimate the shot belongingness (head class or tail class) of the samples and allocate the weights for the head classifier and tail classifier when making predictions. This cooperation improves the prediction ability for both the head classes and the tail classes. The experiments on the benchmarks demonstrate that the proposed approach improves the accuracy over the SOTA methods by a substantial margin. Code and data are available at: https://github.com/pruirui/HTC-LTPLL.
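
The weighted cooperation of the two classifiers can be sketched in a few lines. The sigmoid gate, the per-sample weight shape, and the names of the modules are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def fused_prediction(feat, head_clf, tail_clf, weight_estimator):
    """Illustrative fusion of a head-class classifier and a tail-class classifier,
    with per-sample weights from a weight estimation module (all components assumed)."""
    w = torch.sigmoid(weight_estimator(feat))          # shape (B, 1): belongingness to head classes
    p_head = F.softmax(head_clf(feat), dim=1)
    p_tail = F.softmax(tail_clf(feat), dim=1)
    return w * p_head + (1.0 - w) * p_tail             # final class posterior
```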



Paperid:1428
Authors:Gaoxia Jiang, Jia Zhang, Xuefei Bai, Wenjian Wang, Deyu Meng
Shanxi University, Shanxi University, Shanxi University, Shanxi University, Xi'an Jiaotong University
Abstract:
Most noise-cleaning methods adopt either the correction mode or the filtering mode to build robust models. However, their effectiveness, applicability, and hyperparameter insensitivity have not been carefully studied. We compare the two cleaning modes via a rebuilt error bound in noisy environments. At the dataset level, Theorem 5 implies that correction is more effective than filtering when the cleaned datasets have close noise rates. At the sample level, Theorem 6 indicates that confident label noises (large noise probabilities) are more suitable to be corrected, and unconfident noises (medium noise probabilities) should be filtered. Besides, an imperfect hyperparameter may have fewer negative impacts on filtering than on correction. Unlike existing methods with a single cleaning mode, the proposed Fusion cleaning framework of Correction and Filtering (FCF) combines the advantages of different modes to deal with diverse suspicious labels. Experimental results demonstrate that our FCF method can achieve state-of-the-art performance on benchmark datasets.
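
The sample-level guidance stated above (correct confident noise, filter unconfident noise) can be written as a simple rule. The thresholds and the use of model predictions as corrected labels are assumptions for illustration, not the paper's full framework.

```python
import numpy as np

def fuse_correct_and_filter(labels, pred_labels, noise_prob,
                            t_correct=0.8, t_filter=0.4):
    """Illustrative fusion rule: correct samples with large noise probability,
    filter samples with medium noise probability, keep the rest (thresholds assumed)."""
    labels = labels.copy()
    correct_mask = noise_prob >= t_correct
    filter_mask = (noise_prob >= t_filter) & ~correct_mask
    labels[correct_mask] = pred_labels[correct_mask]   # correction mode
    keep_mask = ~filter_mask                           # filtering mode drops these samples
    return labels[keep_mask], keep_mask
```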



Paperid:1429
Authors:Haoran Jiang, Zhihao Sun, YingJie Tian
School of Mathematical Sciences, University of Chinese Academy of Sciences Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, School of Economics and Management, University of Chinese Academy of Sciences Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences MOE Social Science Laboratory of Digital Economic Forecasts and Policy Simulation at UCAS
Abstract:
Partial label learning (PLL), a significant research area, addresses the challenge of annotating each sample with a candidate label set containing the true label when obtaining accurate labels is infeasible. However, existing PLL methods often rely on generic datasets like CIFAR, where annotators can readily differentiate candidate labels and are unlikely to confuse them, making it less realistic for real-world partial label applications. In response, our research focuses on a rarely studied problem, PLL on fine-grained images with attributes, and we propose a novel framework called Shared to Learn, Distinct to Disambiguate (SoDisam). Within the candidate label set, the categories may exhibit numerous shared attribute features, posing a challenge in accurately distinguishing them. Rather than perceiving it as an impediment, we capitalize on these shared attributes as definitive sources of supervision. This insight guides us to learn an attribute-space visual representation to focus on the information from these shared attributes. Moreover, we introduce an attribute attention mechanism tailored to harness the remaining distinct attributes. This mechanism directs the originally holistic feature towards specific regions, capturing corresponding discriminative features. In addition, a dynamic disambiguation module is introduced, continuously adjusting the two aforementioned mechanisms to achieve the final disambiguation. Extensive experiments demonstrate the effectiveness of our approach on fine-grained partial label datasets. The proposed SoDisam framework not only addresses the challenges associated with fine-grained partial label learning but also provides a more realistic representation of real-world partial label scenarios.



Paperid:1430
Authors:Jincen Jiang, Lizhi Zhao, Xuequan Lu, Wei Hu, Imran Razzak, Meili Wang
Northwest A&F University, Northwest A&F University, La Trobe University, Peking University, University of New South Wales, Northwest A&F University
Abstract:
Recent works attempt to extend Graph Convolution Networks (GCNs) to point clouds for classification and segmentation tasks. These works tend to sample and group points to create smaller point sets locally and mainly focus on extracting local features through GCNs, while ignoring the relationship between point sets. In this paper, we propose the Dynamic Hop Graph Convolution Network (DHGCN) for explicitly learning the contextual relationships between the voxelized point parts, which are treated as graph nodes. Motivated by the intuition that the contextual information between point parts lies in the pairwise adjacent relationship, which can be depicted by the hop distance of the graph quantitatively, we devise a novel self-supervised part-level hop distance reconstruction task and design a novel loss function accordingly to facilitate training. In addition, we propose the Hop Graph Attention (HGA), which takes the learned hop distance as input for producing attention weights to allow edge features to contribute distinctively in aggregation. Eventually, the proposed DHGCN is a plug-and-play module that is compatible with point-based backbone networks. Comprehensive experiments on different backbones and tasks demonstrate that our self-supervised method achieves state-of-the-art performance. Our source codes are available at: https://github.com/Jinec98/DHGCN.



Paperid:1431
Authors:Kui Jiang, Junjun Jiang, Xianming Liu, Xin Xu, Xianzheng Ma
Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Wuhan University of Science and Technology, University of Oxford
Abstract:
The wavelet transform has emerged as a powerful tool for deciphering structural information within images, and recent research suggests that combining the wavelet transform with neural networks can yield substantial improvements in image deraining by harnessing the strengths of both the spatial domain and the frequency space. However, a comprehensive framework that takes into account the intrinsic frequency property and the correlation between rain residue and background remains to be fully explored. In this work, we propose to investigate the potential relationships among rain-free and residue components in the frequency domain, forming a frequency mutual revision network (FMRNet) for image deraining. Specifically, we explore the mutual representation of rain residue and background components in the frequency domain, so as to better separate the rain layer from the clean background while preserving the structural textures of the degraded images. Meanwhile, the rain distribution prediction from the low-frequency coefficient, which can be seen as a degradation prior, is used to refine the separation of rain residue and background components. Inversely, the updated rain residue is used to benefit the low-frequency rain distribution prediction, forming multi-layer mutual learning. Extensive experiments demonstrate that our proposed FMRNet delivers significant performance gains on seven datasets for the image deraining task, surpassing the state-of-the-art method ELFormer by 1.14 dB in PSNR on the Rain100L dataset with a similar computation cost. Code and retrained models are available at https://github.com/kuijiang94/FMRNet.



Paperid:1432
Authors:Nan Jiang, Yexiang Xue
Purdue University, Purdue University
Abstract:
Symbolic regression, as one of the most crucial tasks in AI for science, discovers governing equations from experimental data. Popular approaches based on genetic programming, Monte Carlo tree search, or deep reinforcement learning learn symbolic regression from a fixed dataset. These methods require massive datasets and long training time, especially when learning complex equations involving many variables. Recently, Control Variable Genetic Programming (CVGP) has been introduced, which accelerates the regression process by discovering equations from designed control variable experiments. However, the set of experiments is fixed a priori in CVGP, and we observe that sub-optimal selection of experiment schedules delays the discovery process significantly. To overcome this limitation, we propose Racing Control Variable Genetic Programming (Racing-CVGP), which carries out multiple experiment schedules simultaneously. A selection scheme similar to that used in selecting good symbolic equations in the genetic programming process is implemented to ensure that promising experiment schedules eventually win over the average ones. The unfavorable schedules are terminated early to save time for the promising ones. We evaluate Racing-CVGP on several synthetic and real-world datasets corresponding to true physics laws. We demonstrate that Racing-CVGP outperforms CVGP and a series of symbolic regressors which discover equations from fixed datasets.



Paperid:1433
Authors:Yuhua Jiang, Qihan Liu, Xiaoteng Ma, Chenghao Li, Yiqin Yang, Jun Yang, Bin Liang, Qianchuan Zhao
Department of Automation, Tsinghua University, Department of Automation, Tsinghua University, Department of Automation, Tsinghua University, Department of Automation, Tsinghua University, Department of Automation, Tsinghua University, Department of Automation, Tsinghua University, Department of Automation, Tsinghua University, Department of Automation, Tsinghua University
Abstract:
Among the remarkable successes of Reinforcement Learning (RL), self-play algorithms have played a crucial role in solving competitive games. However, current self-play RL methods commonly optimize the agent to maximize the expected win-rates against its current or historical copies, resulting in a limited strategy style and a tendency to get stuck in local optima. To address this limitation, it is important to improve the diversity of policies, allowing the agent to break stalemates and enhance its robustness when facing different opponents. In this paper, we present a novel perspective to promote diversity by considering that agents could have diverse risk preferences in the face of uncertainty. To achieve this, we introduce a novel reinforcement learning algorithm called Risk-sensitive Proximal Policy Optimization (RPPO), which smoothly interpolates between worst-case and best-case policy learning, enabling policy learning with desired risk preferences. Furthermore, by seamlessly integrating RPPO with population-based self-play, agents in the population optimize dynamic risk-sensitive objectives using experiences gained from playing against diverse opponents. Our empirical results demonstrate that our method achieves comparable or superior performance in competitive games and, importantly, leads to the emergence of diverse behavioral modes. Code is available at https://github.com/Jackory/RPBT.



Paperid:1434
Authors:Zhangqi Jiang, Tingjin Luo, Xinyan Liang
National University of Defense Technology, National University of Defense Technology, Shanxi Unversity
Abstract:
Due to the efficiency of integrating semantic consensus and complementary information across different views, multi-view classification methods have attracted much attention in recent years. However, multi-view data often suffers from both missing view features and insufficient label information, which significantly decreases the performance of traditional multi-view classification methods in practice. Learning under such a simultaneous lack of features and labels is crucial but rarely studied. To tackle these problems, in this paper we propose a novel Deep Incomplete Multi-view Learning Network (DIMvLN) that incorporates graph networks and semi-supervised learning. Specifically, DIMvLN first designs deep graph networks to effectively recover missing data, assigning pseudo-labels to large amounts of unlabeled instances and refining the incomplete feature information. Meanwhile, to enhance the label information, a novel pseudo-label generation strategy with the similarity constraints of unlabeled instances is proposed to exploit additional supervisory information and guide the completion module to preserve more semantic information of absent multi-view data. Besides, we design view-specific representation extractors with the autoencoder structure and contrastive loss to learn high-level semantic representations for each view, promote cross-view consistencies and augment the separability between different categories. Finally, extensive experimental results demonstrate the effectiveness of our DIMvLN, attaining noteworthy performance improvements compared to state-of-the-art competitors on several public benchmark datasets. Code will be available at GitHub.



Paperid:1435
Authors:Yang Jiao, Kai Yang, Tiancheng Wu, Chengtao Jian, Jianwei Huang
Department of Computer Science and Technology, Tongji University, Department of Computer Science and Technology, Tongji University Key Laboratory of Embedded System and Service Computing Ministry of Education at Tongji University Shanghai Research Institute for Intelligent Autonomous Systems, Department of Computer Science and Technology, Tongji University, Department of Computer Science and Technology, Tongji University, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen Shenzhen Institute of Artificial Intelligence and Robotics for Society
Abstract:
Trilevel learning, also called trilevel optimization (TLO), has been recognized as a powerful modelling tool for hierarchical decision processes and widely applied in many machine learning applications, such as robust neural architecture search, hyperparameter optimization, and domain adaptation. Tackling TLO problems has presented a great challenge due to their nested decision-making structure. In addition, existing works on TLO face the following key challenges: 1) they all focus on the non-distributed setting, which may lead to privacy breaches; 2) they do not offer any non-asymptotic convergence analysis which characterizes how fast an algorithm converges. To address the aforementioned challenges, this paper proposes an asynchronous federated trilevel optimization method to solve TLO problems. The proposed method utilizes u-cuts to construct a hyper-polyhedral approximation for the TLO problem and solve it in an asynchronous manner. We demonstrate that the proposed u-cuts are applicable to not only convex functions but also a wide range of non-convex functions that meet the u-weakly convex assumption. Furthermore, we theoretically analyze the non-asymptotic convergence rate for the proposed method by showing that its iteration complexity to obtain an ϵ-stationary point is upper bounded by O(1/ϵ²). Extensive experiments on real-world datasets have been conducted to elucidate the superiority of the proposed method, e.g., it has a faster convergence rate with a maximum acceleration of approximately 80%.



Paperid:1436
Authors:Kun Jin, Tongxin Yin, Zhongzhu Chen, Zeyu Sun, Xueru Zhang, Yang Liu, Mingyan Liu
University of Michigan, Ann Arbor, University of Michigan, Ann Arbor, University of Michigan, Ann Arbor, University of Michigan, Ann Arbor, Ohio State University, UC Santa Cruz, University of Michigan, Ann Arbor
Abstract:
We consider a federated learning (FL) system consisting of multiple clients and a server, where the clients aim to collaboratively learn a common decision model from their distributed data. Unlike the conventional FL framework that assumes the clients' data is static, we consider scenarios where the clients' data distributions may be reshaped by the deployed decision model. In this work, we leverage the idea of distribution shift mappings in performative prediction to formalize this model-dependent data distribution shift and propose a performative FL framework. We first introduce necessary and sufficient conditions for the existence of a unique performative stable solution and characterize its distance to the performative optimal solution. Then we propose the performative FedAvg algorithm and show that it converges to the performative stable solution at a rate of O(1/T) under both full and partial participation schemes. In particular, we use novel proof techniques and show how the clients' heterogeneity influences the convergence. Numerical results validate our analysis and provide valuable insights into real-world applications.
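
A toy simulation of the model-dependent shift and the averaging loop is sketched below. The linear shift map, the ridge local solver, full client participation, and a single local step per round are simplifying assumptions rather than the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_client_data(theta_deployed, base_mean, eps=0.5, n=256):
    """Model-dependent distribution shift: client features drift with the
    deployed model via a simple linear shift map (chosen for illustration)."""
    mean = base_mean + eps * theta_deployed
    X = mean + rng.normal(size=(n, mean.size))
    y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.1, size=n)
    return X, y

def local_ridge(X, y, lam=1e-2):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Performative FedAvg-style loop: deploy, let distributions react, retrain, average.
base_means = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([-1.0, 0.5])]
theta = np.zeros(2)
for rnd in range(50):
    local_models = [local_ridge(*sample_client_data(theta, m)) for m in base_means]
    theta = np.mean(local_models, axis=0)   # iterates approach a performative stable point
print(theta)
```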



Paperid:1437
Authors:Lyudong Jin, Ming Tang, Meng Zhang, Hao Wang
Zhejiang University, Southern University of Science and Technology, Zhejiang University, Monash University
Abstract:
Mobile edge computing (MEC) is a promising paradigm for real-time applications with intensive computational needs (e.g., autonomous driving), as it can reduce the processing delay. In this work, we focus on the timeliness of computation-intensive updates, measured by Age-of-Information (AoI), and study how to jointly optimize the task updating and offloading policies for AoI with fractional form. Specifically, we consider edge load dynamics and formulate a task scheduling problem to minimize the expected time-average AoI. The uncertain edge load dynamics, the nature of the fractional objective, and the hybrid continuous-discrete action space (due to the joint optimization) make this problem challenging and existing approaches not directly applicable. To this end, we propose a fractional reinforcement learning (RL) framework and prove its convergence. We further design a model-free fractional deep RL (DRL) algorithm, where each device makes scheduling decisions with the hybrid action space without knowing the system dynamics and decisions of other devices. Experimental results show that our proposed algorithms reduce the average AoI by up to 57.6% compared with several non-fractional benchmarks.



Paperid:1438
Authors:Tianyuan Jin, Hao-Lun Hsu, William Chang, Pan Xu
National University of Singapore, Duke University, University of California, Los Angeles, Duke University
Abstract:
We study the multi-agent multi-armed bandit (MAMAB) problem, where agents are factored into overlapping groups. Each group represents a hyperedge, forming a hypergraph over the agents. At each round of interaction, the learner pulls a joint arm (composed of individual arms for each agent) and receives a reward according to the hypergraph structure. Specifically, we assume there is a local reward for each hyperedge, and the reward of the joint arm is the sum of these local rewards. Previous work introduced the multi-agent Thompson sampling (MATS) algorithm and derived a Bayesian regret bound. However, it remains an open problem how to derive a frequentist regret bound for Thompson sampling in this multi-agent setting. To address this issue, we propose an efficient variant of MATS, the epsilon-exploring Multi-Agent Thompson Sampling (eps-MATS) algorithm, which performs MATS exploration with probability epsilon while adopting a greedy policy otherwise. We prove that eps-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size. We also derive a lower bound for this setting, which implies our frequentist regret upper bound is optimal up to constant and logarithmic terms, when the hypergraph is sufficiently sparse. Thorough experiments on standard MAMAB problems demonstrate the superior performance and the improved computational efficiency of eps-MATS compared with existing algorithms in the same setting.
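
The epsilon-exploring idea can be illustrated in the single-agent Bernoulli case; the hypergraph and multi-agent structure of eps-MATS are omitted here, and all constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_ts_bernoulli(true_means, horizon=5000, eps=0.01):
    """Single-agent illustration of epsilon-exploring Thompson sampling:
    with probability eps, sample arm scores from the Beta posterior (Thompson step);
    otherwise act greedily on the posterior means."""
    k = len(true_means)
    alpha, beta = np.ones(k), np.ones(k)
    regret = 0.0
    for _ in range(horizon):
        if rng.random() < eps:
            scores = rng.beta(alpha, beta)          # Thompson exploration
        else:
            scores = alpha / (alpha + beta)         # greedy on posterior means
        a = int(np.argmax(scores))
        reward = rng.random() < true_means[a]       # Bernoulli feedback
        alpha[a] += reward
        beta[a] += 1 - reward
        regret += max(true_means) - true_means[a]
    return regret

print(eps_ts_bernoulli([0.2, 0.5, 0.55]))
```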



Paperid:1439
Authors:Yufei Jin, Richard Gao, Yi He, Xingquan Zhu
Florida Atlantic University, Rice University, Old Dominion University, Florida Atlantic University
Abstract:
Label Distribution Learning (LDL), as a more general learning setting than generic single-label and multi-label learning, has been commonly used in computer vision and many other applications. To date, existing LDL approaches are designed and applied to data without considering the interdependence between instances. In this paper, we propose a Graph Label Distribution Learning (GLDL) framework, which explicitly models three types of relationships: instance-instance, label-label, and instance-label, to learn the label distribution for networked data. A label-label network is learned to capture label-to-label correlation, through which GLDL can accurately learn label distributions for nodes. Dual graph convolutional network (GCN) co-training with heterogeneous message passing ensures that the two GCNs, one focusing on the instance-instance relationship and the other targeting label-label correlation, are jointly trained such that the instance-instance relationship can help induce label-label correlation and vice versa. Our theoretical study derives the error bound of GLDL. For verification, four benchmark datasets with label distributions for nodes are created using common graph benchmarks. The experiments show that considering dependency helps learn better label distributions for networked data, compared to state-of-the-art LDL baselines. In addition, GLDL not only outperforms simple GCN and graph attention networks (GAT) using distribution loss but is also superior to its variant considering the label-label relationship as a static network. GLDL and its benchmarks are the first research endeavors to address LDL for graphs. Code and benchmark data are released for public access.



Paperid:1440
Authors:Baoyu Jing, Yuchen Yan, Kaize Ding, Chanyoung Park, Yada Zhu, Huan Liu, Hanghang Tong
University of Illinois at Urbana-Champaign, University of Illinois at Urbana-Champaign, Northwestern University, Korea Advanced Institute of Science & Technology, MIT-IBM Watson AI Lab, IBM Research, Arizona State University, University of Illinois at Urbana-Champaign
Abstract:
A fundamental challenge of bipartite graph representation learning is how to extract informative node embeddings. Self-Supervised Learning (SSL) is a promising paradigm to address this challenge. Most recent bipartite graph SSL methods are based on contrastive learning which learns embeddings by discriminating positive and negative node pairs. Contrastive learning usually requires a large number of negative node pairs, which could lead to computational burden and semantic errors. In this paper, we introduce a novel synergistic representation learning model (STERLING) to learn node embeddings without negative node pairs. STERLING preserves the unique local and global synergies in bipartite graphs. The local synergies are captured by maximizing the similarity of the inter-type and intra-type positive node pairs, and the global synergies are captured by maximizing the mutual information of co-clusters. Theoretical analysis demonstrates that STERLING could improve the connectivity between different node types in the embedding space. Extensive empirical evaluation on various benchmark datasets and tasks demonstrates the effectiveness of STERLING for extracting node embeddings.



Paperid:1441
Authors:Yonghyeon Jo, Sunwoo Lee, Junghyuk Yeom, Seungyul Han
UNIST, UNIST, UNIST, UNIST
Abstract:
Recently, deep multi-agent reinforcement learning (MARL) has gained significant popularity due to its success in various cooperative multi-agent tasks. However, exploration still remains a challenging problem in MARL due to the partial observability of the agents and the exploration space that can grow exponentially as the number of agents increases. Firstly, in order to address the scalability issue of the exploration space, we define a formation-based equivalence relation on the exploration space and aim to reduce the search space by exploring only meaningful states in different formations. Then, we propose a novel formation-aware exploration (FoX) framework that encourages partially observable agents to visit the states in diverse formations by guiding them to be well aware of their current formation solely based on their own observations. Numerical results show that the proposed FoX framework significantly outperforms the state-of-the-art MARL algorithms on Google Research Football (GRF) and sparse StarCraft II multi-agent challenge (SMAC) tasks.



Paperid:1442
Authors:Harshit Joshi, Abishai Ebenezer, José Cambronero Sanchez, Sumit Gulwani, Aditya Kanade, Vu Le, Ivan Radiček, Gust Verbruggen
Stanford University, University of Washington, Microsoft, Microsoft, Microsoft Research, Microsoft, Demiurg, Microsoft
Abstract:
Spreadsheets are a vital tool for end-user data management. Using large language models for formula authoring assistance in these environments can be difficult, as these models are expensive to train and challenging to deploy due to their size (up to billions of parameters). We present FLAME, a transformer-based model trained exclusively on Excel formulas that leverages domain insights to achieve competitive performance while being substantially smaller (60M parameters) and training on two orders of magnitude less data. We curate a training dataset using sketch deduplication, introduce an Excel-specific formula tokenizer, and use domain-specific versions of masked span prediction and noisy auto-encoding as pre-training objectives. We evaluate FLAME on formula repair, formula completion, and similarity-based formula retrieval. FLAME can outperform much larger models, such as the Davinci (175B) and Cushman (12B) variants of Codex and CodeT5 (220M), in 10 of 14 evaluation settings for the repair and completion tasks. For formula retrieval, FLAME outperforms CodeT5, CodeBERT, and GraphCodeBERT.



Paperid:1443
Authors:Shalmali Joshi, Junzhe Zhang, Elias Bareinboim
Columbia University, Columbia University, Columbia University
Abstract:
Learning personalized treatment policies is a formative challenge in many real-world applications, including healthcare, econometrics, and artificial intelligence. However, the effectiveness of candidate policies is not always identifiable, i.e., it is not uniquely computable from the combination of the available data and assumptions about the generating mechanisms. This paper studies policy learning from data collected in various non-identifiable settings, i.e., (1) observational studies with unobserved confounding; (2) randomized experiments with partial observability; and (3) their combinations. We derive sharp, closed-form bounds from observational and experimental data over the conditional treatment effects. Based on these novel bounds, we further characterize the problem of safe policy learning and develop an algorithm that trains a policy from data guaranteed to achieve, at least, the performance of the baseline policy currently deployed. Finally, we validate our proposed algorithm on synthetic data and a large clinical trial, demonstrating that it guarantees safe behaviors and robust performance.



Paperid:1444
Authors:Chanyong Jung, Gihyun Kwon, Jong Chul Ye
Department of Brain and Bio Engineering, KAIST, Daejeon, Republic of Korea, Department of Brain and Bio Engineering, KAIST, Daejeon, Republic of Korea, Department of Brain and Bio Engineering, KAIST, Daejeon, Republic of Korea Kim Jaechul Graduate School of AI, KAIST, Daejeon, Republic of Korea
Abstract:
Recently, patch-wise contrastive learning is drawing attention for image translation by exploring the semantic correspondence between the input image and the output image. To further explore the patch-wise topology for high-level semantic understanding, here we exploit the graph neural network to capture the topology-aware features. Specifically, we construct the graph based on the patch-wise similarity from a pretrained encoder, whose adjacency matrix is shared to enhance the consistency of the patch-wise relation between the input and the output. Then, we obtain the node feature from the graph neural network, and enhance the correspondence between the nodes by increasing mutual information using the contrastive loss. In order to capture the hierarchical semantic structure, we further propose graph pooling. Experimental results demonstrate state-of-the-art results for image translation thanks to the semantic encoding by the constructed graphs.



Paperid:1445
Authors:Andrew B. Kahng, Robert R. Nerem, Yusu Wang, Chien-Yi Yang
University of California San Diego, University of California San Diego, University of California San Diego, University of California San Diego
Abstract:
Recent years have witnessed rapid advances in the use of neural networks to solve combinatorial optimization problems. Nevertheless, designing the "right" neural model that can effectively handle a given optimization problem can be challenging, and often there is no theoretical understanding or justification of the resulting neural model. In this paper, we focus on the rectilinear Steiner minimum tree (RSMT) problem, which is of critical importance in IC layout design and as a result has attracted numerous heuristic approaches in the VLSI literature. Our contributions are twofold. On the methodology front, we propose NN-Steiner, a novel mixed neural-algorithmic framework for computing RSMTs that leverages the celebrated PTAS algorithmic framework of Arora to solve this problem (and other geometric optimization problems). Our NN-Steiner replaces key algorithmic components within Arora's PTAS by suitable neural components. In particular, NN-Steiner only needs four neural network (NN) components that are called repeatedly within an algorithmic framework. Crucially, each of the four NN components is only of bounded size independent of input size, and thus easy to train. Furthermore, as the NN component is learning a generic algorithmic step, once learned, the resulting mixed neural-algorithmic framework generalizes to much larger instances not seen in training. Our NN-Steiner, to the best of our knowledge, is the first neural architecture of bounded size that has the capacity to approximately solve RSMT (and its variants). On the empirical front, we show how NN-Steiner can be implemented and demonstrate the effectiveness of our resulting approach, especially in terms of generalization, by comparing with state-of-the-art methods (both neural and non-neural based).



Paperid:1446
Authors:Neha Kalibhat, Kanika Narang, Hamed Firooz, Maziar Sanjabi, Soheil Feizi
University of Maryland, College Park, Meta AI, Meta AI, Meta AI, University of Maryland, College Park
Abstract:
Self-supervised learning (SSL) has shown impressive results in downstream classification tasks. However, there is limited work in understanding their failure modes and interpreting their learned representations. In this paper, we study the representation space of state-of-the-art self-supervised models including SimCLR, SwaV, MoCo, BYOL, DINO, SimSiam, VICReg and Barlow Twins. Without the use of class label information, we discover discriminative features that correspond to unique physical attributes in images, present mostly in correctly-classified representations. Using these features, we can compress the representation space by up to 40% without significantly affecting linear classification performance. We then propose Self-Supervised Representation Quality Score (or Q-Score), an unsupervised score that can reliably predict if a given sample is likely to be mis-classified during linear evaluation, achieving AUPRC of 91.45 on ImageNet-100 and 78.78 on ImageNet-1K. Q-Score can also be used as a regularization term on pre-trained encoders to remedy low-quality representations. Fine-tuning with Q-Score regularization can boost the linear probing accuracy of SSL models by up to 5.8% on ImageNet-100 and 3.7% on ImageNet-1K compared to their baselines. Finally, using gradient heatmaps and Salient ImageNet masks, we define a metric to quantify the interpretability of each representation. We show that discriminative features are strongly correlated to core attributes and that enhancing these features through Q-Score regularization makes SSL representations more interpretable.



Paperid:1447
Authors:Haneol Kang, Dong-Wan Choi
Inha University, Inha University
Abstract:
The stability-plasticity dilemma is a major challenge in continual learning, as it involves balancing the conflicting objectives of maintaining performance on previous tasks while learning new tasks. In this paper, we propose the recall-oriented continual learning framework to address this challenge. Inspired by the human brain's ability to separate the mechanisms responsible for stability and plasticity, our framework consists of a two-level architecture where an inference network effectively acquires new knowledge and a generative network recalls past knowledge when necessary. In particular, to maximize the stability of past knowledge, we investigate the complexity of knowledge depending on different representations, thereby introducing the generative adversarial meta-model (GAMM) that incrementally learns task-specific parameters instead of input data samples of the task. Through our experiments, we show that our framework not only effectively learns new knowledge without any disruption but also achieves high stability of previous knowledge in both task-aware and task-agnostic learning scenarios. Our code is available at: https://github.com/bigdata-inha/recall-orientedcl-framework.



Paperid:1448
Authors:Qiyu Kang, Kai Zhao, Yang Song, Yihang Xie, Yanan Zhao, Sijie Wang, Rui She, Wee Peng Tay
Nanyang Technological University, Nanyang Technological University, C3 AI, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University
Abstract:
In this work, we rigorously investigate the robustness of graph neural fractional-order differential equation (FDE) models. This framework extends beyond traditional graph neural (integer-order) ordinary differential equation (ODE) models by implementing the time-fractional Caputo derivative. Utilizing fractional calculus allows our model to consider long-term memory during the feature updating process, diverging from the memoryless Markovian updates seen in traditional graph neural ODE models. The superiority of graph neural FDE models over graph neural ODE models has been established in environments free from attacks or perturbations. While traditional graph neural ODE models have been verified to possess a degree of stability and resilience in the presence of adversarial attacks in existing literature, the robustness of graph neural FDE models, especially under adversarial conditions, remains largely unexplored. This paper undertakes a detailed assessment of the robustness of graph neural FDE models. We establish a theoretical foundation outlining the robustness characteristics of graph neural FDE models, highlighting that they maintain more stringent output perturbation bounds in the face of input and graph topology disturbances, compared to their integer-order counterparts. Our empirical evaluations further confirm the enhanced robustness of graph neural FDE models, highlighting their potential in adversarially robust applications.



Paperid:1449
Authors:Taniya Kapoor, Abhishek Chandra, Daniel M. Tartakovsky, Hongrui Wang, Alfredo Nunez, Rolf Dollevoet
TU Delft, Eindhoven University of Technology, Stanford University, TU Delft, TU Delft, TU Delft
Abstract:
A primary challenge of physics-informed machine learning (PIML) is its generalization beyond the training domain, especially when dealing with complex physical problems represented by partial differential equations (PDEs). This paper aims to enhance the generalization capabilities of PIML, facilitating practical, real-world applications where accurate predictions in unexplored regions are crucial. We leverage the inherent causality and temporal sequential characteristics of PDE solutions to fuse PIML models with recurrent neural architectures based on systems of ordinary differential equations, referred to as neural oscillators. Through effectively capturing long-time dependencies and mitigating the exploding and vanishing gradient problem, neural oscillators foster improved generalization in PIML tasks. Extensive experimentation involving time-dependent nonlinear PDEs and biharmonic beam equations demonstrates the efficacy of the proposed approach. Incorporating neural oscillators outperforms existing state-of-the-art methods on benchmark problems across various metrics. Consequently, the proposed method improves the generalization capabilities of PIML, providing accurate solutions for extrapolation and prediction beyond the training data.



Paperid:1450
Authors:Sanjay Kariyappa, Leonidas Tsepenekas, Freddy Lécué, Daniele Magazzeni
JPMorganChase AI Research, JPMorganChase AI Research, JPMorganChase AI Research, JPMorganChase AI Research
Abstract:
The SHAP framework provides a principled method to explain the predictions of a model by computing feature importance. Motivated by applications in finance, we introduce the Top-k Identification Problem (TkIP) and its ordered variant (TkIP-O), where the objective is to identify the subset (or ordered subset for TkIP-O) of k features corresponding to the highest SHAP values with PAC guarantees. While any sampling-based method that estimates SHAP values (such as KernelSHAP and SamplingSHAP) can be trivially adapted to solve TkIP, doing so is highly sample inefficient. Instead, we leverage the connection between SHAP values and multi-armed bandits (MAB) to show that both TkIP and TkIP-O can be reduced to variants of problems in the MAB literature. This reduction allows us to use insights from the MAB literature to develop sample-efficient variants of KernelSHAP and SamplingSHAP. We propose KernelSHAP@k and SamplingSHAP@k for solving TkIP, along with KernelSHAP-O and SamplingSHAP-O to solve the ordering problem in TkIP-O. We perform extensive experiments using several credit-related datasets to show that our methods offer significant improvements of up to 40× in sample efficiency and 39× in runtime.



Paperid:1451
Authors:Nikolai Karpov, Qin Zhang
Indiana University, Indiana University
Abstract:
In this paper, we study the collaborative learning model, which concerns the tradeoff between parallelism and communication overhead in multi-agent multi-armed bandits. For regret minimization in multi-armed bandits, we present the first set of tradeoffs between the number of rounds of communication between the agents and the regret of the collaborative learning process.



Paperid:1452
Authors:Amirreza Kazemi, Martin Ester
Simon Fraser University, Simon Fraser University
Abstract:
Individual treatment effect (ITE) estimation requires adjusting for the covariate shift between populations with different treatments, and deep representation learning has shown great promise in learning a balanced representation of covariates. However, the existing methods mostly consider the scenario of binary treatments. In this paper, we consider the more practical and challenging scenario in which the treatment is a continuous variable (e.g., the dosage of a medication), and we address the two main challenges of this setup. We propose the adversarial counterfactual regression network (ACFR) that adversarially minimizes the representation imbalance in terms of KL divergence, and also maintains the impact of the treatment value on the outcome prediction by leveraging an attention mechanism. Theoretically, we demonstrate that the ACFR objective function is grounded in an upper bound on the counterfactual outcome prediction error. Our experimental evaluation on semi-synthetic datasets demonstrates the empirical superiority of ACFR over a range of state-of-the-art methods.



Paperid:1453
Authors:Gwladys Kelodjou, Laurence Rozé, Véronique Masson, Luis Galárraga, Romaric Gaudel, Maurice Tchuente, Alexandre Termier
Univ Rennes, Inria, CNRS, IRISA - UMR 6074, F35000 Rennes, France, Univ Rennes, INSA Rennes, CNRS, Inria, IRISA - UMR 6074, F35000 Rennes, France, Univ Rennes, Inria, CNRS, IRISA - UMR 6074, F35000 Rennes, France, Univ Rennes, Inria, CNRS, IRISA - UMR 6074, F35000 Rennes, France, Univ Rennes, Inria, CNRS, IRISA - UMR 6074, F35000 Rennes, France, Sorbonne University, IRD, University of Yaoundé I, UMI 209 UMMISCO, BP 337 Yaoundé, Cameroon, Univ Rennes, Inria, CNRS, IRISA - UMR 6074, F35000 Rennes, France
Abstract:
Machine learning techniques, such as deep learning and ensemble methods, are widely used in various domains due to their ability to handle complex real-world tasks. However, their black-box nature has raised multiple concerns about the fairness, trustworthiness, and transparency of computer-assisted decision-making. This has led to the emergence of local post-hoc explainability methods, which offer explanations for individual decisions made by black-box algorithms. Among these methods, Kernel SHAP is widely used due to its model-agnostic nature and its well-founded theoretical framework. Despite these strengths, Kernel SHAP suffers from high instability: different executions of the method with the same inputs can lead to significantly different explanations, which diminishes the relevance of the explanations. The contribution of this paper is two-fold. On the one hand, we show that Kernel SHAP's instability is caused by its stochastic neighbor selection procedure, which we adapt to achieve full stability without compromising explanation fidelity. On the other hand, we show that by restricting the neighbors generation to perturbations of size 1 -- which we call the coalitions of Layer 1 -- we obtain a novel feature-attribution method that is fully stable, computationally efficient, and still meaningful.
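
As a rough illustration of attributions built only from size-1 perturbations, consider the sketch below. It is a simplified reading of the abstract, not the paper's estimator; in particular, averaging the single-feature addition and removal effects is an assumption made here for brevity.

```python
import numpy as np

def layer1_attribution(model, x, background):
    """Illustrative deterministic attribution from size-1 ("Layer 1") perturbations:
    the effect of removing a single feature from the instance and of adding a single
    feature to the background, averaged per feature."""
    x, background = np.asarray(x, float), np.asarray(background, float)
    d = x.size
    f_x = model(x[None, :])[0]
    f_bg = model(background[None, :])[0]
    phi = np.zeros(d)
    for i in range(d):
        x_minus_i = x.copy(); x_minus_i[i] = background[i]      # remove only feature i
        bg_plus_i = background.copy(); bg_plus_i[i] = x[i]      # add only feature i
        removal_effect = f_x - model(x_minus_i[None, :])[0]
        addition_effect = model(bg_plus_i[None, :])[0] - f_bg
        phi[i] = 0.5 * (removal_effect + addition_effect)
    return phi

# Example with a simple linear model (for which both effects coincide):
w = np.array([2.0, -1.0, 0.5])
model = lambda X: X @ w
print(layer1_attribution(model, x=[1.0, 1.0, 1.0], background=[0.0, 0.0, 0.0]))
```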



Paperid:1454
Authors:Dongha Kim, Yongchan Choi, Kunwoong Kim, Ilsang Ohn, Yongdai Kim
Department of Statistics, Sungshin Women's University, Toss bank, Department of Statistics, Seoul National University, Department of Statistics, Inha University, Department of Statistics, Seoul National University
Abstract:
Most recent state-of-the-art algorithms for handling noisy label problems are based on the memorization effect, which is a phenomenon that deep neural networks (DNNs) memorize clean data before noisy ones. While the memorization effect can be a powerful tool, there are several cases where the memorization effect does not occur. Examples are imbalanced class distributions and heavy contamination on labels. To address this limitation, we introduce a whole new approach called the interpolation with the over-fitted model (IOFM), which leverages over-fitted deep neural networks. The IOFM utilizes a new finding of over-fitted DNNs: for a given training sample, its neighborhoods chosen from the feature space are distributed differently in the original input space depending on the cleanness of the target sample. The IOFM has notable features in two aspects: 1) it yields superior results even when the training data are imbalanced or heavily noisy, 2) since we utilize over-fitted deep neural networks, a fine-tuning procedure to select the optimal training epoch, which is an essential yet sensitive factor for the success of the memorization effect, is not required, and thus, the IOFM can be used by non-experts. Through extensive experiments, we show that our method can serve as a promising alternative to existing solutions dealing with noisy labels, offering improved performance even in challenging situations.



Paperid:1455
Authors:Dongmin Kim, Sunghyun Park, Jaegul Choo
KAIST, KAIST, KAIST
Abstract:
Time-series anomaly detection deals with the problem of detecting anomalous timesteps by learning normality from the sequence of observations. However, the concept of normality evolves over time, leading to a "new normal problem", where the distribution of normality can be changed due to the distribution shifts between training and test data. This paper highlights the prevalence of the new normal problem in unsupervised time-series anomaly detection studies. To tackle this issue, we propose a simple yet effective test-time adaptation strategy based on trend estimation and a self-supervised approach to learning new normalities during inference. Extensive experiments on real-world benchmarks demonstrate that incorporating the proposed strategy into the anomaly detector consistently improves the model's performance compared to the existing baselines, leading to robustness to the distribution shifts.
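
A minimal sketch of the trend-estimation part of such a test-time strategy is shown below; the trailing moving-average trend, the window size, and the z-score detector are assumptions for illustration, not the paper's components.

```python
import numpy as np

def detrended_anomaly_scores(series, scores_fn, window=64):
    """Estimate the current trend online with a causal moving average (an assumption),
    subtract it, and score deviations from the resulting "new normal"."""
    series = np.asarray(series, float)
    trend = np.empty_like(series)
    for t in range(len(series)):
        lo = max(0, t - window + 1)
        trend[t] = series[lo:t + 1].mean()          # causal trend estimate at test time
    return scores_fn(series - trend)

# Usage: a z-score detector on the detrended residuals of a drifting series.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 1000) + rng.normal(scale=0.1, size=1000)   # drifting "new normal"
x[700] += 3.0                                                      # injected anomaly
z = lambda r: np.abs((r - r.mean()) / (r.std() + 1e-8))
print(np.argmax(detrended_anomaly_scores(x, z)))                   # highest score at index 700
```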



Paperid:1456
Authors:Doyoung Kim, Dongmin Park, Yooju Shin, Jihwan Bang, Hwanjun Song, Jae-Gil Lee
KAIST, Daejeon, Republic of Korea, KAIST, Daejeon, Republic of Korea, KAIST, Daejeon, Republic of Korea, KAIST, Daejeon, Republic of Korea, KAIST, Daejeon, Republic of Korea, KAIST, Daejeon, Republic of Korea
Abstract:
We propose a novel framework DropTop that suppresses the shortcut bias in online continual learning (OCL) while being adaptive to the varying degree of the shortcut bias incurred by a continuously changing environment. Based on the observed high-attention property of the shortcut bias, highly activated features are considered candidates for debiasing. More importantly, resolving the limitation of the online environment where prior knowledge and auxiliary data are not ready, two novel techniques---feature map fusion and adaptive intensity shifting---enable us to automatically determine the appropriate level and proportion of the candidate shortcut features to be dropped. Extensive experiments on five benchmark datasets demonstrate that, when combined with various OCL algorithms, DropTop increases the average accuracy by up to 10.4% and decreases the forgetting by up to 63.2%.
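
The core debiasing step of dropping highly activated features can be sketched as follows; the fixed drop ratio and the channel-mean activation statistic are assumptions, and the adaptive fusion and intensity-shifting components described above are not modeled here.

```python
import torch

def drop_top_activated(feature_map, drop_ratio=0.1):
    """Illustrative shortcut suppression: zero out the most highly activated channels,
    treated here as shortcut candidates (drop ratio assumed)."""
    b, c, h, w = feature_map.shape
    k = max(1, int(drop_ratio * c))
    strength = feature_map.abs().mean(dim=(2, 3))        # (B, C) per-channel activation
    top_idx = strength.topk(k, dim=1).indices            # candidate shortcut channels
    mask = torch.ones_like(strength)
    mask.scatter_(1, top_idx, 0.0)
    return feature_map * mask[:, :, None, None]

# Usage on a dummy feature map:
print(drop_top_activated(torch.randn(2, 32, 8, 8)).shape)
```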



Paperid:1457
Authors:Han-Byul Kim, Joo Hyung Lee, Sungjoo Yoo, Hong-Seok Kim
Seoul National University Google, Google, Seoul National University, Google
Abstract:
Mixed-precision quantization of efficient networks often suffers from activation instability encountered in the exploration of bit selections. To address this problem, we propose a novel method called MetaMix which consists of bit selection and weight training phases. The bit selection phase iterates two steps, (1) the mixed-precision-aware weight update, and (2) the bit-search training with the fixed mixed-precision-aware weights, both of which combined reduce activation instability in mixed-precision quantization and contribute to fast and high-quality bit selection. The weight training phase exploits the weights and step sizes trained in the bit selection phase and fine-tunes them, thereby offering fast training. Our experiments with efficient and hard-to-quantize networks, i.e., MobileNet v2 and v3, and ResNet-18 on ImageNet show that our proposed method pushes the boundary of mixed-precision quantization, in terms of accuracy vs. operations, by outperforming both mixed- and single-precision SOTA methods.



Paperid:1458
Authors:Juyeop Kim, Junha Park, Songkuk Kim, Jong-Seok Lee
Yonsei University, Yonsei University, Yonsei University, Yonsei University
Abstract:
Neural networks with self-attention (a.k.a. Transformers) like ViT and Swin have emerged as a better alternative to traditional convolutional neural networks (CNNs). However, our understanding of how the new architecture works is still limited. In this paper, we focus on the phenomenon that Transformers show higher robustness against corruptions than CNNs, while not being overconfident. This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction by empirically investigating how the output of the penultimate layer moves in the representation space as the input data moves linearly within a small area. In particular, we show the following. (1) While CNNs exhibit a fairly linear relationship between the input and output movements, Transformers show a nonlinear relationship for some data. For those data, the output of Transformers moves in a curved trajectory as the input moves linearly. (2) When a data point is located in a curved region, it is hard to move it out of the decision region since the output moves along a curved trajectory instead of a straight line to the decision boundary, resulting in the high robustness of Transformers. (3) If a data point is slightly modified to jump out of the curved region, the movements afterwards become linear and the output goes to the decision boundary directly. In other words, there does exist a decision boundary near the data, which is hard to find only because of the curved representation space. This explains the underconfident prediction of Transformers. Also, we examine mathematical properties of the attention operation that induce a nonlinear response to linear perturbations. Finally, we share our additional findings, regarding what contributes to the curved representation space of Transformers, and how the curvedness evolves during training.



Paperid:1459
Authors:Kwang In Kim
POSTECH
Abstract:
We study the distributed gradient aggregation problem, where individual clients contribute to learning a central model by sharing parameter gradients constructed from local losses. However, errors in some gradients, caused by low-quality data or adversaries, can degrade the learning process when naively combined. Existing robust gradient aggregation approaches assume that local data represent the global data-generating distribution, which may not always apply to heterogeneous (non-i.i.d.) client data. We propose a new algorithm that can robustly aggregate gradients from potentially heterogeneous clients. Our approach leverages the manifold structure inherent in heterogeneous client gradients and evaluates gradient anomaly degrees by projecting them onto this manifold. This algorithm is implemented as a simple and efficient method that accumulates random projections within the subspace defined by the nearest neighbors within a gradient cloud. Our experiments demonstrate consistent performance improvements over state-of-the-art robust aggregation algorithms.
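A minimal sketch of the neighbor-subspace idea described above (my illustrative reading, not the paper's exact algorithm): each client gradient is scored by the residual left after projecting it onto the subspace spanned by its nearest neighbors in the gradient cloud, and anomalous gradients are softly down-weighted. Function names and the weighting rule are assumptions.

```python
import numpy as np

def anomaly_scores(grads, k=5):
    """grads: (n_clients, d) array of flattened client gradients."""
    n = grads.shape[0]
    scores = np.zeros(n)
    for i in range(n):
        dists = np.linalg.norm(grads - grads[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]            # exclude the gradient itself
        basis, _, _ = np.linalg.svd(grads[neighbors].T, full_matrices=False)
        proj = basis @ (basis.T @ grads[i])                # projection onto neighbor subspace
        scores[i] = np.linalg.norm(grads[i] - proj)        # residual norm = anomaly degree
    return scores

def robust_aggregate(grads, k=5):
    s = anomaly_scores(grads, k)
    w = np.exp(-s / (s.mean() + 1e-8))                     # soft down-weighting of outliers
    return (w[:, None] * grads).sum(0) / w.sum()
```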



Paperid:1460
Authors:Sungyoon Kim, Yunseon Choi, Daiki E. Matsunaga, Kee-Eung Kim
KAIST, KAIST, KAIST, KAIST
Abstract:
Offline Goal-Conditioned Reinforcement Learning (Offline GCRL) is an important problem in RL that focuses on acquiring diverse goal-oriented skills solely from pre-collected behavior datasets. In this setting, the reward feedback is typically absent except when the goal is achieved, which makes it difficult to learn policies, especially from a finite dataset of suboptimal behaviors. In addition, realistic scenarios involve long-horizon planning, which necessitates the extraction of useful skills within sub-trajectories. Recently, the conditional diffusion model has been shown to be a promising approach for generating high-quality long-horizon plans for RL. However, its practicality for the goal-conditioned setting is still limited due to a number of technical assumptions made by existing methods. In this paper, we propose SSD (Sub-trajectory Stitching with Diffusion), a model-based offline GCRL method that leverages the conditional diffusion model to address these limitations. In summary, we use a diffusion model that generates future plans conditioned on the target goal and value, with the target value estimated from the goal-relabeled offline dataset. We report state-of-the-art performance on the standard benchmark set of GCRL tasks, and demonstrate the capability to successfully stitch segments of suboptimal trajectories in the offline data to generate high-quality plans.



Paperid:1461
Authors:Taehoon Kim, Jaeyoo Park, Bohyung Han
Seoul National University, Seoul National University, Seoul National University
Abstract:
We propose a novel class incremental learning approach, which incorporates a feature augmentation technique motivated by adversarial attacks. We employ a classifier learned in the past to complement training examples of previous tasks. The proposed approach offers a unique perspective for utilizing previous knowledge in class incremental learning, since it augments features of arbitrary target classes using examples in other classes via adversarial attacks on a previously learned classifier. By allowing Cross-Class Feature Augmentations (CCFA), each class in the old tasks conveniently populates samples in the feature space, which alleviates the collapse of the decision boundaries caused by sample deficiency for the previous tasks, especially when the number of stored exemplars is small. This idea can be easily incorporated into existing class incremental learning algorithms without any architecture modification. Extensive experiments on the standard benchmarks show that our method consistently outperforms existing class incremental learning methods by significant margins in various scenarios, especially under an environment with an extremely limited memory budget.
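A hedged sketch of the underlying mechanism (my paraphrase, not the authors' code): a feature vector from some other class is pushed toward a target old class by descending the old classifier's cross-entropy for that class, producing a synthetic exemplar for the old class. The step schedule and the FGSM-style update are assumptions for illustration.

```python
import torch

def cross_class_feature_augment(old_classifier, feat, target_class,
                                steps=10, step_size=0.1):
    """Perturb a single feature vector `feat` toward `target_class` of the old classifier."""
    feat = feat.clone().detach().requires_grad_(True)
    for _ in range(steps):
        logits = old_classifier(feat.unsqueeze(0))
        loss = torch.nn.functional.cross_entropy(
            logits, torch.tensor([target_class]))
        loss.backward()
        with torch.no_grad():
            feat -= step_size * feat.grad.sign()   # FGSM-style step toward the target class
        feat.grad.zero_()
    return feat.detach()
```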



Paperid:1462
Authors:Woo Kyung Kim, Minjong Yoo, Honguk Woo
Sungkyunkwan University, Sungkyunkwan University, Sungkyunkwan University
Abstract:
Skill-based reinforcement learning (RL) approaches have shown considerable promise, especially in solving long-horizon tasks via hierarchical structures. These skills, learned task-agnostically from offline datasets, can accelerate the policy learning process for new tasks. Yet, the application of these skills in different domains remains restricted due to their inherent dependency on the datasets, which poses a challenge when attempting to learn a skill-based policy via RL for a target domain different from the datasets' domains. In this paper, we present a novel offline skill learning framework DuSkill which employs a guided Diffusion model to generate versatile skills extended from the limited skills in datasets, thereby enhancing the robustness of policy learning for tasks in different domains. Specifically, we devise a guided diffusion-based skill decoder in conjunction with hierarchical encoding to disentangle the skill embedding space into two distinct representations, one encapsulating domain-invariant behaviors and the other delineating the factors that induce domain variations in the behaviors. Our DuSkill framework enhances the diversity of skills learned offline, thus accelerating the learning of high-level policies for different domains. Through experiments, we show that DuSkill outperforms other skill-based imitation learning and RL algorithms on several long-horizon tasks, demonstrating its benefits in few-shot imitation and online RL.



Paperid:1463
Authors:Woosung Kim, Donghyeon Ki, Byung-Jun Lee
Korea University, Korea University, Korea University
Abstract:
One of the major challenges of offline reinforcement learning (RL) is dealing with distribution shifts that stem from the mismatch between the trained policy and the data collection policy. Stationary distribution correction estimation (DICE) algorithms have addressed this issue by regularizing the policy optimization with the f-divergence between the state-action visitation distributions of the data collection policy and the optimized policy. While such regularization naturally integrates to derive an objective for obtaining the optimal state-action visitation, such an implicit policy optimization framework has shown limited performance in practice. We observe that the reduced performance is attributed to the biased estimate and the properties of the conjugate functions of the f-divergence regularization. In this paper, we improve the regularized implicit policy optimization framework by relieving the bias and reshaping the conjugate function by relaxing the constraints. We show that the relaxation adjusts the degree of involvement of the sub-optimal samples in optimization, and we derive a new offline RL algorithm that benefits from the relaxed framework, improving upon a previous implicit policy optimization algorithm by a large margin.



Paperid:1464
Authors:Young-Jin Kim, Min-Jun Kim, Kyunghwan An, Jinwoo Ahn, Jaeseok Kim, Yu-Jung Heo, Du-Seong Chang, Eun-Sol Kim
Hanyang University, Hanyang University, Hanyang University, Hanyang University, KT Corporation, KT Corporation, KT Corporation, Hanyang University
Abstract:
With the ability to collect vast amounts of image and natural language data from the web, there has been remarkable advancement in Large-scale Language Models (LLMs). This progress has led to the emergence of chatbots and dialogue systems capable of fluent conversations with humans. As the variety of devices enabling interactions between humans and agents expands, and the performance of text-based dialogue systems improves, research on visual dialog has recently been proposed. However, visual dialog requires understanding sequences of pairs consisting of images and sentences, making it challenging to gather sufficient data for training large-scale models from the web. In this paper, we propose a new multimodal learning method that leverages existing large-scale models designed for each modality, to enable model training for visual dialog with small visual dialog datasets. The key ideas of our approach are: 1) storing the history or context during the progression of visual dialog in the form of spatiotemporal graphs, and 2) introducing small modulation blocks between modality-specific models and the graphs to align the semantic spaces. For implementation, we introduce a novel structure-aware cross-attention method, which retrieves relevant image and text knowledge for utterance generation from the pretrained models. In experiments, we achieve new state-of-the-art performance on three visual dialog datasets, including the most challenging one, COMET.



Paperid:1465
Authors:Rolando Kindelan Nuñez, Mircea Petrache, Mauricio Cerda, Nancy Hitschfeld
Universidad de Chile, UC Chile, Universidad de Chile, Universidad de Chile
Abstract:
Persistence diagrams (PDs) play a central role in topological data analysis, and are used in an ever-increasing variety of applications. The comparison of PD data requires computing distances among large sets of PDs, with metrics which are accurate, theoretically sound, and fast to compute. Especially for denser multidimensional PDs, such comparison metrics are lacking. While, on the one hand, Wasserstein-type distances have high accuracy and theoretical guarantees, they incur high computational cost. On the other hand, distances between vectorizations such as Persistence Statistics (PSs) have lower computational cost, but lack the accuracy guarantees and theoretical properties of a true distance over PD space. In this work we introduce a class of pseudodistances called Extended Topological Pseudodistances (ETDs), which have tunable complexity, and can approximate Sliced and classical Wasserstein distances at the high-complexity extreme, while being computationally lighter and close to Persistence Statistics at the lower-complexity extreme, and thus allow users to interpolate between the two metrics. We build theoretical comparisons to show how to fit our new distances at an intermediate level between persistence vectorizations and Wasserstein distances. We also experimentally verify that ETDs outperform PSs in terms of accuracy and outperform Wasserstein and Sliced Wasserstein distances in terms of computational complexity.



Paperid:1466
Authors:Kathryn E. Kirchoff, Travis Maxfield, Alexander Tropsha, Shawn M. Gomez
Department of Computer Science, UNC Chapel Hill, Eshelman School of Pharmacy, UNC Chapel Hill, Eshelman School of Pharmacy, UNC Chapel Hill, Department of Pharmacology, UNC Chapel Hill Joint Department of Biomedical Engineering at UNC Chapel Hill and NC State University
Abstract:
In deep learning for drug discovery, molecular representations are often based on sequences, known as SMILES, which allow for straightforward implementation of natural language processing methodologies, one being the sequence-to-sequence autoencoder. However, we observe that training an autoencoder solely on SMILES is insufficient to learn molecular representations that are semantically meaningful, where semantics are specified by the structural (graph-to-graph) similarities between molecules. We demonstrate by example that SMILES-based autoencoders may map structurally similar molecules to distant codes, resulting in an incoherent latent space that does not necessarily respect the semantic similarities between molecules. To address this shortcoming we propose the Semantically-Aware Latent Space Autoencoder (SALSA) for molecular representations: a SMILES-based transformer autoencoder modified with a contrastive task aimed at learning graph-to-graph similarities between molecules. To accomplish this, we develop a novel dataset comprised of sets of structurally similar molecules and opt for a supervised contrastive loss that is able to incorporate full sets of positive samples. We evaluate the semantic awareness of SALSA representations by comparing to its ablated counterparts, and show empirically that SALSA learns representations that maintain 1) structural awareness, 2) physicochemical awareness, 3) biological awareness, and 4) semantic continuity.



Paperid:1467
Authors:Ben Kizaric, Daniel Pimentel-Alarcón
University of Wisconsin - Madison, University of Wisconsin - Madison
Abstract:
Low-dimensional models like PCA are often used to simplify complex datasets by learning a single approximating subspace. This paradigm has expanded to union-of-subspaces models, like those learned by subspace clustering. In this paper, we present Principal Component Trees (PCTs), a graph structure that generalizes these ideas to identify mixtures of components that together describe the subspace structure of high-dimensional datasets. Each node in a PCT corresponds to a principal component of the data, and the edges between nodes indicate the components that must be mixed to produce a subspace that approximates a portion of the data. In order to construct PCTs, we propose two angle-distribution hypothesis tests to detect subspace clusters in the data. To analyze, compare, and select the best PCT model, we define two persistent homology measures that describe their shape. We show our construction yields two key properties of PCTs, namely ancestral orthogonality and non-decreasing singular values. Our main theoretical results show that learning PCTs reduces to PCA under multivariate normality, and that PCTs are efficient parameterizations of intersecting unions of subspaces. Finally, we use PCTs to analyze neural network latent spaces, word embeddings, and reference image datasets.



Paperid:1468
Authors:Rune Kjærsgaard, Ahcène Boubekki, Line Clemmensen
Department of Applied Mathematics and Computer Science, Technical University of Denmark, Denmark, Machine Learning and Uncertainty, Physikalisch-Technische Bundesanstalt, Germany, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Denmark
Abstract:
Prototypical self-explainable classifiers have emerged to meet the growing demand for interpretable AI systems. These classifiers are designed to incorporate high transparency in their decisions by basing inference on similarity with learned prototypical objects. While these models are designed with diversity in mind, the learned prototypes often do not sufficiently represent all aspects of the input distribution, particularly those in low-density regions. Such lack of sufficient data representation, known as representation bias, has been associated with various detrimental properties related to machine learning diversity and fairness. In light of this, we introduce pantypes, a new family of prototypical objects designed to capture the full diversity of the input distribution through a sparse set of objects. We show that pantypes can empower prototypical self-explainable models by occupying divergent regions of the latent space and thus fostering high diversity, interpretability and fairness.



Paperid:1469
Authors:Masahiro Kohjima
NTT Corporation
Abstract:
Shuffled regression is the problem of learning regression models from shuffled data that consist of a set of input features and a set of target outputs, where the correspondence between the inputs and outputs is unknown. This study proposes a new deep learning method for shuffled regression called Shuffled Deep Regression (SDR). We derive a sparse and stochastic variant of the Expectation-Maximization algorithm for SDR that iteratively updates discrete latent variables and the parameters of neural networks. The effectiveness of the proposal is confirmed by benchmark data experiments.
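To make the alternating structure concrete, here is a hedged EM-style sketch for shuffled regression (illustrative only, not the paper's SDR algorithm): the E-step matches each target to its nearest current prediction, and the M-step fits the network on the induced pairs. The hard nearest-prediction assignment is a simplifying assumption.

```python
import torch

def em_shuffled_regression(model, X, Y, epochs=50, lr=1e-3):
    """X: (n, d_in) inputs; Y: (n, d_out) shuffled targets with unknown correspondence."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        with torch.no_grad():
            pred = model(X)                    # (n, d_out) current predictions
            dist = torch.cdist(Y, pred)        # E-step: distance of each target to each prediction
            assign = dist.argmin(dim=1)        # index of the matched input for every target
        # M-step: ordinary regression on the matched (input, target) pairs
        loss = torch.nn.functional.mse_loss(model(X[assign]), Y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```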



Paperid:1470
Authors:Patrick Kolpaczki, Viktor Bengs, Maximilian Muschalik, Eyke Hüllermeier
Paderborn University, Institute of Informatics, University of Munich (LMU) Munich Center for Machine Learning, Institute of Informatics, University of Munich (LMU) Munich Center for Machine Learning, Institute of Informatics, University of Munich (LMU) Munich Center for Machine Learning
Abstract:
The Shapley value, which is arguably the most popular approach for assigning a meaningful contribution value to players in a cooperative game, has recently been used intensively in explainable artificial intelligence. Its meaningfulness is due to axiomatic properties that only the Shapley value satisfies, which, however, comes at the expense of an exact computation growing exponentially with the number of agents. Accordingly, a number of works are devoted to the efficient approximation of the Shapley value, most of which revolve around the notion of an agent's marginal contribution. In this paper, we propose SVARM and Stratified SVARM, two parameter-free and domain-independent approximation algorithms based on a representation of the Shapley value detached from the notion of marginal contribution. We prove unmatched theoretical guarantees regarding their approximation quality and provide empirical results, including synthetic games as well as common explainability use cases, comparing ourselves with state-of-the-art methods.



Paperid:1471
Authors:Heejo Kong, Suneung Kim, Ho-Joong Kim, Seong-Whan Lee
Dept. of Brain and Cognitive Engineering, Korea University, Dept. of Artificial Intelligence, Korea University, Dept. of Artificial Intelligence, Korea University, Dept. of Artificial Intelligence, Korea University
Abstract:
Recent advances in semi-supervised learning (SSL) have relied on the optimistic assumption that labeled and unlabeled data share the same class distribution. However, this assumption is often violated in real-world scenarios, where unlabeled data may contain out-of-class samples. SSL with such uncurated unlabeled data leads the trained models to be corrupted. In this paper, we propose a robust SSL method for learning from uncurated real-world data within the context of open-set semi-supervised learning (OSSL). Unlike previous works that rely on feature similarity distance, our method exploits uncertainty in logits. By leveraging task-dependent predictions of logits, our method is capable of robust learning even in the presence of highly correlated outliers. Our key contribution is to present an unknown-aware graph regularization (UAG), a novel technique that enhances the performance of uncertainty-based OSSL frameworks. The technique addresses not only the conflict between training objectives for inliers and outliers but also the limitation of applying the same training rule to all outlier classes, both of which exist in previous uncertainty-based approaches. Extensive experiments demonstrate that UAG surpasses state-of-the-art OSSL methods by a large margin across various protocols. Codes are available at https://github.com/heejokong/UAGreg.



Paperid:1472
Authors:Atli Kosson, Dongyang Fan, Martin Jaggi
EPFL, EPFL, EPFL
Abstract:
Batch Normalization (BN) is widely used to stabilize the optimization process and improve the test performance of deep neural networks. The regularization effect of BN depends on the batch size, and explicitly using smaller batch sizes with Batch Normalization, a method known as Ghost Batch Normalization (GBN), has been found to improve generalization in many settings. We investigate the effectiveness of GBN by disentangling the induced ``Ghost Noise'' from normalization and quantitatively analyzing the distribution of noise as well as its impact on model performance. Inspired by our analysis, we propose a new regularization technique called Ghost Noise Injection (GNI) that imitates the noise in GBN without incurring the detrimental train-test discrepancy effects of small-batch training. We experimentally show that GNI can provide a greater generalization benefit than GBN. Ghost Noise Injection can also be beneficial in otherwise non-noisy settings such as layer-normalized networks, providing additional evidence of the usefulness of Ghost Noise in Batch Normalization as a regularizer.
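For reference, a minimal Ghost Batch Normalization sketch (assumptions: 2D inputs, training mode only, no running statistics): normalization statistics are computed per "ghost" sub-batch rather than over the full batch, which is the source of the extra noise discussed above.

```python
import torch
import torch.nn as nn

class GhostBatchNorm1d(nn.Module):
    def __init__(self, num_features, ghost_size=32, eps=1e-5):
        super().__init__()
        self.ghost_size = ghost_size
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):                                   # x: (batch, features)
        chunks = x.split(self.ghost_size, dim=0)            # split into ghost batches
        out = []
        for c in chunks:
            mu = c.mean(0, keepdim=True)                    # per-ghost-batch statistics
            var = c.var(0, unbiased=False, keepdim=True)
            out.append((c - mu) / torch.sqrt(var + self.eps))
        return torch.cat(out, 0) * self.weight + self.bias
```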



Paperid:1473
Authors:Atsutoshi Kumagai, Tomoharu Iwata, Yasuhiro Fujiwara
NTT Computer and Data Science Laboratories, NTT Communication Science Laboratories, NTT Communication Science Laboratories
Abstract:
We propose a method to learn prediction models, such as classifiers, for unseen target tasks where labeled and unlabeled data are absent but a few relevant input features for solving the tasks are given. Although machine learning requires data for training, data are often difficult to collect in practice. On the other hand, for many applications, a few relevant features would be more easily obtained. Although zero-shot learning and zero-shot domain adaptation use external knowledge to adapt to unseen classes or tasks without data, relevant features have not been used in existing studies. The proposed method improves the generalization performance on the target tasks, where there are no data but a few relevant features are given, by meta-learning from labeled data in related tasks. In the meta-learning phase, it is essential to simulate test phases on target tasks where prediction model learning is required without data. To this end, our neural network-based prediction model is meta-learned such that it correctly responds to perturbations of the relevant features on randomly generated synthetic data. By this modeling, the prediction model can explicitly learn the discriminability of the relevant features without real target data. When unlabeled training data are available in the target tasks, the proposed method can incorporate such data to boost the performance in a unified framework. Our experiments demonstrate that the proposed method outperforms various existing methods on four real-world datasets.



Paperid:1474
Authors:Anastasiia Kurmukova, Deniz Gunduz
Imperial College London, Imperial College London
Abstract:
This paper introduces a novel approach called the "friendly attack", aimed at enhancing the performance of error correction channel codes. Inspired by the concept of adversarial attacks, our method leverages the idea of introducing slight perturbations to the neural network input, resulting in a substantial impact on the network's performance. By introducing small perturbations to fixed-point modulated codewords before transmission, we effectively improve the decoder's performance without violating the input power constraint. The perturbation design is accomplished by a modified iterative fast gradient method. This study investigates various decoder architectures suitable for computing gradients to obtain the desired perturbations. Specifically, we consider belief propagation (BP) for LDPC codes; the error correcting code transformer, BP and neural BP (NBP) for polar codes; and neural BCJR for convolutional codes. We demonstrate that the proposed friendly attack method can improve reliability across different channels, modulations, codes, and decoders. This method allows us to increase the reliability of communication with a legacy receiver by simply modifying the transmitted codeword appropriately.
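A hedged sketch of the "friendly" perturbation idea (not the paper's exact procedure): the modulated codeword is iteratively nudged in the direction that reduces a differentiable decoder's loss, i.e., the opposite of an adversarial attack, and then renormalized to respect an average power constraint. The callable `decoder_loss_fn` and the step schedule are assumptions.

```python
import torch

def friendly_perturb(decoder_loss_fn, codeword, steps=10, alpha=0.01):
    """codeword: modulated symbols; decoder_loss_fn: differentiable decoding loss of the receiver."""
    x = codeword.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = decoder_loss_fn(x)
        grad = torch.autograd.grad(loss, x)[0]
        with torch.no_grad():
            x = x - alpha * grad.sign()          # descend the decoder loss (fast-gradient style)
            # renormalize to unit average power so the input power constraint is not violated
            x = x * torch.sqrt(torch.tensor(float(x.numel())) / x.pow(2).sum())
    return x.detach()
```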



Paperid:1475
Authors:Muhammad Rifki Kurniawan, Xiang Song, Zhiheng Ma, Yuhang He, Yihong Gong, Yang Qi, Xing Wei
Xi'an Jiaotong University, Xi'an Jiaotong University, Shenzhen Institute of Advanced Technology,Chinese Academy of Sciences, Xi’an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
Recent studies have demonstrated the potency of leveraging prompts in Transformers for continual learning (CL). Nevertheless, employing a discrete key-prompt bottleneck can lead to selection mismatches and inappropriate prompt associations during testing. Furthermore, this approach hinders adaptive prompting due to the lack of shareability among nearly identical instances at a more granular level. To address these challenges, we introduce the Evolving Parameterized Prompt Memory (EvoPrompt), a novel method involving adaptive and continuous prompting attached to a pre-trained Vision Transformer (ViT), conditioned on the specific instance. We formulate a continuous prompt function as a neural bottleneck and encode the collection of prompts in network weights. We establish a paired prompt memory system consisting of a stable reference and a flexible working prompt memory. Inspired by linear mode connectivity, we progressively fuse the working prompt memory and the reference prompt memory during inter-task periods, resulting in a continually evolved prompt memory. This fusion involves aligning functionally equivalent prompts using optimal transport and aggregating them in parameter space with an adjustable bias based on prompt node attribution. Additionally, to enhance backward compatibility, we propose compositional classifier initialization, which leverages prior prototypes from pre-trained models to guide the initialization of new classifiers in a subspace-aware manner. Comprehensive experiments validate that our approach achieves state-of-the-art performance in both class and domain incremental learning scenarios.



Paperid:1476
Authors:Joonwoo Kwon, Sooyoung Kim, Yuewei Lin, Shinjae Yoo, Jiook Cha
Seoul National University, Seoul National University, Brookhaven National Laboratory, Brookhaven National Laboratory, Seoul National University
Abstract:
Neural style transfer (NST) has evolved significantly in recent years. Yet, despite its rapid progress and advancement, existing NST methods either struggle to transfer aesthetic information from a style effectively or suffer from high computational costs and inefficiencies in feature disentanglement due to using pre-trained models. This work proposes a lightweight but effective model, AesFA---Aesthetic Feature-Aware NST. The primary idea is to decompose the image via its frequencies to better disentangle aesthetic styles from the reference image while training the entire model in an end-to-end manner to exclude pre-trained models at inference completely. To improve the network's ability to extract more distinct representations and further enhance the stylization quality, this work introduces a new aesthetic feature contrastive loss. Extensive experiments and ablations show the approach not only outperforms recent NST methods in terms of stylization quality, but also achieves faster inference. Codes are available at https://github.com/Sooyyoungg/AesFA.



Paperid:1477
Authors:Ella Lan
The Harker School
Abstract:
Diabetes mellitus is a global concern, and early detection can prevent serious complications. 50% of those with diabetes live undiagnosed, disproportionately afflicting low-income groups. Non-invasive methods have emerged for timely detection; however, their limited accuracy constrains clinical usage. In this research, we present a novel Higher Dimensional Transformer (HDformer), the first Transformer-based architecture which utilizes long-range photoplethysmography (PPG) to detect diabetes. The long-range PPG maximizes signal contextual information when compared to the less-than-30-second signals commonly used in existing research. To increase the computational efficiency of HDformer’s long-range processing, a new attention module, Time Square Attention (TSA), is introduced to achieve linear computational complexity with respect to the token volume while retaining the local/global dependencies. TSA converts the 1D inputs into 2D representations, grouping the adjacent points into a single 2D token. It then generates dynamic patches and feeds them into a gated mixture-of-experts (MoE) network, optimizing the learning on different attention areas. HDformer achieves state-of-the-art results (sensitivity 98.4, accuracy 97.3, specificity 92.8, AUC 0.929) on the standard MIMIC-III dataset, surpassing existing research. Furthermore, we develop an end-to-end solution where a low-cost wearable is prototyped to connect with the HDformer in the Cloud via a mobile app. This scalable, convenient, and affordable approach provides instantaneous detection and continuous monitoring for individuals. It aids doctors in easily screening for diabetes and safeguards underprivileged communities. The enhanced versatility of HDformer allows for efficient processing and learning of long-range signals in general one-dimensional time-series sequences, particularly for all biomedical waveforms.



Paperid:1478
Authors:Guipeng Lan, Shuai Xiao, Jiachen Yang, Jiabao Wen
Tianjin University, Tianjin University, Tianjin University, Tianjin University
Abstract:
How to balance the diversity and quality of results from generative models through perception rectification poses a significant challenge. Abnormal perception in generative models is typically caused by two factors: inadequate model structure and imbalanced data distribution. In response to this issue, we propose the dynamic model perception rectification algorithm (DMPRA) for generalized generative models. The core idea is to gain a comprehensive perception of the data in the generative model by appropriately highlighting the low-density samples in the perception space, also known as the minor group samples. The entire process can be summarized as "search-evaluation-adjustment". To identify low-density regions in the data manifold within the perception space of generative models, we introduce a filtering method based on extended neighborhood sampling. Based on the informational value of samples from low-density regions, our proposed mechanism generates informative weights to assess the significance of these samples in correcting the models' perception. By using dynamic adjustment, DMPRA ensures simultaneous enhancement of diversity and quality in the presence of imbalanced data distribution. Experimental results indicate that the algorithm effectively improves Generative Adversarial Nets (GANs), Normalizing Flows (Flows), Variational Auto-Encoders (VAEs), and Diffusion Models (Diffusion).



Paperid:1479
Authors:Linh Le, Genghong Zhao, Xia Zhang, Guido Zuccon, Gianluca Demartini
The University of Queensland, Neusoft Research of Intelligent Healthcare Technology, Co. Ltd., Neusoft Corporation, China, The University of Queensland, The University of Queensland
Abstract:
In the machine learning field, the challenge of effectively learning with limited data has become increasingly crucial. Active Learning (AL) algorithms play a significant role in this by enhancing model performance. We introduce a novel AL algorithm, termed Co-learning (CoLAL), designed to select the most diverse and representative samples within a training dataset. This approach utilizes noisy labels and predictions made by the primary model on unlabeled data. By leveraging a probabilistic graphical model, we combine two multi-class classifiers into a binary one. This classifier determines whether both the main and the peer models agree on a prediction. If they do, the unlabeled sample is assumed to be easy to classify and thus not beneficial for increasing the target model's performance. We prioritize data that represent the unlabeled set without overlapping decision boundaries. The discrepancies between these boundaries can be estimated by the probability that the two models yield the same prediction. Through theoretical analysis and experimental validation, we reveal that the integration of noisy labels into the peer model effectively identifies the target model's potential inaccuracies. We evaluated the CoLAL method across seven benchmark datasets: four text datasets (AGNews, DBPedia, PubMed, SST-2) with text-based state-of-the-art (SOTA) baselines, and three image datasets (CIFAR100, MNIST, OpenML-155) with computer vision SOTA baselines. The results show that our CoLAL method significantly outperforms existing SOTA in text-based AL, and is competitive with SOTA image-based AL techniques.



Paperid:1480
Authors:Byung Hyun Lee, Min-hwan Oh, Se Young Chun
Seoul National University, Seoul National University, Seoul National University
Abstract:
Task-free online continual learning (TF-CL) is a challenging problem where the model incrementally learns tasks without explicit task information. Although training with the entire data from the past, present, and future is considered the gold standard, naive TF-CL approaches that use only the current samples may conflict with learning from samples in the future, leading to catastrophic forgetting and poor plasticity. Thus, proactive consideration of unseen future samples in TF-CL becomes imperative. Motivated by this intuition, we propose a novel TF-CL framework considering future samples and show that injecting adversarial perturbations on both input data and decision-making is effective. Then, we propose a novel method named Doubly Perturbed Continual Learning (DPCL) to efficiently implement these input and decision-making perturbations. Specifically, for input perturbation, we propose an approximate perturbation method that injects noise into the input data as well as the feature vector and then interpolates the two perturbed samples. For decision-making process perturbation, we devise multiple stochastic classifiers. We also investigate a memory management scheme and learning rate scheduling reflecting our proposed double perturbations. We demonstrate that our proposed method outperforms the state-of-the-art baseline methods by large margins on various TF-CL benchmarks.
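A rough sketch of the input-perturbation step as described above (illustrative; `encoder`, the noise scales, and the mixing weight are assumptions): the raw input and the feature vector are perturbed separately, and the two perturbed samples are interpolated in feature space.

```python
import torch

def doubly_perturbed_feature(encoder, x, sigma_in=0.05, sigma_feat=0.05, lam=0.5):
    """Return an interpolation of an input-perturbed and a feature-perturbed embedding of x."""
    f_input_noise = encoder(x + sigma_in * torch.randn_like(x))       # perturb the input
    f_clean = encoder(x)
    f_feat_noise = f_clean + sigma_feat * torch.randn_like(f_clean)   # perturb the feature
    return lam * f_input_noise + (1 - lam) * f_feat_noise             # interpolate the two samples
```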



Paperid:1481
Authors:Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, Eunhyeok Park
POSTECH, POSTECH, SqueezeBits Inc., SqueezeBits Inc., POSTECH
Abstract:
Large language models (LLMs) with hundreds of billions of parameters require powerful server-grade GPUs for inference, limiting their practical deployment. To address this challenge, we introduce the outlier-aware weight quantization (OWQ) method, which aims to minimize an LLM's footprint through low-precision representation. OWQ prioritizes a small subset of structured weights sensitive to quantization, storing them in high precision, while applying highly tuned quantization to the remaining dense weights. This sensitivity-aware mixed-precision scheme reduces the quantization error notably, and extensive experiments demonstrate that 3.1-bit models using OWQ perform comparably to 4-bit models optimized by OPTQ. Furthermore, OWQ incorporates a parameter-efficient fine-tuning method for task-specific adaptation, called weak column tuning (WCT), enabling accurate task-specific LLM adaptation with minimal memory overhead in the optimized format. OWQ represents a notable advancement in the flexibility, efficiency, and practicality of the LLM optimization literature. The source code is available at https://github.com/xvyaward/owq.
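An illustrative outlier-aware quantization sketch (not the released OWQ code): columns with the largest sensitivity scores are kept in full precision, while the remaining dense weights are quantized per output channel to a low bit-width. The per-channel absmax scaling and the sensitivity input are assumptions for illustration.

```python
import torch

def owq_like_quantize(W, sensitivity, n_keep=8, bits=3):
    """W: (out, in) weight matrix; sensitivity: per-input-column scores of shape (in,)."""
    keep = torch.topk(sensitivity, n_keep).indices           # high-precision "weak" columns
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax          # per-output-channel scale
    W_q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax) * scale
    W_q[:, keep] = W[:, keep]                                 # restore the sensitive columns
    return W_q, keep
```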



Paperid:1482
Authors:Danyeong Lee, Dohoon Lee, Dongmin Bang, Sun Kim
Interdisciplinary Program in Bioinformatics, Seoul National University, Bioinformatics Institute, Seoul National University BK21 FOUR Intelligence Computing, Seoul National University, Interdisciplinary Program in Bioinformatics, Seoul National University AIGENDRUG Co., Ltd., Interdisciplinary Program in Bioinformatics, Seoul National University AIGENDRUG Co., Ltd. Department of Computer Science and Engineering, Seoul National University Interdisciplinary Program in Artificial Intelligence, Seoul National University
Abstract:
The generation of energetically optimal 3D molecular conformers is crucial in cheminformatics and drug discovery. While deep generative models have been utilized for direct generation in Euclidean space, this approach encounters challenges, including the complexity of navigating a vast search space. Recent generative models that implement simplifications to circumvent these challenges have achieved state-of-the-art results, but this simplified approach unavoidably creates a gap between the generated conformers and the ground-truth conformational landscape. To bridge this gap, we introduce DiSCO: Diffusion Schrödinger Bridge for Molecular Conformer Optimization, a novel diffusion framework that enables direct learning of nonlinear diffusion processes in prior-constrained Euclidean space for the optimization of 3D molecular conformers. Through the incorporation of an SE(3)-equivariant Schrödinger bridge, we establish the roto-translational equivariance of the generated conformers. Our framework is model-agnostic and offers an easily implementable solution for the post hoc optimization of conformers produced by any generation method. Through comprehensive evaluations and analyses, we establish the strengths of our framework, substantiating the application of the Schrödinger bridge to molecular conformer optimization. First, our approach consistently outperforms four baseline approaches, producing conformers with higher diversity and improved quality. Then, we show that the intermediate conformers generated during our diffusion process exhibit valid and chemically meaningful characteristics. We also demonstrate the robustness of our method when starting from conformers of diverse quality, including those unseen during training. Lastly, we show that the precise generation of low-energy conformers via our framework helps enhance the downstream prediction of molecular properties. The code is available at https://github.com/Danyeong-Lee/DiSCO.



Paperid:1483
Authors:Dongjin Lee, Juho Lee, Kijung Shin
School of Electrical Engineering, Daejeon, South Korea, Kim Jaechul Graduate School of AI, KAIST, Seoul, South Korea, School of Electrical Engineering, Daejeon, South Korea Kim Jaechul Graduate School of AI, KAIST, Seoul, South Korea
Abstract:
Real-world graphs are dynamic, constantly evolving with new interactions, such as financial transactions in financial networks. Temporal Graph Neural Networks (TGNNs) have been developed to effectively capture the evolving patterns in dynamic graphs. While these models have demonstrated their superiority, being widely adopted in various important fields, their vulnerabilities against adversarial attacks remain largely unexplored. In this paper, we propose T-SPEAR, a simple and effective adversarial attack method for link prediction on continuous-time dynamic graphs, focusing on investigating the vulnerabilities of TGNNs. Specifically, before the training procedure of a victim model, which is a TGNN for link prediction, we inject edge perturbations into the data that are unnoticeable in terms of the four constraints we propose, yet effective enough to cause malfunction of the victim model. Moreover, we propose a robust training approach, T-SHIELD, to mitigate the impact of adversarial attacks. By using edge filtering and enforcing temporal smoothness on node embeddings, we enhance the robustness of the victim model. Our experimental study shows that T-SPEAR significantly degrades the victim model's performance on link prediction tasks, and even more, our attacks are transferable to other TGNNs, which differ from the victim model assumed by the attacker. Moreover, we demonstrate that T-SHIELD effectively filters out adversarial edges and exhibits robustness against adversarial attacks, surpassing the link prediction performance of the naive TGNN by up to 11.2% under T-SPEAR. The code and datasets are available at https://github.com/wooner49/T-spear-shield



Paperid:1484
Authors:Jongyeong Lee, Chao-Kai Chiang, Masashi Sugiyama
The University of Tokyo RIKEN AIP, The University of Tokyo, RIKEN AIP The University of Tokyo
Abstract:
Thompson sampling (TS) has been known for its outstanding empirical performance supported by theoretical guarantees across various reward models in the classical stochastic multi-armed bandit problems. Nonetheless, its optimality is often restricted to specific priors due to the common observation that TS is fairly insensitive to the choice of the prior when it comes to asymptotic regret bounds. However, when the model contains multiple parameters, the optimality of TS highly depends on the choice of priors, which casts doubt on the generalizability of previous findings to other models. To address this gap, this study explores the impact of selecting noninformative priors, offering insights into the performance of TS when dealing with new models that lack theoretical understanding. We first extend the regret analysis of TS to the model of uniform distributions with unknown supports, which would be the simplest non-regular model. Our findings reveal that changing noninformative priors can significantly affect the expected regret, aligning with previously known results in other multi-parameter bandit models. Although the uniform prior is shown to be optimal, we highlight the inherent limitation of its optimality, which is limited to specific parameterizations and emphasizes the significance of the invariance property of priors. In light of this limitation, we propose a slightly modified TS-based policy, called TS with Truncation (TS-T), which can achieve asymptotic optimality for the Gaussian models and the uniform models by using the reference prior and the Jeffreys prior that are invariant under one-to-one reparameterizations. This policy provides an alternative approach to achieving optimality by employing fine-tuned truncation, which would be much easier than hunting for optimal priors in practice.



Paperid:1485
Authors:Joongkyu Lee, Seung Joon Park, Yunhao Tang, Min-hwan Oh
Seoul National University, Samsung Research, DeepMind, Seoul National University
Abstract:
In reinforcement learning, temporal abstraction in the action space, exemplified by action repetition, is a technique to facilitate policy learning through extended actions. However, a primary limitation in previous studies of action repetition is its potential to degrade performance, particularly when suboptimal actions are repeated. This issue often negates the advantages of action repetition. To address this, we propose a novel algorithm named Uncertainty-aware Temporal Extension (UTE). UTE employs ensemble methods to accurately measure uncertainty during action extension. This feature allows policies to strategically choose between emphasizing exploration or adopting an uncertainty-averse approach, tailored to their specific needs. We demonstrate the effectiveness of UTE through experiments in Gridworld and Atari 2600 environments. Our findings show that UTE outperforms existing action repetition algorithms, effectively mitigating their inherent limitations and significantly enhancing policy learning efficiency.
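A hedged sketch of uncertainty-aware action extension (my illustrative reading, not the UTE implementation): an ensemble of Q-heads scores each candidate repetition length, and a coefficient `beta` trades off the ensemble mean against its spread, so positive values favor exploration and negative values are uncertainty-averse. The `q(state, action, L)` signature is an assumption.

```python
import torch

def choose_extension(q_ensemble, state, action, lengths, beta=0.5):
    """q_ensemble: list of nets mapping (state, action, repeat_length) -> scalar Q value."""
    best_len, best_score = None, -float("inf")
    for L in lengths:
        qs = torch.stack([q(state, action, L) for q in q_ensemble])
        score = qs.mean() + beta * qs.std()    # beta > 0 explores, beta < 0 avoids uncertain extensions
        if score.item() > best_score:
            best_len, best_score = L, score.item()
    return best_len
```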



Paperid:1486
Authors:JunHoo Lee, Yearim Kim, Hyunho Lee, Nojun Kwak
Seoul National University, Seoul National University, Seoul National University, Seoul National University
Abstract:
Although meta-learning shows promising performance in the realm of rapid adaptability, it is constrained by fixed cardinality. When faced with tasks of varying cardinalities that were unseen during training, the model loses its ability. In this paper, we address and resolve this challenge by harnessing "label equivalence", which emerges from stochastic numeric label assignments during episodic task sampling. Questioning what defines "true" meta-learning, we introduce the "any-way" learning paradigm, an innovative model training approach that liberates the model from fixed-cardinality constraints. Surprisingly, this model not only matches but often outperforms traditional fixed-way models in terms of performance, convergence speed, and stability. This disrupts established notions about domain generalization. Furthermore, we argue that the inherent label equivalence naturally lacks semantic information. To bridge this semantic information gap arising from label equivalence, we further propose a mechanism for infusing semantic class information into the model. This would enhance the model's comprehension and functionality. Experiments conducted on renowned architectures like MAML and ProtoNet affirm the effectiveness of our method.



Paperid:1487
Authors:Kyungbok Lee, Myunghee Cho Paik, Min-hwan Oh, Gi-Soo Kim
Department of Statistics, Seoul National University, Department of Statistics, Seoul National University Shepherd23 Inc., Graduate School of Data Science, Seoul National University, Department of Industrial Engineering, Ulsan National Institute of Science and Technology
Abstract:
We study a novel variant of a contextual bandit problem with multidimensional reward feedback formulated as a mixed-effects model, where the correlations between multiple feedback are induced by sharing stochastic coefficients called random effects. We propose a novel algorithm, Mixed-Effects Contextual UCB (ME-CUCB), achieving a $\tilde{O}(d\sqrt{mT})$ regret bound after $T$ rounds, where $d$ is the dimension of contexts and $m$ is the dimension of outcomes, with either known or unknown covariance structure. This is a tighter regret bound than that of the naive canonical linear bandit algorithm ignoring the correlations among rewards. We prove a lower bound of $\Omega(d\sqrt{mT})$ matching the upper bound up to logarithmic factors. To our knowledge, this is the first work providing a regret analysis for mixed-effects models and algorithms involving weighted least-squares estimators. Our theoretical analysis faces a significant technical challenge in that the error terms do not constitute martingales since the weights depend on the rewards. We overcome this challenge by using covering numbers, of theoretical interest in its own right. We provide numerical experiments demonstrating the advantage of our proposed algorithm, supporting the theoretical claims.



Paperid:1488
Authors:Sangho Lee, Hayun Lee, Dongkun Shin
Sungkyunkwan University, Sungkyunkwan University, Sungkyunkwan University
Abstract:
Transformer-based models have demonstrated remarkable performance in various domains, including natural language processing, image processing, and generative modeling. The most significant contributor to the successful performance of Transformer models is the self-attention mechanism, which allows for a comprehensive understanding of the interactions between tokens in the input sequence. However, there is a well-known scalability issue, the quadratic dependency (i.e., O(n^2)) of self-attention operations on the input sequence length n, making the handling of lengthy sequences challenging. To address this limitation, there has been a surge of research on efficient transformers, aiming to alleviate the quadratic dependency on the input sequence length. Among these, the Nyströmformer, which utilizes the Nyström method to decompose the attention matrix, achieves superior performance in both accuracy and throughput. However, its landmark selection exhibits redundancy, and the model incurs computational overhead when calculating the pseudo-inverse matrix. We propose a novel Nyström method-based transformer, called Proxyformer. Unlike the traditional approach of selecting landmarks from input tokens, Proxyformer utilizes trainable neural memory, called proxy tokens, for landmarks. By integrating contrastive learning, input injection, and a specialized dropout for the decomposed matrix, Proxyformer achieves top-tier performance on long-sequence tasks in the Long Range Arena benchmark.
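A minimal Nyström-style attention sketch with trainable proxy landmarks (a sketch of the general idea under stated assumptions, not Proxyformer itself): the n x n attention matrix is approximated through m << n learnable proxy tokens, giving roughly linear cost in the sequence length.

```python
import torch
import torch.nn as nn

class ProxyNystromAttention(nn.Module):
    def __init__(self, dim, num_proxies=32):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_proxies, dim) * 0.02)  # trainable landmarks
        self.scale = dim ** -0.5

    def forward(self, q, k, v):                              # q, k, v: (n, dim)
        p = self.proxies                                     # (m, dim) proxy queries/keys
        A1 = torch.softmax(q @ p.T * self.scale, dim=-1)     # (n, m) queries vs. proxies
        A2 = torch.softmax(p @ p.T * self.scale, dim=-1)     # (m, m) proxies vs. proxies
        A3 = torch.softmax(p @ k.T * self.scale, dim=-1)     # (m, n) proxies vs. keys
        return A1 @ torch.linalg.pinv(A2) @ (A3 @ v)         # Nystrom approximation of softmax(QK^T)V
```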



Paperid:1489
Authors:Yunsung Lee, JinYoung Kim, Hyojun Go, Myeongho Jeong, Shinhyeok Oh, Seungtaek Choi
Wrtn Technologies, Twelvelabs, Twelvelabs, Yanolja, Riiid AI Research, Yanolja
Abstract:
In this paper, we address the performance degradation of efficient diffusion models by introducing Multi-architecturE Multi-Expert diffusion models (MEME). We identify the need for tailored operations at different time-steps in diffusion processes and leverage this insight to create compact yet high-performing models. MEME assigns distinct architectures to different time-step intervals, balancing convolution and self-attention operations based on observed frequency characteristics. We also introduce a soft interval assignment strategy for comprehensive training. Empirically, MEME operates 3.3 times faster than baselines while improving image generation quality (FID scores) by 0.62 (FFHQ) and 0.37 (CelebA). Though we validate the effectiveness of assigning a more optimal architecture per time-step, where efficient models outperform larger models, we argue that MEME opens a new design choice for diffusion models that can be easily applied in other scenarios, such as large multi-expert models.
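A sketch of multi-expert routing by diffusion timestep (the expert modules, interval boundaries, and hard routing are hypothetical; the soft interval assignment described above is not shown):

```python
import torch.nn as nn

class TimestepExperts(nn.Module):
    def __init__(self, experts, boundaries):
        """experts: list of denoiser networks; boundaries: ascending timestep cutoffs, e.g. [250, 500, 750, 1000]."""
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.boundaries = boundaries

    def forward(self, x, t, cond=None):
        for i, b in enumerate(self.boundaries):
            if t < b:                                 # route to the expert owning this timestep interval
                return self.experts[i](x, t, cond)
        return self.experts[-1](x, t, cond)
```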



Paperid:1490
Authors:Bingheng Li, Erlin Pan, Zhao Kang
University of Electronic Science and Technology of China, Chengdu, Sichuan, China, University of Electronic Science and Technology of China, Chengdu, Sichuan, China, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
Abstract:
Recently, many carefully designed graph representation learning methods have achieved impressive performance on either strongly heterophilic or homophilic graphs, but not both. Therefore, they are incapable of generalizing well across real-world graphs with different levels of homophily. This is attributed to their neglect of homophily in heterophilic graphs, and vice versa. In this paper, we propose a two-fold filtering mechanism to mine homophily in heterophilic graphs, and vice versa. In particular, we extend the graph heat equation to perform heterophilic aggregation of global information from a long distance. The resultant filter can be exactly approximated by the Poisson-Charlier (PC) polynomials. To further exploit information at multiple orders, we introduce a powerful graph convolution, PC-Conv, and its instantiation PCNet for the node classification task. Compared to state-of-the-art GNNs, PCNet shows competitive performance on well-known homophilic and heterophilic graphs. Our implementation is available at https://github.com/uestclbh/PC-Conv.



Paperid:1491
Authors:Chaohua Li, Enhao Zhang, Chuanxing Geng, Songcan Chen
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence
Abstract:
In open-set recognition (OSR), a promising strategy is to exploit pseudo-unknown data outside the given K known classes as an additional (K+1)-th class to explicitly model potential open space. However, treating unknown classes without distinction is unfair to them relative to known classes due to the category-agnostic and scale-agnostic nature of the unknowns. This inevitably not only disrupts the inherent distributions of unknown classes but also incurs both class-wise and instance-wise imbalances between known and unknown classes. Ideally, the OSR problem should model the whole class space as K+∞, but enumerating all unknowns is impractical. Since the core of OSR is to effectively model the boundaries of known classes, focusing only on the unknowns near the boundaries of the targeted known classes seems sufficient. Thus, as a compromise, we convert the open classes from infinite to K, with a novel concept Target-Aware Universum (TAU), and propose a simple yet effective framework Dual Contrastive Learning with Target-Aware Universum (DCTAU). In detail, guided by the targeted known classes, TAU automatically expands the unknown classes from the previous 1 to K, effectively alleviating the distribution disruption and the imbalance issues mentioned above. Then, a novel Dual Contrastive (DC) loss is designed, where all instances, irrespective of known or TAU, are considered as positives to contrast with their respective negatives. Experimental results indicate DCTAU sets a new state-of-the-art.



Paperid:1492
Authors:Chen Li, Yoshihiro Yamanishi
Graduate School of Informatics, Nagoya University, Chikusa, Nagoya, 464-8601, Japan, Graduate School of Informatics, Nagoya University, Chikusa, Nagoya, 464-8601, Japan
Abstract:
The de novo generation of hit-like molecules that show bioactivity and drug-likeness is an important task in computer-aided drug discovery. Although artificial intelligence can generate molecules with desired chemical properties, most previous studies have ignored the influence of disease-related cellular environments. This study proposes a novel deep generative model called GxVAEs to generate hit-like molecules from gene expression profiles by leveraging two joint variational autoencoders (VAEs). The first VAE, ProfileVAE, extracts latent features from gene expression profiles. The extracted features serve as the conditions that guide the second VAE, called MolVAE, in generating hit-like molecules. GxVAEs bridge the gap between molecular generation and the cellular environment in a biological system, and produce molecules that are biologically meaningful in the context of specific diseases. Experiments and case studies on the generation of therapeutic molecules show that GxVAEs outperform current state-of-the-art baselines and yield hit-like molecules with potential bioactivity and drug-like properties. We were able to successfully generate potential molecular structures with therapeutic effects for various diseases from patients’ disease profiles.



Paperid:1493
Authors:Depeng Li, Tianqi Wang, Junwei Chen, Qining Ren, Kenji Kawaguchi, Zhigang Zeng
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, School of Computing, National University of Singapore, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Abstract:
Deep neural networks are susceptible to catastrophic forgetting when trained on sequential tasks. Various continual learning (CL) methods often rely on exemplar buffers and/or network expansion for balancing model stability and plasticity, which, however, compromises their practical value due to privacy and memory concerns. Instead, this paper considers a strict yet realistic setting, where the training data from previous tasks is unavailable and the model size remains relatively constant during sequential training. To achieve such desiderata, we propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion. This is achieved by the synergy between two key components: HSIC-Bottleneck Orthogonalization (HBO) implements non-overwritten parameter updates mediated by the Hilbert-Schmidt independence criterion in an orthogonal space, and EquiAngular Embedding (EAE) enhances decision boundary adaptation between old and new tasks with predefined basis vectors. Extensive experiments demonstrate that our method achieves competitive accuracy performance, even with zero exemplar buffer and a model size of only 1.02x the base model.



Paperid:1494
Authors:Fengpeng Li, Kemou Li, Jinyu Tian, Jiantao Zhou
University of Macau, University of Macau, Macau University of Science and Technology, University of Macau
Abstract:
The deep model training procedure requires large-scale datasets of annotated data. Due to the difficulty of annotating a large number of samples, label noise caused by incorrect annotations is inevitable, resulting in low model performance and poor model generalization. To combat label noise, current methods usually select clean samples based on the small-loss criterion and use these samples for training. Because some noisy samples resemble clean ones, these small-loss criterion-based methods are still affected by label noise. To address this issue, in this work, we propose Regroup Median Loss (RML) to reduce the probability of selecting noisy samples and correct the losses of noisy samples. RML randomly selects samples with the same label as the training samples based on a new loss processing method. Then, we combine the stable mean loss and the robust median loss through a proposed regrouping strategy to obtain a robust loss estimation for noisy samples. To further improve the model performance against label noise, we propose a new sample selection strategy and build a semi-supervised method based on RML. Compared to state-of-the-art methods, for both traditionally trained and semi-supervised models, RML achieves a significant improvement on synthetic and complex real-world datasets. The source code is available at https://github.com/Feng-peng-Li/Regroup-Loss-Median-to-Combat-Label-Noise.
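A hedged sketch of a regroup-and-median loss estimate (my reading of the idea above, not the released RML code): for each sample, a random group of peers with the same label is drawn, and the stable mean loss is blended with the robust median loss of that group. The group size and mixing weight are assumptions.

```python
import torch
import torch.nn.functional as F

def regroup_median_loss(logits, labels, group_size=8, alpha=0.5):
    """logits: (n, C); labels: (n,). Returns a scalar robust loss estimate."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    robust = []
    for i in range(len(labels)):
        same = (labels == labels[i]).nonzero(as_tuple=True)[0]        # peers sharing the label
        group = per_sample[same[torch.randperm(len(same))][:group_size]]
        robust.append(alpha * group.mean() + (1 - alpha) * group.median())
    return torch.stack(robust).mean()
```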



Paperid:1495
Authors:Fuhao Li, Ziyang Gong, Yupeng Deng, Xianzheng Ma, Renrui Zhang, Zhenming Ji, Xiangwei Zhu, Hong Zhang
Wuhan University of Science and Technology, Sun Yat-sen University, National University of Singapore, University of Oxford, Shanghai AI Lab, Sun Yat-sen University, Sun Yat-sen University, Wuhan University of Science and Technology
Abstract:
Although recent methods in Unsupervised Domain Adaptation (UDA) have achieved success in segmenting rainy or snowy scenes by improving consistency, they face limitations when dealing with more challenging scenarios like foggy and night scenes. We argue that these prior methods excessively focus on weather-specific features in adverse scenes, which exacerbates the existing domain gaps. To address this issue, we propose a new metric to evaluate the severity of all adverse scenes and offer a novel perspective that enables task unification across all adverse scenarios. Our method focuses on Severity, allowing our model to learn more consistent features and facilitate domain distribution alignment, thereby alleviating domain gaps. Unlike the vague descriptions of consistency in previous methods, we introduce Cross-domain Consistency, which is quantified using the Structural Similarity Index Measure (SSIM) to measure the distance between the source and target domains. Specifically, our unified model consists of two key modules: the Merging Style Augmentation Module (MSA) and the Severity Perception Mask Module (SPM). The MSA module transforms all adverse scenes into augmented scenes, effectively eliminating weather-specific features and enhancing Cross-domain Consistency. The SPM module incorporates a Severity Perception mechanism, guiding a Mask operation that enables our model to learn highly consistent features from the augmented scenes. Our unified framework, named PASS (Parsing All adverSe Scenes), achieves significant performance improvements over state-of-the-art methods on widely used benchmarks for all adverse scenes. Notably, the performance of PASS is superior to Semi-Unified models and even surpasses weather-specific models.



Paperid:1496
Authors:Jason Chun Lok Li, Chang Liu, Binxiao Huang, Ngai Wong
The University of Hong Kong, The University of Hong Kong, The University of Hong Kong, The University of Hong Kong
Abstract:
Existing approaches to Implicit Neural Representation (INR) can be interpreted as a global scene representation via a linear combination of Fourier bases of different frequencies. However, such universal basis functions can limit the representation capability in local regions where a specific component is unnecessary, resulting in unpleasant artifacts. To this end, we introduce a learnable spatial mask that effectively dispatches distinct Fourier bases into respective regions. This translates into collaging Fourier patches, thus enabling an accurate representation of complex signals. Comprehensive experiments demonstrate the superior reconstruction quality of the proposed approach over existing baselines across various INR tasks, including image fitting, video representation, and 3D shape representation. Our method outperforms all other baselines, improving the image-fitting PSNR by over 3 dB and improving 3D reconstruction to 98.81 IoU and 0.0011 Chamfer Distance.
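As one way to picture the idea, the sketch below (hypothetical, PyTorch) gates random Fourier features with a coordinate-conditioned mask so that different bases are active in different spatial regions; the network sizes and sigmoid gating are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class MaskedFourierINR(nn.Module):
    """2D implicit representation whose Fourier features are gated by a spatial mask."""
    def __init__(self, n_freqs=64, hidden=128, out_dim=3):
        super().__init__()
        self.B = nn.Parameter(torch.randn(2, n_freqs))              # frequency matrix
        self.mask_net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                      nn.Linear(hidden, n_freqs))   # per-location mask logits
        self.head = nn.Sequential(nn.Linear(2 * n_freqs, hidden), nn.ReLU(),
                                  nn.Linear(hidden, out_dim))

    def forward(self, coords):                                      # coords: (N, 2)
        feats = torch.cat([torch.sin(coords @ self.B),
                           torch.cos(coords @ self.B)], dim=-1)
        gate = torch.sigmoid(self.mask_net(coords))                 # which bases are active here
        feats = feats * gate.repeat(1, 2)                           # gate sin and cos parts alike
        return self.head(feats)
```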



Paperid:1497
Authors:Jian Li, Yong Liu, Weiping Wang
Institute of Information Engineering, Chinese Academy of Sciences, Gaoling School of Artificial Intelligence, Renmin University of China, Institute of Information Engineering, Chinese Academy of Sciences
Abstract:
Overparameterization often leads to benign overfitting, where deep neural networks can be trained to overfit the training data but still generalize well on unseen data. However, existing analyses lack a generalized asymptotic framework for nonlinear regression and connections to conventional complexity notions. In this paper, we propose a generalized high-dimensional analysis for nonlinear regression models, covering various nonlinear feature mapping methods and subsampling. Specifically, we first provide an implicit regularization parameter and asymptotic equivalents related to a classical complexity notion, i.e., the effective dimension. We then present a high-dimensional analysis for nonlinear ridge regression and extend it to ridgeless regression in the under-parameterized and over-parameterized regimes, respectively. We find that the limiting risks decrease with the effective dimension. Motivated by these theoretical findings, we propose an algorithm, namely RFRed, to improve generalization ability. Finally, we validate our theoretical findings and the proposed algorithm through several experiments.



Paperid:1498
Authors:Jian Li, Yong Liu, Weiping Wang
Institute of Information Engineering, Chinese Academy of Sciences, Gaoling School of Artificial Intelligence, Renmin University of China, Institute of Information Engineering, Chinese Academy of Sciences
Abstract:
Recent Newton-type federated learning algorithms have demonstrated linear convergence with respect to the communication rounds. However, communicating Hessian matrices is often infeasible due to their quadratic communication complexity. In this paper, we introduce a novel approach to tackle this issue while still achieving fast convergence rates. Our proposed method, named Federated Newton Sketch (FedNS), approximates the centralized Newton's method by communicating the sketched square-root Hessian instead of the exact Hessian. To enhance communication efficiency, we reduce the sketch size to match the effective dimension of the Hessian matrix. We provide a convergence analysis based on statistical learning for the federated Newton sketch approaches. Specifically, our approaches reach super-linear convergence rates w.r.t. the communication rounds for the first time. We validate the effectiveness of our algorithms through various experiments, which coincide with our theoretical findings.
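For intuition, here is a hypothetical NumPy sketch of communicating a sketched square-root Hessian in the least-squares case, where the client data matrix itself is a square-root factor of the Hessian; the Gaussian sketch, the sketch size, and the ridge term are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def sketched_sqrt_hessian(X, sketch_size, rng):
    # For least squares the Hessian is X^T X, so X is a square-root factor.
    n, d = X.shape
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)  # Gaussian sketch
    return S @ X                          # (sketch_size, d) communicated instead of (d, d)

def server_newton_step(client_sketches, global_grad, reg=1e-3):
    # The server rebuilds an approximate global Hessian from the sketches.
    d = global_grad.shape[0]
    H = sum(Sk.T @ Sk for Sk in client_sketches) + reg * np.eye(d)
    return np.linalg.solve(H, global_grad)   # approximate Newton direction

rng = np.random.default_rng(0)
clients = [rng.standard_normal((500, 20)) for _ in range(4)]
sketches = [sketched_sqrt_hessian(X, sketch_size=50, rng=rng) for X in clients]
step = server_newton_step(sketches, global_grad=rng.standard_normal(20))
```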



Paperid:1499
Authors:Jiangmeng Li, Yifan Jin, Hang Gao, Wenwen Qiang, Changwen Zheng, Fuchun Sun
Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences State Key Laboratory of Intelligent Game, Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences, Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences, Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, Science & Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences Tsinghua University
Abstract:
Graph contrastive learning (GCL) aims to align positive features while differentiating negative features in the latent space by minimizing a pairwise contrastive loss. As the embodiment of an outstanding discriminative unsupervised graph representation learning approach, GCL has achieved impressive successes on various graph benchmarks. However, such an approach falls short of recognizing the topology isomorphism of graphs, so that graphs with relatively homogeneous node features cannot be sufficiently discriminated. By revisiting classic graph topology recognition works, we disclose that the corresponding expertise intuitively complements GCL methods. To this end, we propose a novel hierarchical topology isomorphism expertise embedded graph contrastive learning method, which introduces knowledge distillation to empower GCL models to learn hierarchical topology isomorphism expertise at both the graph tier and the subgraph tier. On top of this, the proposed method is plug-and-play, and we empirically demonstrate that it is universal to multiple state-of-the-art GCL models. Solid theoretical analyses further prove that, compared with conventional GCL methods, our method achieves a tighter upper bound on the Bayes classification error. We conduct extensive experiments on real-world benchmarks to exhibit the performance superiority of our method over candidate GCL methods; e.g., in real-world graph representation learning experiments, the proposed method beats the state-of-the-art method by 0.23% in the unsupervised representation learning setting and 0.43% in the transfer learning setting. Our code is available at https://github.com/jyf123/HTML.



Paperid:1500
Authors:Jin Li, Qirong Zhang, Shuling Xu, Xinlong Chen, Longkun Guo, Yang-Geng Fu
College of Computer and Data Science, Fuzhou University AI Thrust, Information Hub, HKUST (Guangzhou), College of Computer and Data Science, Fuzhou University, College of Computer and Data Science, Fuzhou University, College of Computer and Data Science, Fuzhou University, College of Computer and Data Science, Fuzhou University Shandong Fundamental Research Center for Computer Science, College of Computer and Data Science, Fuzhou University
Abstract:
Despite graph neural networks' significant performance gains over many classic techniques on various graph-related downstream tasks, their successes are restricted to shallow models due to over-smoothing and optimization difficulties, among many other issues. In this paper, to alleviate the over-smoothing issue, we propose a soft graph normalization method that preserves the diversity of node embeddings and prevents indiscrimination due to possible over-closeness. Combined with residual connections, we analyze why the method can effectively capture the knowledge in both input graph structures and node features even with deep networks. Additionally, inspired by Curriculum Learning, which learns easy examples before hard ones, we propose a novel label-smoothing-based learning framework to enhance the optimization of deep GNNs, which iteratively smooths labels in an auxiliary graph and constructs many gradually non-smooth tasks for extracting increasingly complex knowledge and gradually discriminating nodes from coarse to fine. The method arguably reduces the risk of overfitting and generalizes better. Finally, extensive experiments demonstrate the effectiveness and potential of the proposed model and learning framework through comparison with twelve existing baselines, including state-of-the-art methods, on twelve real-world node classification benchmarks.



Paperid:1501
Authors:Jing Li, Quanxue Gao, Qianqian Wang, Wei Xia
School of Telecommunications Engineering, Xidian University, School of Telecommunications Engineering, Xidian University, School of Telecommunications Engineering, Xidian University Key Laboratory of Measurement and Control of Complex Systems of Engineering (Southeast University), Ministry of Education., School of Telecommunications Engineering, Xidian University
Abstract:
Graph-based multimedia data clustering has attracted much attention due to its impressive clustering performance on arbitrarily shaped multimedia data. However, existing graph-based clustering methods need post-processing to obtain labels for multimedia data, with high computational complexity. Moreover, they are sub-optimal for label learning because they exploit the complementary information embedded in data of different types pixel by pixel. To handle these problems, we present a novel label learning model with good interpretability for clustering. To be specific, our model decomposes the anchor graph into the product of two matrices with an orthogonal non-negative constraint to directly obtain soft labels without any post-processing, which remarkably reduces the computational complexity. To fully exploit the complementary information embedded in multimedia data, we introduce tensor Schatten p-norm regularization on the label tensor, which is composed of the soft labels of multimedia data. The solution can be obtained by iteratively optimizing four decoupled sub-problems, which can be solved more efficiently with good convergence. Experimental results on various datasets demonstrate the efficiency of our model.



Paperid:1502
Authors:Jingtao Li, Xing Chen, Li Yang, Adnan Siraj Rakin, Deliang Fan, Chaitali Chakrabarti
Sony AI, Arizona State University, University of North Carolina at Charlotte, Binghamton University (SUNY), Johns Hopkins University, Arizona State University
Abstract:
Split Federated Learning (SFL) is an emerging edge-friendly version of Federated Learning (FL), where clients process a small portion of the entire model. While SFL was considered resistant to Model Extraction Attacks (MEAs) by design, a recent work shows this is not necessarily the case. In general, gradient-based MEAs are not effective on a target model that is changing, as is the case in training-from-scratch applications. In this work, we propose a strong MEA during the SFL training phase. The proposed Early-Mix-GAN (EMGAN) attack effectively exploits gradient queries regardless of data assumptions. EMGAN adopts three key components to address the problem of inconsistent gradients. Specifically, it employs (i) an Early-learner approach for better adaptability, (ii) a Multi-GAN approach that introduces randomness into generator training to mitigate mode collapse, and (iii) ProperMix to effectively augment the limited amount of synthetic data for a better approximation of the target-domain data distribution. EMGAN achieves excellent results in extracting server-side models. With only 50 training samples, EMGAN successfully extracts a 5-layer server-side model of VGG-11 on CIFAR-10, with 7% less accuracy than the target model. With zero training data, the extracted model achieves 81.3% accuracy, which is significantly better than the 45.5% accuracy of the model extracted by the SoTA method. The code is available at https://github.com/zlijingtao/SFL-MEA.



Paperid:1503
Authors:Jiyong Li, Dilshod Azizov, Yang LI, Shangsong Liang
Sun Yat-sen University Guangdong Key Laboratory of Big Data Analysis and Processing, Mohamed bin Zayed University of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou) The Hong Kong University of Science and Technology, Sun Yat-sen University Guangdong Key Laboratory of Big Data Analysis and Processing Mohamed bin Zayed University of Artificial Intelligence
Abstract:
Recently, owing to the high-quality representations produced by contrastive learning methods, rehearsal-based contrastive continual learning has been proposed to explore how to continually learn transferable representation embeddings and avoid the catastrophic forgetting issue of traditional continual settings. Based on this framework, we propose Contrastive Continual Learning via Importance Sampling (CCLIS), which preserves knowledge by recovering previous data distributions with a new strategy for Replay Buffer Selection (RBS) that minimizes the estimated variance to retain hard negative samples for high-quality representation learning. Furthermore, we present the Prototype-instance Relation Distillation (PRD) loss, a technique designed to maintain the relationship between prototypes and sample representations using a self-distillation process. Experiments on standard continual learning benchmarks reveal that our method notably outperforms existing baselines in terms of knowledge preservation and thereby effectively counteracts catastrophic forgetting in online contexts. The code is available at https://github.com/lijy373/CCLIS.



Paperid:1504
Authors:Lan Li, Bowen Tao, Lu Han, De-chuan Zhan, Han-jia Ye
Nanjing University, Nanjing University, Nanjing University, Nanjing University, Nanjing University
Abstract:
Differing from traditional semi-supervised learning, class-imbalanced semi-supervised learning presents two distinct challenges: (1) the imbalanced distribution of training samples leads to model bias towards certain classes, and (2) the distribution of unlabeled samples is unknown and potentially distinct from that of labeled samples, which further contributes to class bias in the pseudo-labels during training. To address these dual challenges, we introduce a novel approach called Twice Class Bias Correction (TCBC). We begin by utilizing an estimate of the class distribution of the participating training samples to correct the model, enabling it to learn the posterior probabilities of samples under a class-balanced prior. This correction serves to alleviate the inherent class bias of the model. Building upon this foundation, we further estimate the class bias of the current model parameters during the training process. We apply a secondary correction to the model's pseudo-labels for unlabeled samples, aiming to make the assignment of pseudo-labels across different classes of unlabeled samples as equitable as possible. Through extensive experimentation on CIFAR10/100-LT, STL10-LT, and the sizable long-tailed dataset SUN397, we provide conclusive evidence that our proposed TCBC method reliably enhances the performance of class-imbalanced semi-supervised learning.
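A minimal, hypothetical sketch of the two corrections follows: the first adjusts the training logits with the estimated class prior of the participating samples (in the spirit of logit adjustment), and the second removes the model's current class bias from the softmax outputs before assigning pseudo-labels. The exact estimators and the adjustment form are assumptions, not the paper's precise rules.

```python
import torch
import torch.nn.functional as F

def prior_corrected_loss(logits, targets, class_prior):
    # First correction: train against logits shifted by the log prior so the network
    # learns posteriors under a class-balanced prior (logit-adjustment style).
    adjusted = logits + torch.log(class_prior + 1e-12)
    return F.cross_entropy(adjusted, targets)

def debiased_pseudo_labels(unlabeled_logits, model_class_bias):
    # Second correction: divide out the model's current class bias, renormalize,
    # then assign pseudo-labels so classes are treated more equitably.
    probs = F.softmax(unlabeled_logits, dim=-1) / (model_class_bias + 1e-12)
    probs = probs / probs.sum(dim=-1, keepdim=True)
    return probs.argmax(dim=-1)
```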



Paperid:1505
Authors:Long-Fei Li, Peng Zhao, Zhi-Hua Zhou
Nanjing University, Nanjing University, Nanjing University
Abstract:
We study reinforcement learning (RL) in episodic MDPs with adversarial full-information losses and unknown transitions. Instead of the classical static regret, we adopt dynamic regret as the performance measure, which benchmarks the learner's performance against changing policies, making it more suitable for non-stationary environments. The primary challenge is to handle the uncertainties of the unknown transition and the unknown non-stationarity of the environment simultaneously. We propose a general framework to decouple the two sources of uncertainty and show that the dynamic regret bound naturally decomposes into two terms, one due to constructing confidence sets to handle the unknown transition and the other due to choosing sub-optimal policies under the unknown non-stationarity. To this end, we first employ a two-layer online ensemble structure to handle the adaptation error due to the unknown non-stationarity, which is model-agnostic. Subsequently, we instantiate the framework for three fundamental MDP models, including tabular MDPs, linear MDPs, and linear mixture MDPs, and present corresponding approaches to control the exploration error due to the unknown transition. We provide dynamic regret guarantees respectively and show they are optimal in terms of the number of episodes K and the non-stationarity P̄ᴋ by establishing matching lower bounds. To the best of our knowledge, this is the first work that achieves a dynamic regret exhibiting optimal dependence on K and P̄ᴋ without prior knowledge about the non-stationarity for adversarial MDPs with unknown transitions.



Paperid:1506
Authors:Mengke Li, Zhikai HU, Yang Lu, Weichao Lan, Yiu-ming Cheung, Hui Huang
Guangdong Laboratory of Artificial Intelligence and Digital Economy Shenzhen University, Hong Kong Baptist University, Xiamen University, Hong Kong Baptist University, Hong Kong Baptist University, Shenzhen University
Abstract:
The imbalanced distribution of long-tailed data presents a considerable challenge for deep learning models, as it causes them to prioritize the accurate classification of head classes while largely disregarding tail classes. The biased decision boundary caused by inadequate semantic information in tail classes is one of the key factors contributing to their low recognition accuracy. To rectify this issue, we propose to augment tail classes by grafting diverse semantic information from head classes, referred to as head-to-tail fusion (H2T). We replace a portion of the feature maps of tail classes with those belonging to head classes. These fused features substantially enhance the diversity of tail classes. Both theoretical analysis and practical experimentation demonstrate that H2T can contribute to a more optimized solution for the decision boundary. We seamlessly integrate H2T into the classifier adjustment stage, making it a plug-and-play module. Its simplicity and ease of implementation allow for smooth integration with existing long-tailed recognition methods, facilitating a further performance boost. Extensive experiments on various long-tailed benchmarks demonstrate the effectiveness of the proposed H2T. The source code is available at https://github.com/Keke921/H2T.
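The grafting operation itself is simple; a hypothetical sketch is shown below, where a random subset of a tail sample's channel feature maps is replaced with the corresponding maps from a head-class sample. The fusion ratio and the random channel choice are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch

def h2t_fuse(tail_feat, head_feat, fuse_ratio=0.3):
    """Graft head-class feature maps into a tail-class sample.

    tail_feat, head_feat: (C, H, W) feature maps from the same backbone stage.
    """
    c = tail_feat.size(0)
    n_fuse = int(c * fuse_ratio)
    idx = torch.randperm(c)[:n_fuse]     # channels to graft from the head-class sample
    fused = tail_feat.clone()
    fused[idx] = head_feat[idx]          # replace the selected channels
    return fused
```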



Paperid:1507
Authors:Mingxin Li, Richong Zhang, Zhijie Nie, Yongyi Mao
SKLSDE, School of Computer Science and Engineering, Beihang University, Beijing, China, SKLSDE, School of Computer Science and Engineering, Beihang University, Beijing, China Zhongguancun Laboratory, Beijing, China, SKLSDE, School of Computer Science and Engineering, Beihang University, Beijing, China Shen Yuan Honors College, Beihang University, Beijing, China, School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
Abstract:
Sentence Representation Learning (SRL) is a fundamental task in Natural Language Processing (NLP), with Contrastive Learning of Sentence Embeddings (CSE) being the mainstream technique due to its superior performance. An intriguing phenomenon in CSE is the significant performance gap between supervised and unsupervised methods, whose only difference lies in the training data. Previous works attribute this performance gap to differences in two representation properties (alignment and uniformity). However, since alignment and uniformity only measure the results, they fail to answer "What aspects of the training data contribute to the performance gap?" and "How can the performance gap be narrowed?". In this paper, we conduct empirical experiments to answer these "What" and "How" questions. We first answer the "What" question by thoroughly comparing the behavior of supervised and unsupervised CSE during their respective training processes. From the comparison, we identify the similarity pattern as a key factor in the performance gap and introduce a metric, called Relative Fitting Difficulty (RFD), to measure the complexity of the similarity pattern. Then, based on the insights gained from the "What" question, we tackle the "How" question by increasing the pattern complexity of the training data. We achieve this by leveraging the In-Context Learning (ICL) capability of Large Language Models (LLMs) to generate data that simulates complex patterns. By utilizing the hierarchical patterns in the LLM-generated data, we effectively narrow the gap between supervised and unsupervised CSE. We release our code and appendix at https://github.com/BDBC-KG-NLP/NGCSE.



Paperid:1508
Authors:Shengrui Li, Xueting Han, Jing Bai
Tsinghua University Microsoft Research Asia, Microsoft Research Asia, Microsoft Research Asia
Abstract:
Fine-tuning pre-trained models has recently yielded remarkable performance gains in graph neural networks (GNNs). In addition to pre-training techniques, inspired by the latest work in natural language processing, more recent work has shifted towards applying effective fine-tuning approaches, such as parameter-efficient fine-tuning (PEFT). However, given the substantial differences between GNNs and transformer-based models, applying such approaches directly to GNNs has proved less effective. In this paper, we present a comprehensive comparison of PEFT techniques for GNNs and propose a novel PEFT method specifically designed for GNNs, called AdapterGNN. AdapterGNN preserves the knowledge of the large pre-trained model and leverages highly expressive adapters for GNNs, which can adapt to downstream tasks effectively with only a few parameters while also improving the model's generalization ability. Extensive experiments show that AdapterGNN achieves higher performance than other PEFT methods and is the only one consistently surpassing full fine-tuning (outperforming it by 1.6% and 5.7% in the chemistry and biology domains respectively, with only 5% and 4% of its parameters tuned) with lower generalization gaps. Moreover, we empirically show that a larger GNN model can have worse generalization ability, which differs from the trend observed in large transformer-based models. Building upon this, we provide a theoretical justification that PEFT can improve the generalization of GNNs by applying generalization bounds. Our code is available at https://github.com/Lucius-lsr/AdapterGNN.
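A minimal, hypothetical sketch of a bottleneck adapter wrapped around a frozen GNN layer is given below. It assumes PyTorch Geometric's GCNConv as the backbone layer; the residual placement and the bottleneck width are illustrative assumptions, not necessarily AdapterGNN's exact design.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv  # assumes PyTorch Geometric is installed

class AdapterGNNLayer(nn.Module):
    """A frozen GNN layer with a small trainable bottleneck adapter on its output."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.gnn = GCNConv(dim, dim)
        for p in self.gnn.parameters():
            p.requires_grad = False            # the pre-trained layer stays frozen
        self.adapter = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                                     nn.Linear(bottleneck, dim))

    def forward(self, x, edge_index):
        h = self.gnn(x, edge_index)
        return h + self.adapter(h)             # only the adapter's few parameters are trained
```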



Paperid:1509
Authors:Siyuan Li, Xun Wang, Rongchang Zuo, Kewu Sun, Lingfei Cui, Jishiyu Ding, Peng Liu, Zhe Ma
Harbin Institute of Technology, Intelligent Science & Technology Academy Limited of CASIC, Harbin Institute of Technology, Intelligent Science & Technology Academy Limited of CASIC, Institute of Computer Application Technology, Norinco Group, Intelligent Science & Technology Academy Limited of CASIC, Harbin Institute of Technology, Intelligent Science & Technology Academy Limited of CASIC
Abstract:
Imitation learning (IL) has achieved considerable success in solving complex sequential decision-making problems. However, current IL methods mainly assume that the environment for learning policies is the same as the environment for collecting expert datasets. These methods may therefore fail to work when there are slight differences between the learning and expert environments, especially for challenging problems with high-dimensional image observations. Moreover, in real-world scenarios, it is rare to have the chance to collect expert trajectories precisely in the target learning environment. To address this challenge, we propose a novel robust imitation learning approach, in which we develop an inverse-dynamics state representation learning objective to align the expert environment and the learning environment. With the abstract state representation, we design an effective reward function that thoroughly measures the similarity between behavior data and expert data, not only element-wise but also at the trajectory level. We conduct extensive experiments to evaluate the proposed approach under various visual perturbations and in diverse visual control tasks. Our approach can achieve near-expert performance in most environments and significantly outperforms state-of-the-art visual IL methods and robust IL methods.



Paperid:1510
Authors:Tianchun Li, Chengxiang Wu, Pengyi Shi, Xiaoqian Wang
Purdue University, Purdue University, Purdue University, Purdue University
Abstract:
Time-series generation has crucial practical significance for decision-making under uncertainty. Existing methods have various limitations, such as accumulating errors over time, which significantly impacts downstream tasks. We develop a novel generation method, DT-VAE, that incorporates generalizable domain knowledge, is mathematically justified, and significantly outperforms existing methods by mitigating error accumulation through a cumulative difference learning mechanism. We evaluate the performance of DT-VAE on several downstream tasks using both semi-synthetic and real time-series datasets, including benchmark datasets and our newly curated COVID-19 hospitalization datasets. The COVID-19 datasets enrich existing resources for time-series analysis. Additionally, we introduce Diverse Trend Preserving (DTP), a time-series clustering-based evaluation for direct and interpretable assessment of generated samples, serving as a valuable tool for evaluating time-series generative models.



Paperid:1511
Authors:Wenjie Li, Qifan Song, Jean Honorio, Guang Lin
Department of Statistics, Purdue University, Department of Statistics, Purdue University, School of Computing and Information Systems, The University of Melbourne, Departments of Mathematics and School of Mechanical Engineering, Purdue University
Abstract:
This work establishes the first framework for federated X-armed bandits, where different clients face heterogeneous local objective functions defined on the same domain and are required to collaboratively figure out the global optimum. We propose the first federated algorithm for such problems, named Fed-PNE. By utilizing the topological structure of the global objective inside the hierarchical partitioning and the weak smoothness property, our algorithm achieves sublinear cumulative regret with respect to both the number of clients and the evaluation budget. Meanwhile, it requires only logarithmic communication between the central server and clients, protecting client privacy. Experimental results on synthetic functions and real datasets validate the advantages of Fed-PNE over various centralized and federated baseline algorithms.



Paperid:1512
Authors:Wenjun Li, Pradeep Varakantham
Singapore Management University, Singapore Management University
Abstract:
To train generalizable Reinforcement Learning (RL) agents, researchers recently proposed the Unsupervised Environment Design (UED) framework, in which a teacher agent creates a very large number of training environments and a student agent trains on the experiences in these environments to be robust against unseen testing scenarios. For example, to train a student to master the “stepping over stumps” task, the teacher will create numerous training environments with varying stump heights and shapes. In this paper, we argue that UED neglects training efficiency and that its need for a very large number of environments (henceforth referred to as infinite-horizon training) makes it less suitable for training robots and non-expert humans. In real-world applications where either creating new training scenarios is expensive or training efficiency is of critical importance, we want to maximize both the learning efficiency and the learning outcome of the student. To achieve efficient finite-horizon training, we propose a novel Markov Decision Process (MDP) formulation for the teacher agent, referred to as Unsupervised Training Sequence Design (UTSD). Specifically, we encode salient information from the student policy (e.g., behaviors and learning progress) into the teacher's state space, enabling the teacher to closely track the student's learning progress and consequently discover optimal training sequences with finite lengths. Additionally, we explore the teacher's efficient adaptation to unseen students at test time by employing a context-based meta-learning approach, which leverages the teacher's past experiences with various students. Finally, we empirically demonstrate our teacher's capability to design efficient and effective training sequences for students with varying capabilities.



Paperid:1513
Authors:Xiaochuan Li, Baoyu Fan, Runze Zhang, Liang Jin, Di Wang, Zhenhua Guo, Yaqian Zhao, Rengang Li
Inspur Electronic Information Industry Co.,Ltd. Shandong Massive Information Technology Research Institute, Nankai University Inspur Electronic Information Industry Co.,Ltd., Inspur Electronic Information Industry Co.,Ltd., Inspur Electronic Information Industry Co.,Ltd., Inspur Electronic Information Industry Co.,Ltd., Inspur Electronic Information Industry Co.,Ltd., Inspur Electronic Information Industry Co.,Ltd., Tsinghua University Inspur Electronic Information Industry Co.,Ltd.
Abstract:
The emergence of ChatGPT has once again sparked research in generative artificial intelligence (GAI). While people have been amazed by the generated results, they have also noticed the reasoning potential reflected in the generated textual content. However, this ability for causal reasoning is primarily limited to the domain of language generation, as in models like GPT-3. In the visual modality, there is currently no equivalent research. Considering causal reasoning in visual content generation is significant, because visual information has infinite granularity. In particular, images can provide more intuitive and specific demonstrations for certain reasoning tasks, especially when compared to coarse-grained text. Hence, we propose a new image generation task called visual question answering with image (VQAI) and establish a dataset of the same name based on the classic Tom and Jerry animated series. Additionally, we develop a new paradigm for image generation to tackle the challenges of this task. Finally, we perform extensive experiments and analyses, including visualizations of the generated content and discussions on the potential and limitations. The code and data are publicly available under the CC BY-NC-SA 4.0 license for academic and non-commercial usage at https://github.com/IEIT-AGI/MIX-Shannon/blob/main/projects/VQAI/lgd_vqai.md.



Paperid:1514
Authors:Xingjian Li, Pengkun Yang, Yangcheng Gu, Xueying Zhan, Tianyang Wang, Min Xu, Chengzhong Xu
Computational Biology Department, Carnegie Mellon University, Center for Statistical Science, Tsinghua University, School of Software, Tsinghua University, Computational Biology Department, Carnegie Mellon University, Department of Computer Science, University of Alabama at Birmingham, Computational Biology Department, Carnegie Mellon University, State Key Lab of IOTSC, University of Macau
Abstract:
Uncertainty estimation for unlabeled data is crucial to active learning. With a deep neural network employed as the backbone model, the data selection process is highly challenging due to the potential overconfidence of the model inference. Existing methods resort to special learning fashions (e.g., adversarial) or auxiliary models to address this challenge. This tends to result in complex and inefficient pipelines, which render these methods impractical. In this work, we propose a novel algorithm that leverages noise stability to estimate data uncertainty. The key idea is to measure the output deviation from the original observation when the model parameters are randomly perturbed by noise. We provide theoretical analyses by leveraging small-Gaussian-noise theory and demonstrate that our method favors a subset with large and diverse gradients. Our method is generally applicable to various tasks, including computer vision, natural language processing, and structured data analysis. It achieves competitive performance compared against state-of-the-art active learning baselines.
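The selection criterion can be sketched in a few lines: perturb the network weights with small Gaussian noise several times and score each unlabeled sample by how much its output moves; a larger deviation suggests higher uncertainty. The noise scale and the number of trials below are illustrative assumptions, not the paper's tuned values.

```python
import copy
import torch

@torch.no_grad()
def noise_stability_scores(model, x_unlabeled, noise_std=1e-3, n_trials=5):
    """Score unlabeled samples by output deviation under random weight perturbations."""
    base = model(x_unlabeled)
    scores = torch.zeros(x_unlabeled.size(0))
    for _ in range(n_trials):
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * noise_std)   # small Gaussian weight noise
        scores += (noisy(x_unlabeled) - base).norm(dim=-1)
    return scores / n_trials   # larger deviation -> less stable -> query first
```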



Paperid:1515
Authors:Xinshu Li, Lina Yao
University of New South Wales, University of New South Wales CSIRO's Data61
Abstract:
Instrumental variables (IVs), widely applied in economics and healthcare, enable consistent counterfactual prediction in the presence of hidden confounding factors, effectively addressing endogeneity issues. Prevailing IV-based counterfactual prediction methods typically rely on the availability of valid IVs (satisfying Relevance, Exclusivity, and Exogeneity), a requirement which often proves elusive in real-world scenarios. Various data-driven techniques are being developed to create valid IVs (or representations of IVs) from a pool of IV candidates. However, most of these techniques still necessitate the inclusion of valid IVs within the set of candidates. This paper proposes a distribution-conditioned adversarial variational autoencoder to tackle this challenge. Specifically: 1) for Relevance and Exclusivity, we deduce the corresponding evidence lower bound following the Bayesian network structure and build the variational autoencoder; accordingly, 2) for Exogeneity, we design an adversarial game to encourage latent factors originating from the marginal distribution, enforcing independence between IVs and other outcome-related factors. Extensive experimental results validate the effectiveness, stability, and generality of our proposed model in generating valid IV factors in the absence of valid IV candidates.



Paperid:1516
Authors:Xinyao Li, Jingjing Li, Fengling Li, Lei Zhu, Ke Lu
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China (UESTC) Shenzhen Institute for Advanced Study, UESTC, University of Technology Sydney, School of Electronic and Information Engineering, Tongji University, University of Electronic Science and Technology of China
Abstract:
Efficiently utilizing the rich knowledge in pre-trained models has become a critical topic in the era of large models. This work focuses on adaptively utilizing knowledge from multiple source-pretrained models on an unlabeled target domain without accessing the source data. Despite being a practically useful setting, existing methods require extensive parameter tuning over each source model, which is computationally expensive when facing abundant source domains or larger source models. To address this challenge, we propose a novel approach that is free of parameter tuning over the source backbones. Our technical contribution lies in the Bi-level ATtention ENsemble (Bi-ATEN) module, which learns both intra-domain weights and inter-domain ensemble weights to achieve a fine balance between instance specificity and domain consistency. By slightly tuning source bottlenecks, we achieve comparable or even superior performance on the challenging DomainNet benchmark with less than 3% of the parameters trained and 8 times the throughput compared with the SOTA method. Furthermore, with minor modifications, the proposed module can be easily plugged into existing methods and gain a performance boost of more than 4%. Code is available at https://github.com/TL-UESTC/Bi-ATEN.



Paperid:1517
Authors:Xunkai Li, Yulin Zhao, Zhengyu Wu, Wentao Zhang, Rong-Hua Li, Guoren Wang
Beijing Institute of Technology, Shandong University, Beijing Institute of Technology, Peking University National Engineering Labratory for Big Data Analytics and Applications, Beijing Institute of Technology Shenzhen Institute of Technology, Beijing Institute of Technology
Abstract:
With the rapid advancement of AI applications, the growing need for data privacy and model robustness has highlighted the importance of machine unlearning, especially in thriving graph-based scenarios. However, most existing graph unlearning strategies primarily rely on well-designed architectures or manual processes, rendering them less user-friendly and posing challenges in terms of deployment efficiency. Furthermore, striking a balance between unlearning performance and framework generalization is also a pivotal concern. To address the above issues, we propose Mutual Evolution Graph Unlearning (MEGU), a new mutual evolution paradigm that simultaneously evolves the predictive and unlearning capacities of graph unlearning. By incorporating these two components, MEGU ensures complementary optimization in a unified training framework that aligns with the prediction and unlearning requirements. Extensive experiments on 9 graph benchmark datasets demonstrate the superior performance of MEGU in addressing unlearning requirements at the feature, node, and edge levels. Specifically, MEGU achieves average performance improvements of 2.7%, 2.5%, and 3.2% across these three levels of unlearning tasks when compared to state-of-the-art baselines. Furthermore, MEGU exhibits satisfactory training efficiency, reducing time and space overhead by an average of 159.8x and 9.6x, respectively, in comparison to retraining the GNN from scratch.



Paperid:1518
Authors:Ye Li, Ting Du, Yiwen Pang, Zhongyi Huang
Nanjing University of Aeronautics and Astronautics, Tsinghua University, Nanjing University of Aeronautics and Astronautics, Tsinghua University
Abstract:
Solving Singularly Perturbed Differential Equations (SPDEs) poses computational challenges arising from the rapid transitions in their solutions within thin regions. The effectiveness of deep learning in addressing differential equations motivates us to employ these methods for solving SPDEs. In this paper, we introduce the Component Fourier Neural Operator (ComFNO), an innovative operator learning method that builds upon the Fourier Neural Operator (FNO) while incorporating valuable prior knowledge obtained from asymptotic analysis. Our approach is not limited to FNO and can be applied to other neural network frameworks, such as the Deep Operator Network (DeepONet), leading to potential similar SPDE solvers. Experimental results across diverse classes of SPDEs demonstrate that ComFNO significantly improves accuracy compared to vanilla FNO. Furthermore, ComFNO exhibits natural adaptability to diverse data distributions and performs well in few-shot scenarios, showcasing its excellent generalization ability in practical situations.



Paperid:1519
Authors:Yujie Li, Xin Yang, Hao Wang, Xiangkun Wang, Tianrui Li
Southwestern University of Finance and Economics, Southwestern University of Finance and Economics, Nanyang Technological University, Southwestern University of Finance and Economics, Southwest Jiaotong University
Abstract:
This paper studies the problem of continual learning in an open-world scenario, referred to as Open-world Continual Learning (OwCL). Interest in OwCL is rising, yet the problem is highly challenging in two respects: i) learning a sequence of tasks without forgetting knowns from the past, and ii) identifying unknowns (novel objects/classes) in the future. Existing OwCL methods suffer from limited adaptability of the task-aware boundaries between knowns and unknowns, and do not consider the mechanism of knowledge transfer. In this work, we propose Pro-KT, a novel prompt-enhanced knowledge transfer model for OwCL. Pro-KT includes two key components: (1) a prompt bank to encode and transfer both task-generic and task-specific knowledge, and (2) a task-aware open-set boundary to identify unknowns in new tasks. Experimental results on two real-world datasets demonstrate that the proposed Pro-KT markedly outperforms state-of-the-art counterparts in both the detection of unknowns and the classification of knowns. Code is released at https://github.com/YujieLi42/Pro-KT.



Paperid:1520
Authors:Yunchen Li, Zhou Yu, Gaoqi He, Yunhang Shen, Ke Li, Xing Sun, Shaohui Lin
East China Normal University, East China Normal University Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, China, East China Normal University, Tencent, Tencent, Tencent, East China Normal University Key Laboratory of Advanced Theory and Application in Statistics and Data Science, Ministry of Education, China
Abstract:
Symmetric positive definite (SPD) matrices have shown important value and applications in statistics and machine learning, such as fMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on E(X|y), where y is a vector and X is an SPD matrix. However, these methods are challenging to scale to large-scale data. In this paper, inspired by the denoising diffusion probabilistic model (DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing a Gaussian distribution in the SPD space to estimate E(X|y). Moreover, our model can estimate p(X) unconditionally and flexibly without being given y. On the one hand, the model conditionally learns p(X|y) and utilizes the mean of samples to obtain E(X|y) as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data p(X) and generates samples that conform to this distribution. Furthermore, we propose a new SPD net which is much deeper than previous networks and allows for the inclusion of conditional factors. Experimental results on toy data and real taxi data demonstrate that our model effectively fits the data distribution both unconditionally and conditionally.



Paperid:1521
Authors:Zhiyuan Li, Wenshuai Zhao, Lijun Wu, Joni Pajarinen
University of Electronic Science and Technology of China, Aalto University, University of Electronic Science and Technology of China, Aalto University
Abstract:
A fundamental challenge in multi-agent reinforcement learning (MARL) is to learn the joint policy in an extremely large search space, which grows exponentially with the number of agents. Moreover, fully decentralized policy factorization significantly restricts the search space, which may lead to sub-optimal policies. In contrast, the auto-regressive joint policy can represent a much richer class of joint policies by factorizing the joint policy into the product of a series of conditional individual policies. While such factorization introduces the action dependency among agents explicitly in sequential execution, it does not take full advantage of the dependency during learning. In particular, the subsequent agents do not give the preceding agents feedback about their decisions. In this paper, we propose a new framework, Back-Propagation Through Agents (BPTA), that directly accounts for both agents' own policy updates and the learning of their dependent counterparts. This is achieved by propagating feedback through action chains. With the proposed framework, our Bidirectional Proximal Policy Optimisation (BPPO) outperforms state-of-the-art methods. Extensive experiments on matrix games, StarCraftII v2, Multi-agent MuJoCo, and Google Research Football demonstrate the effectiveness of the proposed method.



Paperid:1522
Authors:Jiaxuan Liang, Jun Wang, Guoxian Yu, Shuyin Xia, Guoyin Wang
Shandong University, Shandong University, Shandong University, Chongqing Key Laboratory of Computational Intelligence Chongqing University of Posts and Telecommunications, Chongqing Key Laboratory of Computational Intelligence Chongqing University of Posts and Telecommunications
Abstract:
Unveiling, modeling, and comprehending the causal mechanisms underpinning natural phenomena stand as fundamental endeavors across myriad scientific disciplines. Meanwhile, new knowledge emerges when causal relationships are discovered from data. Existing causal learning algorithms predominantly focus on the isolated effects of variables and overlook the intricate interplay of multiple variables and their collective behavioral patterns. Furthermore, the ubiquity of high-dimensional data exacts a substantial temporal cost from causal algorithms. In this paper, we develop a novel method called MgCSL (Multi-granularity Causal Structure Learning), which first leverages a sparse auto-encoder to explore coarse-graining strategies and causal abstractions from micro-variables to macro-variables. MgCSL then takes multi-granularity variables as inputs to train multilayer perceptrons and to delve into the causality between variables. To enhance efficacy on high-dimensional data, MgCSL introduces a simplified acyclicity constraint to adeptly search the directed acyclic graph among variables. Experimental results show that MgCSL outperforms competitive baselines and finds explainable causal connections on fMRI datasets.



Paperid:1523
Authors:Junjie Liang, Weijieying Ren, Hanifi Sahar, Vasant Honavar
The Pennsylvania State University, The Pennsylvania State University, The Pennsylvania State University, The Pennsylvania State University
Abstract:
We consider the problem of predictive modeling from irregularly and sparsely sampled longitudinal data with unknown, complex correlation structures and abrupt discontinuities. To address these challenges, we introduce a novel inducing-clusters longitudinal deep kernel Gaussian Process (ICDKGP). ICDKGP approximates the data-generating process by a zero-mean GP with a longitudinal deep kernel that models the unknown complex correlation structure in the data and a deterministic non-zero mean function that models the abrupt discontinuities. To improve the scalability and interpretability of ICDKGP, we introduce inducing clusters corresponding to the centers of clusters in the training data. We formulate the training of ICDKGP as a constrained optimization problem and derive its evidence lower bound. We introduce a novel relaxation of the resulting problem which, under rather mild assumptions, yields a solution with error bounded relative to the original problem. We describe the results of extensive experiments demonstrating that ICDKGP substantially outperforms state-of-the-art longitudinal methods on data with both smoothly and non-smoothly varying outcomes.



Paperid:1524
Authors:Meiyu Liang, Junping Du, Zhengyang Liang, Yongwang Xing, Wei Huang, Zhe Xue
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Deep cross-modal hashing technology provides an effective and efficient cross-modal unified representation learning solution for cross-modal search. However, existing methods neglect the implicit fine-grained multimodal knowledge relations between modalities, such as when an image contains information that is not directly described in the text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, in order to capture implicit fine-grained cross-modal semantic associations, a multi-modal knowledge graph is constructed, which represents the implicit multimodal knowledge relations between image and text as inter-modal and intra-modal semantic associations. Secondly, a cross-modal graph contrastive attention network is proposed to reason over the multi-modal knowledge graph and sufficiently learn the implicit fine-grained inter-modal and intra-modal knowledge relations. Thirdly, a cross-modal multi-granularity contrastive embedding learning mechanism is proposed, which fuses the global coarse-grained and local fine-grained embeddings via a multi-head attention mechanism for inter-modal and intra-modal contrastive learning, so as to enhance the cross-modal unified representations with stronger discriminativeness and semantic-consistency-preserving power. With the joint training of intra-modal and inter-modal contrast, the invariant and modal-specific information of different modalities can be maintained in the final unified cross-modal hash space. Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms state-of-the-art methods.



Paperid:1525
Authors:Xinyan Liang, Pinhan Fu, Qian Guo, Keyin Zheng, Yuhua Qian
Shanxi University, Shanxi University, Taiyuan University of Science and Technology, Shanxi University, Shanxi University
Abstract:
Neural architecture search-based multi-modal classification (NAS-MMC) methods can automatically obtain the optimal classifier for different multi-modal data sets. However, most existing NAS-MMC methods are dramatically time-consuming due to the requirement of training and evaluating enormous numbers of models. In this paper, we propose an efficient evolutionary NAS-MMC method called divide-and-conquer neural architecture search (DC-NAS). Specifically, the evolved population is first divided into k+1 sub-populations; k of them then evolve on k small-scale data sets respectively, obtained by splitting the entire data set using the k-fold stratified sampling technique, while the remaining one evolves on the entire data set. To solve the sub-optimal fusion model problem caused by training on partial data, the two kinds of sub-populations, trained on partial data and on the entire data, exchange the learned knowledge via two special knowledge bases. With the two techniques mentioned above, DC-NAS achieves a training time reduction and a classification performance improvement. Experimental results show that DC-NAS achieves state-of-the-art results in terms of classification performance, training efficiency, and the number of model parameters compared with the other NAS-MMC methods on three popular multi-modal tasks, including multi-label movie genre classification, action recognition with RGB and body joints, and dynamic hand gesture recognition.



Paperid:1526
Authors:Junlong Liao, Wenda Fu, Cong Wang, Zhongyu Wei, Jiarong Xu
Fudan University, Fudan University, Peking University, Fudan University, Fudan University
Abstract:
Deep learning methods on graph data have achieved remarkable efficacy across a variety of real-world applications, such as social network analysis and transaction risk detection. Nevertheless, recent studies have illuminated a concerning fact: even the most expressive Graph Neural Networks (GNNs) are vulnerable to graph adversarial attacks. While several methods have been proposed to enhance the robustness of GNN models against adversarial attacks, few have focused on a simple yet realistic approach: valuing adversarial risks and focusing safeguards at the node level. This empowers defenders to allocate a heightened security level to vulnerable nodes and a lower one to robust nodes. With this new perspective, we propose a novel graph defense strategy, RisKeeper, such that the adversarial risk can be kept directly in the input graph. We start by valuing the adversarial risk, introducing a cost-aware projected gradient descent attack that takes into account both cost avoidance and compliance with cost budgets. Subsequently, we present a learnable approach to ascertain the ideal security level for each individual node by solving a bi-level optimization problem. Through extensive experiments on four real-world datasets, we demonstrate that our method achieves superior performance, surpassing state-of-the-art methods. Our in-depth case studies provide further insights into vulnerable and robust structural patterns, serving as inspiration for practitioners to exercise heightened vigilance.



Paperid:1527
Authors:Yufan Liao, Qi Wu, Xing Yan
Renmin University of China City University of Hong Kong, City University of Hong Kong, Renmin University of China
Abstract:
Out-Of-Distribution (OOD) generalization is an essential topic in machine learning. However, recent research has focused only on the corresponding methods for neural networks. This paper introduces a novel and effective solution for OOD generalization of decision tree models, named the Invariant Decision Tree (IDT). IDT enforces a penalty term with regard to the unstable/varying behavior of a split across different environments during the growth of the tree. Its ensemble version, the Invariant Random Forest (IRF), is also constructed. Our proposed method is motivated by a theoretical result under mild conditions and validated by numerical tests with both synthetic and real datasets. The superior performance compared to non-OOD tree models implies that considering OOD generalization for tree models is absolutely necessary and should be given more attention.
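One way to realize such a penalty during tree growth is sketched below: a candidate split's variance-reduction gain is averaged over environments and discounted by how much the split's behavior (here, the left-child mean) varies across environments. This scoring rule is a hypothetical illustration, not the paper's exact objective.

```python
import numpy as np

def penalized_split_score(x_feature, y, env, threshold, lam=1.0):
    """Score a split on one feature, penalizing instability across environments."""
    left = x_feature <= threshold
    gains, left_means = [], []
    for e in np.unique(env):
        m = env == e
        if left[m].sum() == 0 or (~left)[m].sum() == 0:
            return -np.inf                      # split unusable in some environment
        var_before = y[m].var()
        var_after = (left[m].mean() * y[m][left[m]].var()
                     + (~left)[m].mean() * y[m][~left[m]].var())
        gains.append(var_before - var_after)    # per-environment variance reduction
        left_means.append(y[m][left[m]].mean())
    # discount splits whose left-child statistics vary across environments
    return np.mean(gains) - lam * np.var(left_means)
```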



Paperid:1528
Authors:Yun Liao, Junfan Li, Shizhong Liao, Qinghua Hu, Jianwu Dang
Tianjin University, Tianjin University, Tianjin University, Tianjin University, Tianjin University
Abstract:
In this paper, we study the mistake bound of online kernel learning on a budget. We propose a new budgeted online kernel learning model, called Ahpatron, which significantly improves the mistake bound of previous work and resolves an open problem related to upper bounds on hypothesis space constraints. We first present an aggressive variant of Perceptron, named AVP, a model without a budget, which uses an active updating rule. Then we design a new budget maintenance mechanism, which removes half of the examples and projects the removed examples onto a hypothesis space spanned by the remaining examples. Ahpatron adopts the above mechanism to approximate AVP. Theoretical analyses prove that Ahpatron has tighter mistake bounds, and experimental results show that Ahpatron outperforms state-of-the-art algorithms on the same or a smaller budget.
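The budget-maintenance step can be illustrated with a short, hypothetical sketch: half of the budgeted support vectors are dropped, and their weighted contribution is projected onto the span of the kept ones by solving a small kernel least-squares system. The RBF kernel, the choice of which half to drop, and the ridge term are assumptions for illustration.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    """RBF kernel matrix between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def halve_and_project(support_X, alphas, gamma=1.0, ridge=1e-6):
    """Drop half of the support set and project the dropped part onto the kept span."""
    n = len(alphas)
    keep, drop = np.arange(n)[: n // 2], np.arange(n)[n // 2:]
    K_kk = rbf(support_X[keep], support_X[keep], gamma)
    K_kd = rbf(support_X[keep], support_X[drop], gamma)
    # least-squares coefficients expressing the dropped examples in the kept span
    proj = np.linalg.solve(K_kk + ridge * np.eye(len(keep)), K_kd @ alphas[drop])
    return support_X[keep], alphas[keep] + proj
```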



Paperid:1529
Authors:Zhanfeng Liao, Yan Liu, Qian Zheng, Gang Pan
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University
Abstract:
A crucial reason for the success of existing NeRF-based methods is building a neural density field for geometry representation via multilayer perceptrons (MLPs). MLPs are continuous functions; however, the real geometry or density field is frequently discontinuous at the interface between the air and the surface. Such a contradiction leads to unfaithful geometry representation. To this end, this paper proposes Spiking NeRF, which leverages spiking neurons and a hybrid Artificial Neural Network (ANN)-Spiking Neural Network (SNN) framework to build a discontinuous density field for faithful geometry representation. Specifically, we first demonstrate why continuous density fields introduce inaccuracy. Then, we propose to use spiking neurons to build a discontinuous density field. We conduct a comprehensive analysis of the problems of existing spiking neuron models and then provide the numerical relationship between the parameters of the spiking neuron and the theoretical accuracy of geometry. Based on this, we propose a bounded spiking neuron to build the discontinuous density field. Our method achieves SOTA performance. The source code and the supplementary material are available at https://github.com/liaozhanfeng/Spiking-NeRF.



Paperid:1530
Authors:Julian Lienen, Eyke Hüllermeier
Department of Computer Science, Paderborn University, Institute of Informatics, LMU Munich Munich Center for Machine Learning
Abstract:
Label noise poses an important challenge in machine learning, especially in deep learning, in which large models with high expressive power dominate the field. Models of that kind are prone to memorizing incorrect labels, thereby harming generalization performance. Many methods have been proposed to address this problem, including robust loss functions and more complex label correction approaches. Robust loss functions are appealing due to their simplicity, but typically lack flexibility, while label correction usually adds substantial complexity to the training setup. In this paper, we suggest addressing the shortcomings of both methodologies by "ambiguating" the target information, adding additional, complementary candidate labels in case the learner is not sufficiently convinced of the observed training label. More precisely, we leverage the framework of so-called superset learning to construct set-valued targets based on a confidence threshold, which deliver imprecise yet more reliable beliefs about the ground truth, effectively helping the learner to suppress the memorization effect. In an extensive empirical evaluation, our method demonstrates favorable learning behavior on synthetic and real-world noise, confirming its effectiveness in detecting and correcting erroneous training labels.
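
A minimal sketch of turning softmax confidences into set-valued ("ambiguated") targets under a confidence threshold; the specific admission rule and the threshold value are illustrative assumptions rather than the paper's construction.

import numpy as np

def ambiguate_targets(probs, observed_labels, threshold=0.6):
    """Build set-valued targets: always keep the observed label, and for
    low-confidence samples admit other plausible candidate labels
    (illustrative thresholding rule)."""
    n, k = probs.shape
    targets = np.zeros((n, k), dtype=bool)
    targets[np.arange(n), observed_labels] = True            # keep the given label
    low_conf = probs[np.arange(n), observed_labels] < threshold
    # For low-confidence samples, add any class exceeding a simple per-class cutoff.
    targets[low_conf] |= probs[low_conf] >= (1.0 - threshold) / (k - 1)
    return targets

probs = np.array([[0.7, 0.2, 0.1],
                  [0.4, 0.35, 0.25]])
print(ambiguate_targets(probs, np.array([0, 0])))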



Paperid:1531
Authors:Haoxin Lin, Hongqiu Wu, Jiaji Zhang, Yihao Sun, Junyin Ye, Yang Yu
National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University Polixir Technologies, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University Polixir Technologies, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University Polixir Technologies Peng Cheng Laboratory
Abstract:
Real-world decision-making problems are usually accompanied by delayed rewards, which affect the sample efficiency of Reinforcement Learning, especially in the extremely delayed case where the only feedback is the episodic reward obtained at the end of an episode. Episodic return decomposition is a promising way to deal with the episodic-reward setting. Several corresponding algorithms have shown the remarkable effectiveness of step-wise proxy rewards learned from return decomposition. However, these existing methods lack either attribution or representation capacity, leading to inefficient decomposition in the case of long-term episodes. In this paper, we propose a novel episodic return decomposition method called Diaster (Difference of implicitly assigned sub-trajectory reward). Diaster decomposes any episodic reward into credits of two divided sub-trajectories at any cut point, and the step-wise proxy rewards come from differences in expectation. We theoretically and empirically verify that the decomposed proxy reward function can guide the policy to be nearly optimal. Experimental results show that our method outperforms previous state-of-the-art methods in terms of both sample efficiency and performance. The code is available at https://github.com/HxLyn3/Diaster.
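
The following sketch illustrates the "difference of expected sub-trajectory credit" idea with a hypothetical credit network; the network, its training objective, and the cut-point handling are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class PrefixCredit(nn.Module):
    """Predicts the expected credit of the sub-trajectory ending at a state
    (a stand-in for a learned sub-trajectory reward model)."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def stepwise_proxy_rewards(credit_model, observations):
    """Proxy reward at step t = credit(prefix up to t) - credit(prefix up to t-1)."""
    credits = credit_model(observations)          # shape [T]
    prev = torch.cat([credits.new_zeros(1), credits[:-1]])
    return credits - prev

obs = torch.randn(10, 4)                          # a toy 10-step episode
model = PrefixCredit(obs_dim=4)
print(stepwise_proxy_rewards(model, obs))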



Paperid:1532
Authors:Jimmy Lin, Junkai Li, Jiasi Gao, Weizhi Ma, Yang Liu
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China Department of Computer Science and Technology, Tsinghua University, Beijing, China, Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China Department of Computer Science and Technology, Tsinghua University, Beijing, China, Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China, Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China, Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China Department of Computer Science and Technology, Tsinghua University, Beijing, China
Abstract:
Tactile signals collected by wearable electronics are essential in modeling and understanding human behavior. One of the main applications of tactile signals is action classification, especially in healthcare and robotics. However, existing tactile classification methods fail to capture the spatial and temporal features of tactile signals simultaneously, which results in suboptimal performance. In this paper, we design the Spatio-Temporal Aware tactility Transformer (STAT) to utilize continuous tactile signals for action classification. We propose spatial and temporal embeddings along with a new temporal pretraining task in our model, which aims to enhance the transformer in modeling the spatio-temporal features of tactile signals. Specifically, the designed temporal pretraining task differentiates the time order of tubelet inputs to model temporal properties explicitly. Experimental results on a public action classification dataset demonstrate that our model outperforms state-of-the-art methods in all metrics.



Paperid:1533
Authors:Qiuzhen Lin, Yangfan Chen, Lijia Ma, Wei-Neng Chen, Jianqiang Li
Shenzhen University, Shenzhen University, Shenzhen University, South China University of Technology, Shenzhen University
Abstract:
Recently, an emerging research direction called Evolutionary Reinforcement Learning (ERL) has been proposed, which combines evolutionary algorithms with reinforcement learning (RL) for tackling sequential decision-making tasks. However, recently proposed ERL algorithms often suffer from two challenges: inaccurate policy estimation caused by the overestimation bias in RL, and insufficient exploration caused by inefficient mutations. To alleviate these problems, we propose an Evolutionary Reinforcement Learning algorithm enhanced with Truncated variance and Distillation mutation, called ERL-TD. We utilize multiple Q-networks to evaluate state-action pairs, so that multiple networks can provide more accurate evaluations, in which the variance of evaluations can be adopted to control the overestimation bias in RL. Moreover, we propose a new distillation mutation to provide a promising mutation direction, which differs from traditional mutations that generate a large number of random solutions. We evaluate ERL-TD on the continuous control benchmarks from the OpenAI Gym and the DeepMind Control Suite. The experiments show that ERL-TD achieves excellent performance and outperforms all baseline RL algorithms on the test suites.
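
A minimal sketch of forming a conservative target from an ensemble of Q estimates by truncating with their spread; the exact combination rule and coefficient are assumptions, not necessarily ERL-TD's scheme.

import torch

def truncated_q_target(q_values, kappa=1.0):
    """Combine an ensemble of Q estimates into a conservative target by
    subtracting a multiple of their standard deviation (illustrative)."""
    mean = q_values.mean(dim=0)
    std = q_values.std(dim=0, unbiased=False)
    return mean - kappa * std

# Toy usage: 5 Q-networks evaluating a batch of 3 state-action pairs.
q_ensemble = torch.randn(5, 3) + 10.0
print(truncated_q_target(q_ensemble))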



Paperid:1534
Authors:Wei Lin, Xu Peng, Zhengtao Yu, Taisong Jin
Xiamen University, Xiamen University, Kunming University of Science and Technology, Xiamen University
Abstract:
In recent years, Hypergraph Neural Networks (HGNNs) have achieved considerable success through manually designed architectures, which are capable of extracting effective patterns with high-order interactions from non-Euclidean data. However, such a mechanism is extremely inefficient, demanding tremendous human effort to tune diverse model parameters. In this paper, we propose a novel Hypergraph Neural Architecture Search (HyperNAS) method to automatically design optimal HGNNs. The proposed model constructs a search space suitable for hypergraphs and derives hypergraph architectures through a differentiable search strategy. A hypergraph structure-aware distance criterion is introduced as a guideline for obtaining an optimal hypergraph architecture via the leave-one-out method. Experimental results for node classification on the benchmark Cora, Citeseer, and Pubmed citation networks and on hypergraph datasets show that HyperNAS outperforms existing HGNN models and graph NAS methods.



Paperid:1535
Authors:Zhipeng Lin, Wenjing Yang, Haotian Wang, Haoang Chi, Long Lan, Ji Wang
National University of Defense Technology, National University of Defense Technology, Academy of Military Science National University of Defense Technology, National University of Defense Technology, National University of Defense Technology, National University of Defense Technology
Abstract:
Few-shot learning (FSL) aims to endow learning models with the ability to automatically adapt to novel (unseen) domains in open-world scenarios. Nonetheless, there exists a significant disparity between the vast number of new concepts encountered in the open world and the restricted scale of existing FSL works, which primarily focus on a limited number of novel classes. Such a gap hinders the practical applicability of FSL in realistic scenarios. To bridge this gap, we propose a new problem named Few-Shot Learning with Many Novel Classes (FSL-MNC), which substantially enlarges the number of novel classes, exceeding the count in the traditional FSL setup by over 500-fold. This new problem exhibits two major challenges: increased computation overhead during meta-training, and degraded classification performance caused by the large number of classes during meta-testing. To overcome these challenges, we propose a Simple Hierarchy Pipeline (SHA-Pipeline). Due to the inefficiency of traditional EML protocols, we re-design a lightweight training strategy to reduce the overhead brought by many more novel classes. To capture discriminative semantics across numerous novel classes, we effectively reconstruct and leverage the class hierarchy information during meta-testing. Experiments show that the proposed SHA-Pipeline significantly outperforms not only the ProtoNet baseline but also state-of-the-art alternatives across different numbers of novel classes.



Paperid:1536
Authors:Ao Liu, Wenshan Li, Tao Li, Beibei Li, Hanyuan Huang, Pan Zhou
Sichuan University, Chengdu University of Information Technology, Sichuan University, Sichuan University, Sichuan University, Huazhong University of Science and Technology
Abstract:
Graph neural networks (GNNs) have recently been shown to be vulnerable to adversarial attacks, where slight perturbations in the graph structure can lead to erroneous predictions. However, current robust models for defending against such attacks inherit the transductive limitations of graph convolutional networks (GCNs). As a result, they are constrained by fixed structures and do not naturally generalize to unseen nodes. Here, we discover that transductive GCNs inherently possess a distillable robustness, achieved through a wave-induced resonance process. Based on this, we foster this resonance to facilitate inductive and robust learning. Specifically, we first prove that the signal formed by GCN-driven message passing (MP) is equivalent to the edge-based Laplacian wave, where, within a wave system, resonance can naturally emerge between the signal and its transmitting medium. This resonance provides inherent resistance to malicious perturbations inflicted on the signal system. We then prove that merely three MP iterations within GCNs can induce signal resonance between nodes and edges, manifesting as a coupling between nodes and their distillable surrounding local subgraph. Consequently, we present the Graph Resonance-fostering Network (GRN) to foster this resonance by learning node representations from their distilled resonating subgraphs. By capturing the edge-transmitted signals within this subgraph and integrating them with the node signal, GRN embeds these combined signals into the central node's representation. This node-wise embedding approach allows for generalization to unseen nodes. We validate our theoretical findings with experiments and demonstrate that GRN generalizes robustness to unseen nodes, whilst maintaining state-of-the-art classification accuracy on perturbed graphs. Appendices can be found in the arXiv version: https://arxiv.org/abs/2312.08651



Paperid:1537
Authors:Chengliang Liu, Jinlong Jia, Jie Wen, Yabo Liu, Xiaoling Luo, Chao Huang, Yong Xu
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Shenzhen University, Shenzhen Campus of Sun Yat-sen University, Harbin Institute of Technology, Shenzhen
Abstract:
As a combination of emerging multi-view learning methods and traditional multi-label classification tasks, multi-view multi-label classification has shown broad application prospects. The diverse semantic information contained in heterogeneous data effectively enables the further development of multi-label classification. However, the widespread incompleteness problem in multi-view features and labels greatly hinders the practical application of multi-view multi-label classification. Therefore, in this paper, we propose an attention-induced missing-instance imputation technique to enhance the generalization ability of the model. Different from existing incomplete multi-view completion methods, we attempt to approximate the latent features of missing instances in the embedding space according to cross-view joint attention, instead of recovering missing views in kernel space or the original feature space. Accordingly, the completed multi-view features are dynamically weighted by the confidence derived from joint attention in the late fusion phase. In addition, we propose a multi-view multi-label classification framework based on label-semantic feature learning, utilizing the statistical weak label correlation matrix and a graph attention network to guide the learning process of label-specific features. Finally, our model is compatible with missing multi-view and partial multi-label data simultaneously, and extensive experiments on five datasets confirm the advancement and effectiveness of our embedding imputation method and multi-view multi-label classification model.



Paperid:1538
Authors:Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
University of Surrey, University of Surrey, The Chinese University of Hong Kong, University of Surrey, University of Surrey
Abstract:
The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods using a fixed temporal resolution, the DiffRes-based method can achieve equivalent or better classification accuracy with at least a 25% reduction in computational cost. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of input acoustic features, without adding to the computational cost.
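
As a rough, non-differentiable stand-in for the frame-merging idea, the sketch below keeps the top-scoring frames and pools each preceding run of non-essential frames into them; the importance scores and the hard top-k selection are assumptions, since the actual module is trained end-to-end.

import torch

def merge_frames(spectrogram, importance, keep_ratio=0.75):
    """Keep the most important time frames and average-pool each run of
    dropped frames into the kept frame that follows them (illustrative only)."""
    T = spectrogram.shape[-1]
    k = max(1, int(T * keep_ratio))
    keep_idx = torch.topk(importance, k).indices.sort().values
    merged, prev = [], 0
    for i in keep_idx.tolist():
        # Average the kept frame with the non-essential frames preceding it.
        merged.append(spectrogram[..., prev:i + 1].mean(dim=-1))
        prev = i + 1
    return torch.stack(merged, dim=-1)

mel = torch.randn(64, 100)               # 64 mel bins, 100 frames
scores = torch.rand(100)                 # per-frame importance (e.g. from a small network)
print(merge_frames(mel, scores).shape)   # -> torch.Size([64, 75])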



Paperid:1539
Authors:I-Jieh Liu, Ci-Siang Lin, Fu-En Yang, Yu-Chiang Frank Wang
Graduate Institute of Communication Engineering, National Taiwan University, Graduate Institute of Communication Engineering, National Taiwan University NVIDIA, Graduate Institute of Communication Engineering, National Taiwan University NVIDIA, Graduate Institute of Communication Engineering, National Taiwan University NVIDIA
Abstract:
Federated Learning (FL) is an emerging paradigm that enables multiple users to collaboratively train a robust model in a privacy-preserving manner without sharing their private data. Most existing FL approaches consider only traditional single-label image classification, ignoring the impact when transferring the task to multi-label image classification. Nevertheless, it is still challenging for FL to deal with user heterogeneity in local data distributions in real-world FL scenarios, and this issue becomes even more severe in multi-label image classification. Inspired by the recent success of Transformers in centralized settings, we propose a novel FL framework for multi-label classification. Since partial label correlations may be observed by local clients during training, directly aggregating locally updated models would not produce satisfactory performance. Thus, we propose a novel FL framework of Language-Guided Transformer (FedLGT) to tackle this challenging task, which aims to exploit and transfer knowledge across different clients for learning a robust global model. Through extensive experiments on various multi-label datasets (e.g., FLAIR, MS-COCO, etc.), we show that our FedLGT is able to achieve satisfactory performance and outperforms standard FL techniques under multi-label FL scenarios. Code is available at https://github.com/Jack24658735/FedLGT.



Paperid:1540
Authors:Ji Liu, Dehua Tang, Yuanxian Huang, Li Zhang, Xiaocheng Zeng, Dong Li, Mingjie Lu, Jinzhang Peng, Yu Wang, Fan Jiang, Lu Tian, Ashish Sirasao
AMD, AMD, AMD, AMD, AMD, AMD, AMD, AMD, AMD, AMD, AMD, AMD
Abstract:
Traditional channel-wise pruning methods, which reduce network channels, struggle to effectively prune efficient CNN models with depth-wise convolutional layers and certain efficient modules, such as the popular inverted residual blocks. Prior depth pruning methods, which reduce network depth, are not suitable for pruning some efficient models due to the existence of certain normalization layers. Moreover, fine-tuning a subnet after directly removing activation layers would corrupt the original model weights, hindering the pruned model from achieving high performance. To address these issues, we propose a novel depth pruning method for efficient models. Our approach introduces a novel block pruning strategy and a progressive training method for the subnet. Additionally, we extend our pruning method to vision transformer models. Experimental results demonstrate that our method consistently outperforms existing depth pruning methods across various pruning configurations. Applying our method to ConvNeXtV1, we obtain three pruned models that surpass most SOTA efficient models with comparable inference performance. Our method also achieves state-of-the-art pruning performance on the vision transformer model.



Paperid:1541
Authors:Ji Liu, Juncheng Jia, Tianshi Che, Chao Huo, Jiaxiang Ren, Yang Zhou, Huaiyu Dai, Dejing Dou
Hithink RoyalFlush Information Network Co., Ltd., China, Soochow University, China Collaborative Innovation Center of Novel Software Technology and Industrialization, China, Auburn University, United States, Soochow University, China, Auburn University, United States, Auburn University, United States, North Carolina State University, United States, Boston Consulting Group, China
Abstract:
As a promising approach to dealing with distributed data, Federated Learning (FL) has achieved major advancements in recent years. FL enables collaborative model training by exploiting the raw data dispersed across multiple edge devices. However, the data is generally non-independent and identically distributed, i.e., statistical heterogeneity, and the edge devices significantly differ in terms of both computation and communication capacity, i.e., system heterogeneity. The statistical heterogeneity leads to severe accuracy degradation, while the system heterogeneity significantly prolongs the training process. To address these heterogeneity issues, we propose an Asynchronous Staleness-aware Model Update FL framework, i.e., FedASMU, with two novel methods. First, we propose an asynchronous FL system model with a dynamic model aggregation method between updated local models and the global model on the server for superior accuracy and high efficiency. Then, we propose an adaptive local model adjustment method by aggregating the fresh global model with local models on devices to further improve the accuracy. Extensive experimentation with 6 models and 5 public datasets demonstrates that FedASMU significantly outperforms baseline approaches in terms of accuracy (0.60% to 23.90% higher) and efficiency (3.54% to 97.98% faster).
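
A minimal sketch of folding one stale client update into the global model with a staleness-dependent weight; the 1/(1+staleness) rule is a common heuristic used here only for illustration, whereas FedASMU's aggregation weights are dynamic and adaptive.

import copy
import torch

def staleness_weighted_merge(global_state, local_state, staleness, alpha=0.6):
    """Asynchronously merge one delayed client update into the global model,
    down-weighting it by its staleness (illustrative fixed rule)."""
    weight = alpha / (1.0 + staleness)          # stale updates count for less
    merged = copy.deepcopy(global_state)
    for name in merged:
        merged[name] = (1.0 - weight) * global_state[name] + weight * local_state[name]
    return merged

# Toy usage with two small linear layers standing in for global and local models.
g = torch.nn.Linear(4, 2).state_dict()
l = torch.nn.Linear(4, 2).state_dict()
new_global = staleness_weighted_merge(g, l, staleness=3)
print({k: v.shape for k, v in new_global.items()})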



Paperid:1542
Authors:Jian-Dong Liu, Zhi-Hao Tan, Zhi-Hua Zhou
Nanjing University, Nanjing University, Nanjing University
Abstract:
The learnware paradigm aims to establish a market of numerous well-performing machine learning models, enabling users to leverage existing helpful models for their tasks instead of starting from scratch. Each learnware in the market is a model submitted by its developer, associated with a specification generated with the help of the learnware market, representing the model's specialty and utility and enabling it to be identified for new user tasks. As the market continuously scales up, accommodating an ever-increasing number of learnwares, the critical challenge of the learnware paradigm is to effectively and efficiently identify the most helpful learnware(s) for a new user task without accessing the user's raw data. In this paper, to achieve increasingly accurate learnware characterization and identification as the number of learnwares in the market grows, we propose an approach called Evolvable Learnware Specification with Index (ELSI). Specifically, based on the key idea of leveraging the task information within learnware specifications, we tackle the challenge of ascertaining the capabilities of models beyond their original training tasks, thereby enabling learnware specifications and the entire market to evolve continuously. Furthermore, by organizing learnwares and constructing specification indexes, we design a practical procedure to accurately and efficiently identify helpful learnwares without examining the entire market. Theoretical analysis and extensive experiments on a learnware market prototype encompassing thousands of models and covering six real-world scenarios validate the effectiveness and efficiency of our approach.



Paperid:1543
Authors:Jiexi Liu, Songcan Chen
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence
Abstract:
Learning universal time series representations applicable to various types of downstream tasks is challenging but valuable in real applications. Recently, researchers have attempted to leverage the success of self-supervised contrastive learning (SSCL) in Computer Vision (CV) and Natural Language Processing (NLP) to tackle time series representation. Nevertheless, due to the special temporal characteristics, relying solely on empirical guidance from other domains may be ineffective for time series and difficult to adapt to multiple downstream tasks. To this end, we review three parts involved in SSCL: 1) designing augmentation methods for positive pairs, 2) constructing (hard) negative pairs, and 3) designing the SSCL loss. For 1) and 2), we find that unsuitable positive and negative pair construction may introduce inappropriate inductive biases, which neither preserve temporal properties nor provide sufficient discriminative features. For 3), exploring only segment- or instance-level semantic information is not enough for learning universal representations. To remedy the above issues, we propose a novel self-supervised framework named TimesURL. Specifically, we first introduce a frequency-temporal-based augmentation to keep the temporal property unchanged. Then, we construct double Universums as a special kind of hard negative to guide better contrastive learning. Additionally, we introduce time reconstruction as a joint optimization objective with contrastive learning to capture both segment-level and instance-level information. As a result, TimesURL can learn high-quality universal representations and achieves state-of-the-art performance in 6 different downstream tasks, including short- and long-term forecasting, imputation, classification, anomaly detection, and transfer learning.



Paperid:1544
Authors:Jin Liu, Xiaokang Pan, Junwen Duan, Hong-Dong Li, Youqi Li, Zhe Qu
School of Computer Science and Engineering, Central South University, School of Computer Science and Engineering, Central South University, School of Computer Science and Engineering, Central South University, School of Computer Science and Engineering, Central South University, School of Computer Science and Technology, Beijing Institute of Technology, School of Computer Science and Engineering, Central South University
Abstract:
This paper delves into stochastic optimization for compositional minimax optimization, a pivotal challenge across various machine learning domains, including deep AUC maximization and reinforcement learning policy evaluation. Despite its significance, compositional minimax optimization is still underexplored. Adding to the complexity, current methods for compositional minimax optimization are plagued by sub-optimal complexities or heavy reliance on sizable batch sizes. To address these constraints, this paper introduces a novel method, called Nested STOchastic Recursive Momentum (NSTORM), which achieves the optimal sample complexity for obtaining a near-accurate solution, matching existing minimax methods. We also demonstrate that NSTORM achieves the same sample complexity under the Polyak-Lojasiewicz (PL) condition, an insightful extension of its capabilities. Yet, NSTORM requires low learning rates, potentially constraining its real-world applicability in machine learning. To overcome this hurdle, we present ADAptive NSTORM (ADA-NSTORM) with adaptive learning rates. We demonstrate that ADA-NSTORM achieves the same sample complexity, while experimental results show that it is more effective. All the proposed complexities indicate that our methods match the lower bounds of existing minimax optimization methods, without requiring a large batch size in each iteration. Extensive experiments support the efficiency of our proposed methods.
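
The recursive-momentum estimator that NSTORM builds on (STORM-style) can be sketched as below on a toy quadratic; the nested/compositional and minimax parts of the actual method are omitted, and all names are illustrative.

import torch

def storm_estimator(grad_new, grad_old_same_sample, prev_estimate, a):
    # d_t = g_t(x_t) + (1 - a) * (d_{t-1} - g_t(x_{t-1})), both gradients on the same fresh sample.
    return grad_new + (1.0 - a) * (prev_estimate - grad_old_same_sample)

# Toy problem: minimize f(x) = 0.5 * ||x||^2 with noisy gradient oracles.
torch.manual_seed(0)
x = torch.ones(3)
d = x + 0.1 * torch.randn(3)               # first estimate = one stochastic gradient
lr, a = 0.1, 0.5
for _ in range(200):
    x_new = x - lr * d
    noise = 0.1 * torch.randn(3)            # one sample shared by both gradient evaluations
    d = storm_estimator(x_new + noise, x + noise, d, a)
    x = x_new
print(x)                                    # fluctuates near the minimizer at zero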



Paperid:1545
Authors:Jinsong Liu, Chenghan Xie, Qi Deng, Dongdong Ge, Yinyu Ye
Shanghai University of Finance and Economics, Fudan University, Shanghai University of Finance and Economics, Shanghai University of Finance and Economics, Stanford University
Abstract:
Value Iteration (VI) is one of the most classic algorithms for solving Markov Decision Processes (MDPs), and it lays the foundation for various more advanced reinforcement learning algorithms, such as Q-learning. VI may take a large number of iterations to converge as it is a first-order method. In this paper, we introduce the Newton Value Iteration (NVI) algorithm, which eliminates the impact of the action space dimension compared to some previous second-order methods. Consequently, NVI can efficiently handle MDPs with large action spaces. Building upon NVI, we propose a novel approach called Sketched Newton Value Iteration (SNVI) to tackle MDPs with both large state and action spaces. SNVI not only inherits the stability and fast convergence of second-order algorithms, but also significantly reduces computational complexity, making it highly scalable. Extensive experiments demonstrate the superiority of our algorithms over traditional VI and previously proposed second-order VI algorithms.
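
For reference, classic first-order value iteration, which the abstract takes as its starting point, looks like the following; the Newton and sketched variants replace this fixed-point update and are not reproduced here.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Classic value iteration: V <- max_a [R(s,a) + gamma * E_{s'}[V(s')]].
    P has shape (A, S, S) with P[a, s, s'] the transition probability."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Toy 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])
print(value_iteration(P, R))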



Paperid:1546
Authors:Jinxin Liu, Ziqi Zhang, Zhenyu Wei, Zifeng Zhuang, Yachen Kang, Sibo Gai, Donglin Wang
School of Engineering, Westlake University, School of Engineering, Westlake University, School of Engineering, Westlake university, School of Engineering, Westlake University, School of Engineering, Westlake University, School of Engineering, Westlake University, School of Engineering, Westlake University
Abstract:
Offline reinforcement learning (RL) aims to learn a policy using only pre-collected and fixed data. Although it avoids the time-consuming online interactions in RL, it poses challenges for out-of-distribution (OOD) state-action pairs and often suffers from data inefficiency during training. Despite many efforts being devoted to addressing OOD state-action pairs, the latter issue (data inefficiency) receives little attention in offline RL. To address this, this paper proposes cross-domain offline RL, which assumes that the offline data incorporate additional source-domain data from varying transition dynamics (environments) and expects them to contribute to offline data efficiency. In doing so, we identify a new challenge of OOD transition dynamics, beyond the common OOD state-action issue, when utilizing cross-domain offline data. Then, we propose our method BOSA, which employs two support-constrained objectives to address the above OOD issues. Through extensive experiments in the cross-domain offline RL setting, we demonstrate that BOSA can greatly improve offline data efficiency: using only 10% of the target data, BOSA achieves 74.4% of the SOTA offline RL performance obtained with 100% of the target data. Additionally, we show that BOSA can be effortlessly plugged into model-based offline RL and noising data augmentation techniques (used for generating source-domain data), which naturally avoids the potential dynamics mismatch between target-domain data and newly generated source-domain data.



Paperid:1547
Authors:Jinyi Liu, Zhi Wang, Yan Zheng, Jianye Hao, Chenjia Bai, Junjie Ye, Zhen Wang, Haiyin Piao, Yang Sun
College of Intelligence and Computing, Tianjin University, Independent Researcher, College of Intelligence and Computing, Tianjin University, College of Intelligence and Computing, Tianjin University, Shanghai AI Laboratory, Independent Researcher, Northwestern Polytechnical University, Northwestern Polytechnical University, SADRI institute
Abstract:
In reinforcement learning, optimism in the face of uncertainty (OFU) is a mainstream principle for directing exploration towards less explored areas, characterized by higher uncertainty. However, in the presence of environmental stochasticity (noise), purely optimistic exploration may lead to excessive probing of high-noise areas, consequently impeding exploration efficiency. Hence, when exploring noisy environments, while optimism-driven exploration serves as a foundation, prudent attention to alleviating unnecessary over-exploration in high-noise areas becomes beneficial. In this work, we propose the Optimistic Value Distribution Explorer (OVD-Explorer) to achieve noise-aware optimistic exploration for continuous control. OVD-Explorer proposes a new measurement of the policy's exploration ability from an optimistic perspective while accounting for noise, and leverages gradient ascent to drive exploration. Practically, OVD-Explorer can be easily integrated with continuous control RL algorithms. Extensive evaluations on the MuJoCo and GridChaos tasks demonstrate the superiority of OVD-Explorer in achieving noise-aware optimistic exploration.



Paperid:1548
Authors:Meihan Liu, Zeyu Fang, Zhen Zhang, Ming Gu, Sheng Zhou, Xin Wang, Jiajun Bu
Zhejiang University, Zhejiang University, National University of Singapore, Zhejiang University, Zhejiang University, Tsinghua University, Zhejiang University
Abstract:
Unsupervised Graph Domain Adaptation (UGDA) aims to transfer knowledge from a labelled source graph to an unlabelled target graph in order to address the distribution shifts between graph domains. Previous works have primarily focused on aligning data from the source and target graphs in the representation space learned by graph neural networks (GNNs). However, the inherent generalization capability of GNNs has been largely overlooked. Motivated by our empirical analysis, we reevaluate the role of GNNs in graph domain adaptation and uncover the pivotal role of the propagation process in GNNs for adapting to different graph domains. We provide a comprehensive theoretical analysis of UGDA and derive a generalization bound for multi-layer GNNs. By formulating the GNN Lipschitz constant for k-layer GNNs, we show that the target risk bound can be tightened by removing propagation layers in the source graph and stacking multiple propagation layers in the target graph. Based on the empirical and theoretical analysis mentioned above, we propose a simple yet effective approach called A2GNN for graph domain adaptation. Through extensive experiments on real-world datasets, we demonstrate the effectiveness of our proposed A2GNN framework.



Paperid:1549
Authors:Mengpu Liu, Mengying Zhu, Xiuyuan Wang, Guofang Ma, Jianwei Yin, Xiaolin Zheng
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang Gongshang University, Zhejiang University, Zhejiang University
Abstract:
Stock movement prediction plays an important role in quantitative trading. Despite advances in existing models that enhance stock movement prediction by incorporating stock relations, these prediction models face two limitations, i.e., constructing either insufficient or static stock relations, which fail to effectively capture complex dynamic stock relations because such relations are influenced by various factors in the ever-changing financial market. To tackle the above limitations, we propose a novel stock movement prediction model, ECHO-GL, based on stock relations derived from earnings calls. ECHO-GL not only constructs comprehensive stock relations by exploiting the rich semantic information in earnings calls but also captures the movement signals between related stocks based on multimodal and heterogeneous graph learning. Moreover, ECHO-GL customizes learnable stock stochastic processes based on the post-earnings announcement drift (PEAD) phenomenon to generate the temporal stock price trajectory, which can be easily plugged into any investment strategy with different time horizons to meet investment demands. Extensive experiments on two financial datasets demonstrate the effectiveness of ECHO-GL on stock price movement prediction tasks, together with high prediction accuracy and trading profitability.



Paperid:1550
Authors:Qingsong Liu, Zhixuan Fang
IIIS, Tsinghua University, Beijing, China, IIIS, Tsinghua University, Beijing, China Shanghai Qi Zhi Institute, Shanghai, China
Abstract:
We consider a decentralized multi-player multi-armed bandit (MP-MAB) problem where players cannot observe the actions and rewards of other players and no explicit communication or coordination between players is possible. Prior studies mostly focus on maximizing the sum of the players' rewards over time. However, total-reward maximization may lead to imbalanced rewards among players, resulting in poor Quality of Service (QoS) for some players. In contrast, our objective is to let each player n achieve a predetermined expected average reward over time, i.e., achieve a predetermined level of QoS. We develop a novel decentralized MP-MAB algorithm to accomplish this objective by leveraging the methodology of randomized matching. We prove that our decentralized algorithm ensures that all players have an O(1) QoS regret. We also reveal an analogy between our MP-MAB model and online wireless queuing systems, which builds a connection between QoS in MP-MAB learning and stability in queuing theory.



Paperid:1551
Authors:Ruyue Liu, Rong Yin, Yong Liu, Weiping Wang
Institute of Information Engineering, CAS School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, CAS School of Cyber Security, University of Chinese Academy of Sciences, Renmin University of China, Institute of Information Engineering, CAS, China
Abstract:
Graph Contrastive Learning (GCL) is a self-supervised method that combines the advantages of Graph Convolutional Networks (GCNs) and contrastive learning, making it promising for learning node representations. However, the GCN encoders used in these methods rely on the Fourier transform to learn fixed graph representations, which is inherently limited by the uncertainty principle involving spatial and spectral localization trade-offs. To overcome the inflexibility of existing methods and the computationally expensive eigen-decomposition and dense matrix multiplication, this paper proposes an Adaptive Spectral Wavelet Transform-based Self-Supervised Graph Neural Network (ASWT-SGNN). The proposed method employs spectral adaptive polynomials to approximate the filter function and optimizes the wavelet using a contrastive loss. This design enables the creation of local filters in both the spectral and spatial domains, allowing flexible aggregation of neighborhood information at various scales and facilitating controlled transformation between local and global information. Compared to existing methods, the proposed approach reduces computational complexity and addresses the limitation of graph convolutional neural networks, which are constrained by graph size and lack flexible control over the neighborhood. Extensive experiments on eight benchmark datasets demonstrate that ASWT-SGNN accurately approximates the filter function in high-density spectral regions, avoiding costly eigen-decomposition. Furthermore, ASWT-SGNN achieves performance comparable to state-of-the-art models in node classification tasks.
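
Polynomial spectral filtering without eigen-decomposition, which the abstract relies on, can be sketched with a Chebyshev recurrence as below; the adaptive wavelet coefficients learned by a contrastive loss in the paper are replaced here with fixed illustrative coefficients.

import numpy as np
import scipy.sparse as sp

def chebyshev_filter(adj, x, coeffs):
    """Apply a spectral filter g(L) ~ sum_k c_k T_k(L_scaled) to node features
    using the Chebyshev recurrence, with no eigen-decomposition."""
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = sp.eye(n) - d_inv_sqrt @ adj @ d_inv_sqrt      # normalized Laplacian, spectrum in [0, 2]
    L_scaled = L - sp.eye(n)                           # shift spectrum to [-1, 1]

    t_prev, t_curr = x, L_scaled @ x                   # T_0(L)x, T_1(L)x
    out = coeffs[0] * t_prev + coeffs[1] * t_curr
    for c in coeffs[2:]:
        t_next = 2 * (L_scaled @ t_curr) - t_prev      # Chebyshev recurrence
        out = out + c * t_next
        t_prev, t_curr = t_curr, t_next
    return out

# Toy graph: a 4-node path graph with random 8-dimensional node features.
adj = sp.csr_matrix(np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float))
x = np.random.randn(4, 8)
print(chebyshev_filter(adj, x, coeffs=[0.5, 0.3, 0.2]).shape)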



Paperid:1552
Authors:Shizhan Liu, Zhengkai Jiang, Yuxi Li, Jinlong Peng, Yabiao Wang, Weiyao Lin
Shanghai Jiao Tong University, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Shanghai Jiao Tong University
Abstract:
Active domain adaptation has emerged as a solution for balancing the expensive annotation cost and the performance of trained models in semantic segmentation. However, existing works usually ignore the correlation between selected samples and their local context in the feature space, which leads to inferior usage of the annotation budget. In this work, we revisit the theoretical bound of the classical Coreset method and identify that the performance is closely related to the local sample distribution around selected samples. To estimate the density of local samples efficiently, we introduce a local proxy estimator with Dynamic Masked Convolution and develop a Density-aware Greedy algorithm to optimize the bound. Extensive experiments demonstrate the superiority of our approach. Moreover, with very few labels, our scheme achieves comparable performance to the fully supervised counterpart.



Paperid:1553
Authors:Sihang Liu, Wenming Cao, Ruigang Fu, Kaixiang Yang, Zhiwen Yu
South China University of Technology, Chongqing Jiaotong University, National University of Defense Technology, South China University of Technology, South China University of Technology Peng Cheng Laboratory
Abstract:
Clustering methods achieve performance improvements by jointly learning representations and cluster assignments. However, they do not consider the confidence of pseudo-labels, which are not optimal as supervision and thus lead to error accumulation. To address this issue, we propose a Robust Pseudo-labeling for Semantic Clustering (RPSC) approach, which includes two stages. In the first stage (RPSC-Self), we design a semantic pseudo-labeling scheme by using the consistency of samples, i.e., samples with the same semantics should be close to each other in the embedding space. To exploit robust semantic pseudo-labels for self-supervised learning, we propose a soft contrastive loss (SCL) that encourages the model to believe high-confidence semantic pseudo-labels and to be less driven by low-confidence pseudo-labels. In the second stage (RPSC-Semi), we first determine the semantic pseudo-label of a sample based on the distance between itself and the cluster centers, and then screen out reliable semantic pseudo-labels by exploiting consistency. These reliable pseudo-labels are used as supervision in a pseudo-semi-supervised learning algorithm to further improve performance. Experimental results show that RPSC significantly outperforms 18 competitive clustering algorithms on six challenging image benchmarks. In particular, RPSC achieves an accuracy of 0.688 on ImageNet-Dogs, an improvement of up to 24% compared with the second-best method. Meanwhile, we conduct ablation studies to investigate the effects of different augmentation strategies on RPSC as well as the contributions of the terms in SCL to clustering performance. Besides, experimental results indicate that SCL can be easily integrated into existing clustering methods and bring performance improvements.



Paperid:1554
Authors:Suyuan Liu, Junpu Zhang, Yi Wen, Xihong Yang, Siwei Wang, Yi Zhang, En Zhu, Chang Tang, Long Zhao, Xinwang Liu
School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, Intelligent Game and Decision Lab, School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, School of Computer Science, China University of Geosciences, Shandong Computer Science Center, Qilu University of Technology, School of Computer, National University of Defense Technology
Abstract:
Incomplete multi-view clustering has attracted much attention due to its ability to handle partial multi-view data. Recently, similarity-based methods have been developed to explore the complete relationships among incomplete multi-view data. Although widely applied to partial scenarios, most existing approaches still face two limitations. First, fusing similarities constructed individually on each view fails to yield a complete unified similarity. Moreover, incomplete similarity generation may lead to anomalous similarity values under column-sum constraints, affecting the final clustering results. To solve these challenging issues, we propose a Sample-level Cross-view Similarity Learning (SCSL) method for incomplete multi-view clustering. Specifically, we project all samples to the same dimension and simultaneously construct a complete similarity matrix across views based on the inter-view and intra-view sample relationships. In addition, a simultaneously learned consensus representation ensures the validity of the projection, which further enhances the quality of the similarity matrix through graph Laplacian regularization. Experimental results on six benchmark datasets demonstrate the ability of SCSL to process incomplete multi-view clustering tasks. Our code is publicly available at https://github.com/Tracesource/SCSL.



Paperid:1555
Authors:Xinhui Liu, Zhenghao Chen, Luping Zhou, Dong Xu, Wei Xi, Gairui Bai, Yihan Zhao, Jizhong Zhao
Xi'an Jiaotong University University of Sydney, University of Sydney, University of Sydney, The University of Hong Kong, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
Conventional Federated Domain Adaptation (FDA) approaches usually demand an abundance of assumptions, which makes them significantly less feasible for real-world situations and introduces security hazards. This paper relaxes the assumptions of previous FDA work and studies a more practical scenario named Universal Federated Domain Adaptation (UFDA). It only requires the black-box model and the label-set information of each source domain, while the label sets of different source domains may be inconsistent and the target-domain label set is totally blind. Towards a more effective solution for our newly proposed UFDA scenario, we propose a corresponding methodology called Hot-Learning with Contrastive Label Disambiguation (HCLD). It tackles UFDA's domain-shift and category-gap problems using the one-hot outputs from the black-box models of the various source domains. Moreover, to better distinguish the shared and unknown classes, we further present a cluster-level strategy named Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer classes from both the source and target domains. Extensive experiments on three benchmark datasets demonstrate that our method achieves comparable performance for our UFDA scenario with much fewer assumptions, compared to previous methodologies with comprehensive additional assumptions.



Paperid:1556
Authors:Yanhe Liu, Peng Wang, Wenjun Ke, Guozheng Li, Xiye Chen, Jiteng Zhao, Ziyu Shang
School of Computer Science and Engineering, Southeast University, Nanjing, China, School of Computer Science and Engineering, Southeast University, Nanjing, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing, China, Nanjing University of Finance & Economics, School of Computer Science and Engineering, Southeast University, Nanjing, China, School of Computer Science and Engineering, Southeast University, Nanjing, China
Abstract:
Supervised named entity recognition (NER) aims to classify entity mentions into a fixed number of predefined types. However, in real-world scenarios, unknown entity types continually emerge. Naive fine-tuning results in catastrophic forgetting of old entity types. Existing continual methods usually depend on knowledge distillation to alleviate forgetting, which is less effective on long task sequences. Moreover, most of them are specific to the class-incremental scenario and cannot adapt to the online scenario, which is more common in practice. In this paper, we propose a unified framework called Contrastive Real-time Updating Prototype (CRUP) that can handle different scenarios for NER. Specifically, we train a Gaussian projection model with a regularized contrastive objective. After training on each batch, we store the mean vectors of representations belonging to new entity types as their prototypes. Meanwhile, we update existing prototypes belonging to old types based only on representations in the current batch. The final prototypes are used for nearest-class-mean classification. In this way, CRUP can handle different scenarios through its batch-wise learning. Moreover, CRUP can alleviate forgetting in continual scenarios with only current data instead of old data. To comprehensively evaluate CRUP, we construct extensive benchmarks based on various datasets. Experimental results show that CRUP significantly outperforms baselines in continual scenarios and is also competitive in the supervised scenario.
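
A minimal sketch of the batch-wise prototype bookkeeping and nearest-class-mean step described above; the Gaussian projection model and the contrastive training that produce the embeddings are omitted, and the class API is an assumption.

import torch

class PrototypeStore:
    """Running class prototypes updated from each incoming batch only,
    followed by nearest-class-mean classification."""
    def __init__(self):
        self.means, self.counts = {}, {}

    def update(self, embeddings, labels):
        for c in labels.unique().tolist():
            batch_mean = embeddings[labels == c].mean(dim=0)
            n_new = int((labels == c).sum())
            if c not in self.means:
                self.means[c], self.counts[c] = batch_mean, n_new
            else:
                n_old = self.counts[c]
                # Weighted running mean: no old data needed, only old counts.
                self.means[c] = (n_old * self.means[c] + n_new * batch_mean) / (n_old + n_new)
                self.counts[c] = n_old + n_new

    def classify(self, embeddings):
        classes = sorted(self.means)
        protos = torch.stack([self.means[c] for c in classes])
        dists = torch.cdist(embeddings, protos)
        return torch.tensor(classes)[dists.argmin(dim=1)]

store = PrototypeStore()
store.update(torch.randn(16, 8), torch.randint(0, 3, (16,)))
print(store.classify(torch.randn(4, 8)))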



Paperid:1557
Authors:Yu Liu, Runzhe Wan, James McQueen, Doug Hains, Jinxiang Gu, Rui Song
Amazon.com Inc, Amazon.com Inc, Amazon.com Inc, Amazon.com Inc, Amazon.com Inc, Amazon.com Inc
Abstract:
The selection of the assumed effect size (AES) critically determines the duration of an experiment, and hence its accuracy and efficiency. Traditionally, experimenters determine the AES based on domain knowledge. However, this method becomes impractical for online experimentation services managing numerous experiments, and a more automated approach is hence in great demand. We initiate the study of data-driven AES selection for online experimentation services by introducing two solutions. The first employs a three-layer Gaussian Mixture Model that accounts for the heteroskedasticity across experiments, and it seeks to estimate the true expected effect size among positive experiments. The second method, grounded in utility theory, aims to determine the optimal effect size by striking a balance between the experiment's cost and the precision of decision-making. Through comparisons with baseline methods using both simulated and real data, we showcase the superior performance of the proposed approaches.



Paperid:1558
Authors:Yu Liu, Guihe Qin, Haipeng Chen, Zhiyong Cheng, Xun Yang
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China, College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China, College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China, Qilu University of Technology (Shandong Academy of Sciences), JiNan, China, University of Science and Technology of China, HeFei, China
Abstract:
Text-based Person Retrieval (TPR) aims to retrieve relevant images of specific pedestrians based on a given textual query. The mainstream approaches primarily leverage pretrained deep neural networks to learn a mapping of the visual and textual modalities into a common latent space for cross-modality matching. Despite their remarkable achievements, existing efforts mainly focus on learning the statistical cross-modality correlations found in the training data, rather than the intrinsic causal correlations. As a result, they often struggle to retrieve accurately in the face of environmental changes such as illumination, pose, and occlusion, or when encountering images with similar attributes. In this regard, we pioneer the observation of TPR from a causal view. Specifically, we assume that each image is composed of a mixture of causal factors (which are semantically consistent with text descriptions) and non-causal factors (retrieval-irrelevant, e.g., background), and only the former can lead to reliable retrieval judgments. Our goal is to extract text-critical robust visual representations (i.e., causal factors) and establish domain-invariant cross-modality correlations for accurate and reliable retrieval. However, causal/non-causal factors are unobserved, so we emphasize that ideal causal factors that can simulate causal scenes should satisfy two basic principles: 1) Independence: being independent of non-causal factors, and 2) Sufficiency: being causally sufficient for TPR across different environments. Building on this, we propose an Invariant Representation Learning method for TPR (IRLT) that enforces the visual representations to satisfy the two aforementioned critical properties. Extensive experiments on three datasets clearly demonstrate the advantages of IRLT over leading baselines in terms of accuracy and generalization.



Paperid:1559
Authors:Yuhang Liu, Daowan Peng, Wei Wei, Yuanyuan Fu, Wenfeng Xie, Dangyang Chen
CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL) ByteDance Inc., CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Ping An Property & Casualty Insurance Company of China, Ltd Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Ping An Property & Casualty Insurance Company of China, Ltd Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Ping An Property & Casualty Insurance Company of China, Ltd Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL)
Abstract:
Recently, neural module networks (NMNs) have yielded ongoing success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose a complex question into several sub-tasks using instance modules from the question's reasoning path and then exploit intermediate supervisions to guide answer prediction, thereby improving inference interpretability. However, their performance may be hindered by sketchy modeling of intermediate supervisions. For instance, (1) a prior assumption that each instance module refers to only one grounded object overlooks other potentially associated grounded objects, impeding full cross-modal alignment learning; (2) IoU-based intermediate supervisions may introduce noisy signals, as the bounding-box overlap issue might guide the model's focus towards irrelevant objects. To address these issues, we propose a novel method, Detection-based Intermediate Supervision (DIS), which adopts a generative detection framework to facilitate multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances the consistency in answering compositional questions and their sub-questions. Extensive experiments demonstrate the superiority of our proposed DIS, showcasing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches.



Paperid:1560
Authors:Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
Peking University, Microsoft Corporation, Microsoft Corporation, Microsoft Corporation, Microsoft Corporation, Microsoft Corporation, Microsoft Corporation, Microsoft Corporation, Microsoft Corporation
Abstract:
Diffusion models have demonstrated an exceptional capability for generating high-quality images, videos, and audio. Due to their adaptiveness in iterative refinement, they offer strong potential for achieving better non-autoregressive sequence generation. However, existing text diffusion models still fall short in their performance due to the challenge of handling the discreteness of language. This paper thoroughly analyzes text diffusion models and uncovers two significant limitations: degradation of self-conditioning during training and misalignment between training and sampling. Motivated by our findings, we propose a novel text diffusion model called TReC, which mitigates the degradation with Reinforced Conditioning and the misalignment with Time-Aware Variance Scaling. Our extensive experiments demonstrate the competitiveness of TReC against autoregressive, non-autoregressive, and diffusion baselines. Moreover, qualitative analysis shows its advanced ability to fully utilize the diffusion process in refining samples.



Paperid:1561
Authors:Zhen Liu, Wenbin Pei, Disen Lan, Qianli Ma
South China University of Technology, Dalian University of Technology, South China University of Technology, South China University of Technology
Abstract:
Semi-supervised time-series classification can effectively alleviate the issue of lacking labeled data. However, existing approaches usually ignore model interpretability, making it difficult for humans to understand the principles behind a model's predictions. Shapelets are a set of discriminative subsequences that show high interpretability in time series classification tasks. Shapelet learning-based methods have demonstrated promising classification performance. Unfortunately, without enough labeled data, the shapelets learned by existing methods are often poorly discriminative and even dissimilar to any subsequence of the original time series. To address this issue, we propose the Diffusion Language-Shapelets model (DiffShape) for semi-supervised time series classification. In DiffShape, a self-supervised diffusion learning mechanism is designed, which uses real subsequences as a condition. This helps to increase the similarity between the learned shapelets and real subsequences by using a large amount of unlabeled data. Furthermore, we introduce a contrastive language-shapelets learning strategy that improves the discriminability of the learned shapelets by incorporating natural language descriptions of the time series. Experiments have been conducted on the UCR time series archive, and the results reveal that the proposed DiffShape method achieves state-of-the-art performance and exhibits superior interpretability over baselines.



Paperid:1562
Authors:Zhuanghua Liu, Bryan Kian Hsiang Low
National University of Singapore CNRS@CREATE LTD, 1 Create Way, #08-01 CREATE Tower, Singapore 138602, National University of Singapore
Abstract:
We consider the optimization problem of minimizing the sum-of-nonconvex function, i.e., a convex function that is the average of nonconvex components. The existing stochastic algorithms for such a problem only focus on a single machine and the centralized scenario. In this paper, we study the sum-of-nonconvex optimization in the decentralized setting. We present a new theoretical analysis of the PMGT-SVRG algorithm for this problem and prove the linear convergence of this approach. However, the convergence rate of the PMGT-SVRG algorithm has a linear dependency on the condition number, which is undesirable for the ill-conditioned problem. To remedy this issue, we propose an accelerated stochastic decentralized first-order algorithm by incorporating the techniques of acceleration, gradient tracking, and multi-consensus mixing into the SVRG algorithm. The convergence rate of the proposed method has a square-root dependency on the condition number. The numerical experiments validate the theoretical guarantee of our proposed algorithms on both synthetic and real-world datasets.



Paperid:1563
Authors:Zhuanghua Liu, Luo Luo, Bryan Kian Hsiang Low
National University of Singapore, CNRS@CREATE LTD, 1 Create Way, #08-01 CREATE Tower, Singapore 138602, Fudan University, National University of Singapore
Abstract:
We consider the finite-sum optimization problem, where each component function is strongly convex and has Lipschitz continuous gradient and Hessian. The recently proposed incremental quasi-Newton method is based on BFGS update and achieves a local superlinear convergence rate that is dependent on the condition number of the problem. This paper proposes a more efficient quasi-Newton method by incorporating the symmetric rank-1 update into the incremental framework, which results in the condition-number-free local superlinear convergence rate. Furthermore, we can boost our method by applying the block update on the Hessian approximation, which leads to an even faster local convergence rate. The numerical experiments show the proposed methods significantly outperform the baseline methods.



Paperid:1564
Authors:Zichen Liu, Hongbo Sun, Yuxin Peng, Jiahuan Zhou
Peking University, Peking University, Peking University, Peking University
Abstract:
As an up-and-coming area, CLIP-based pre-trained vision-language models can readily facilitate downstream tasks through zero-shot or few-shot fine-tuning. However, they still face critical challenges in test-time generalization due to the shifts between the training and test data distributions, hindering further performance improvement. To address this crucial problem, the latest works have introduced Test-Time Adaptation (TTA) techniques to CLIP, which dynamically learn text prompts using only test samples. However, their limited learning capacity, caused by overlooking visual-modality information, and their underutilization of knowledge from previously seen test samples result in reduced performance. In this paper, we propose a novel Dual-modal Adaptive online prompting and knowledge ReTention method called DART to overcome these challenges. To increase the learning capacity, DART captures knowledge from each test sample by learning class-specific text prompts and instance-level image prompts. Additionally, to fully leverage the knowledge from previously seen test samples, DART utilizes dual-modal knowledge retention prompts to adaptively retain the acquired knowledge, thereby enhancing the predictions on subsequent test samples. Extensive experiments on various large-scale benchmarks demonstrate the effectiveness of our proposed DART against state-of-the-art methods.



Paperid:1565
Authors:Zihao Liu, Tianhao Wang, Mengdi Huai, Chenglin Miao
Iowa State University, University of Virginia, Iowa State University, Iowa State University
Abstract:
As a new paradigm to erase data from a model and protect user privacy, machine unlearning has drawn significant attention. However, existing studies on machine unlearning mainly focus on its effectiveness and efficiency, neglecting the security challenges introduced by this technique. In this paper, we aim to bridge this gap and study the possibility of conducting malicious attacks leveraging machine unlearning. Specifically, we consider the backdoor attack via machine unlearning, where an attacker seeks to inject a backdoor in the unlearned model by submitting malicious unlearning requests, so that the prediction made by the unlearned model can be changed when a particular trigger is present. In our study, we propose two attack approaches. The first attack approach does not require the attacker to poison any training data of the model. The attacker can achieve the attack goal only by requesting to unlearn a small subset of his contributed training data. The second approach allows the attacker to poison a few training instances with a predefined trigger upfront, and then activate the attack via submitting a malicious unlearning request. Both attack approaches are proposed with the goal of maximizing the attack utility while ensuring attack stealthiness. The effectiveness of the proposed attacks is demonstrated with different machine unlearning algorithms as well as different models on different datasets.



Paperid:1566
Authors:Michael Livanos, Ian Davidson, Stephen Wong
University of California, Davis, University of California, Davis, University of California, Davis
Abstract:
Knowledge distillation is a simple but powerful way to transfer knowledge from a teacher model to a student model. Existing work suffers from at least one of the following key limitations in terms of direction and scope of transfer which restrict its use: all knowledge is transferred from teacher to student regardless of whether or not that knowledge is useful, the student is the only one learning in this exchange, and typically distillation transfers knowledge only from a single teacher to a single student. We formulate a novel form of knowledge distillation in which many models can act as both students and teachers, which we call cooperative distillation. The models cooperate as follows: a model (the student) identifies specific deficiencies in its performance and searches for another model (the teacher) that encodes learned knowledge into instructional virtual instances via counterfactual instance generation. Because different models may have different strengths and weaknesses, all models can act as either students or teachers (cooperation) when appropriate and only distill knowledge in areas specific to their strengths (focus). Since counterfactuals as a paradigm are not tied to any specific algorithm, we can use this method to distill knowledge between learners of different architectures, algorithms, and even feature spaces. We demonstrate our approach not only outperforms baselines such as transfer learning, self-supervised learning, and multiple knowledge distillation algorithms on several datasets, but it can also be used in settings where the aforementioned techniques cannot.



Paperid:1567
Authors:Sheng Long, Wei Tao, Shuohao LI, Jun Lei, Jun Zhang
Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China, Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China Strategic Assessments and Consultation Institute, Academy of Military Science, Beijing 100091, China, Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China, Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China, Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China
Abstract:
Adversarial examples are commonly created by solving a constrained optimization problem, typically using sign-based methods like Fast Gradient Sign Method (FGSM). These attacks can benefit from momentum with a constant parameter, such as Momentum Iterative FGSM (MI-FGSM), to enhance black-box transferability. However, the monotonic time-varying momentum parameter is required to guarantee convergence in theory, creating a theory-practice gap. Additionally, recent work shows that sign-based methods fail to converge to the optimum in several convex settings, exacerbating the issue. To address these concerns, we propose a novel method which incorporates both an innovative adaptive momentum parameter without monotonicity assumptions and an adaptive step-size scheme that replaces the sign operation. Furthermore, we derive a regret upper bound for general convex functions. Experiments on multiple models demonstrate the efficacy of our method in generating adversarial examples with human-imperceptible noise while achieving high attack success rates, indicating its superiority over previous adversarial example generation methods.



Paperid:1568
Authors:Guy Lorberbom, Itai Gat, Yossi Adi, Alexander Schwing, Tamir Hazan
Technion, Facebook AI Research, The Hebrew University of Jerusalem, University of Illinois at Urbana-Champaign, Technion
Abstract:
Backpropagation, which uses the chain rule, is the de facto standard algorithm for optimizing neural networks nowadays. Recently, Hinton (2022) proposed the forward-forward algorithm, a promising alternative that optimizes neural nets layer-by-layer, without propagating gradients throughout the network. Although such an approach has several advantages over back-propagation and shows promising results, the fact that each layer is being trained independently limits the optimization process. Specifically, it prevents the network's layers from collaborating to learn complex and rich features. In this work, we study layer collaboration in the forward-forward algorithm. We show that the current version of the forward-forward algorithm is suboptimal when considering information flow in the network, resulting in a lack of collaboration between layers of the network. We propose an improved version that supports layer collaboration to better utilize the network structure, while not requiring any additional assumptions or computations. We empirically demonstrate the efficacy of the proposed version when considering both information flow and objective metrics. Additionally, we provide a theoretical motivation for the proposed method, inspired by functional entropy theory.



Paperid:1569
Authors:Feng Lu, Wei Li, Yifei Sun, Cheng Song, Yufei Ren, Albert Y. Zomaya
School of Computer Science and Technology, Huazhong University of Science and Technology, China, The Australia-China Joint Research Centre for Energy Informatics and Demand Response Technologies, Centre for Distributed and High Performance Computing, School of Computer Science, The University of Sydney, Australia, School of Computer Science and Technology, Huazhong University of Science and Technology, China, School of Computer Science and Technology, Huazhong University of Science and Technology, China, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, China, The Australia-China Joint Research Centre for Energy Informatics and Demand Response Technologies, Centre for Distributed and High Performance Computing, School of Computer Science, The University of Sydney, Australia
Abstract:
Artificial intelligence (AI) has immense potential in time series prediction, but most explainable tools have limited capabilities in providing a systematic understanding of important features over time. These tools typically rely on evaluating a single time point, overlook the time ordering of inputs, and neglect the time-sensitive nature of time series applications. These factors make it difficult for users, particularly those without domain knowledge, to comprehend AI model decisions and obtain meaningful explanations. We propose CGS-Mask, a post-hoc and model-agnostic cellular genetic strip mask-based saliency approach to address these challenges. CGS-Mask uses consecutive time steps as a cohesive entity to evaluate the impact of features on the final prediction, providing binary and sustained feature importance scores over time. Our algorithm optimizes the mask population iteratively to obtain the optimal mask in a reasonable time. We evaluated CGS-Mask on synthetic and real-world datasets, and it outperformed state-of-the-art methods in elucidating the importance of features over time. According to our pilot user study via a questionnaire survey, CGS-Mask is the most effective approach in presenting easily understandable time series prediction results, enabling users to comprehend the decision-making process of AI models with ease.



Paperid:1570
Authors:Kangkang Lu, Yanhua Yu, Hao Fei, Xuan Li, Zixuan Yang, Zirui Guo, Meiyu Liang, Mengran Yin, Tat-Seng Chua
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, National University of Singapore, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, National University of Singapore
Abstract:
In recent years, spectral graph neural networks, characterized by polynomial filters, have garnered increasing attention and have achieved remarkable performance in tasks such as node classification. These models typically assume that eigenvalues for the normalized Laplacian matrix are distinct from each other, thus expecting a polynomial filter to have a high fitting ability. However, this paper empirically observes that normalized Laplacian matrices frequently possess repeated eigenvalues. Moreover, we theoretically establish that the number of distinguishable eigenvalues plays a pivotal role in determining the expressive power of spectral graph neural networks. In light of this observation, we propose an eigenvalue correction strategy that can free polynomial filters from the constraints of repeated eigenvalue inputs. Concretely, the proposed eigenvalue correction strategy enhances the uniform distribution of eigenvalues, thus mitigating repeated eigenvalues, and improving the fitting capacity and expressive power of polynomial filters. Extensive experimental results on both synthetic and real-world datasets demonstrate the superiority of our method.
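For illustration only, a minimal sketch of what an eigenvalue-correction step could look like, assuming a simple convex blend between the sorted Laplacian eigenvalues and an equispaced grid on [0, 2]; the function name, the `beta` parameter, and the blending rule are hypothetical and not taken from the paper.

```python
import numpy as np

def corrected_eigenvalues(adj, beta=0.5):
    """Eigendecompose the normalized Laplacian of a dense adjacency matrix and
    blend its sorted eigenvalues with an equispaced grid on [0, 2], so that
    repeated eigenvalues become distinguishable inputs for a polynomial filter.
    Illustrative sketch; not the paper's exact correction strategy."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    lap = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)           # eigenvalues in ascending order
    uniform = np.linspace(0.0, 2.0, len(eigvals))    # equispaced surrogate spectrum
    return beta * eigvals + (1 - beta) * uniform, eigvecs
```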



Paperid:1571
Authors:Liming Lu, Zhenghan Chen, Xiaoyu Lu, Yihang Rao, Lujun Li, Shuchao Pang
School of Cyber Science and Engineering, Nanjing University of Science and Technology, Peking University, School of Cyber Science and Engineering, Nanjing University of Science and Technology, School of Cyber Science and Engineering, Nanjing University of Science and Technology, The Hong Kong University of Science and Technology, School of Cyber Science and Engineering, Nanjing University of Science and Technology School of Computing, Macquarie University
Abstract:
In this paper, we present UniADS, the first Universal Architecture-Distiller Search framework for co-optimizing student architecture and distillation policies. The teacher-student distillation gap limits distillation gains. Previous approaches seek to discover the ideal student architecture while ignoring distillation settings. In UniADS, we construct a comprehensive search space encompassing an architectural search for student models, knowledge transformations in distillation strategies, distance functions, loss weights, and other vital settings. To efficiently explore the search space, we utilize the NSGA-II genetic algorithm for better crossover and mutation configurations and employ the Successive Halving algorithm for search space pruning, resulting in improved search efficiency and promising results. Extensive experiments are performed on different teacher-student pairs using CIFAR-100 and ImageNet datasets. The experimental results consistently demonstrate the superiority of our method over existing approaches. Furthermore, we provide a detailed analysis of the search results, examining the impact of each variable and extracting valuable insights and practical guidance for distillation design and implementation.



Paperid:1572
Authors:Weigang Lu, Ziyu Guan, Wei Zhao, Yaming Yang, Long Jin
Xidian University, Xidian University, Xidian University, Xidian University, Xidian University
Abstract:
Graph Neural Networks (GNNs) have become mainstream methods for solving the semi-supervised node classification problem. However, due to the uneven location distribution of labeled nodes in the graph, labeled nodes are only accessible to a small portion of unlabeled nodes, leading to the under-reaching issue. In this study, we first reveal under-reaching by conducting an empirical investigation on various well-known graphs. Then, we demonstrate that under-reaching results in unsatisfactory distribution alignment between labeled and unlabeled nodes through systematic experimental analysis, significantly degrading GNNs' performance. To tackle under-reaching for GNNs, we propose an architecture-agnostic method dubbed NodeMixup. The fundamental idea is to (1) increase the reachability of labeled nodes by labeled-unlabeled pairs mixup, (2) leverage graph structures via fusing the neighbor connections of intra-class node pairs to improve performance gains of mixup, and (3) use neighbor label distribution similarity incorporating node degrees to determine sampling weights for node mixup. Extensive experiments demonstrate the efficacy of NodeMixup in assisting GNNs in handling under-reaching. The source code is available at https://github.com/WeigangLu/NodeMixup.
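As a rough illustration of the labeled-unlabeled mixup idea described above (not the paper's exact NodeMixup procedure, which also fuses neighbor connections and uses degree-aware sampling weights), a minimal sketch:

```python
import numpy as np

def node_mixup(x_labeled, y_labeled, x_unlabeled, y_pseudo, alpha=1.0, rng=None):
    """Interpolate features and (one-hot or pseudo) labels of randomly paired
    labeled and unlabeled nodes with a Beta-distributed coefficient.
    Illustrative sketch under simplified assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    idx = rng.choice(len(x_unlabeled), size=len(x_labeled), replace=True)
    x_mix = lam * x_labeled + (1 - lam) * x_unlabeled[idx]
    y_mix = lam * y_labeled + (1 - lam) * y_pseudo[idx]
    return x_mix, y_mix
```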



Paperid:1573
Authors:Yang Lu, Lin Chen, Yonggang Zhang, Yiliang Zhang, Bo Han, Yiu-ming Cheung, Hanzi Wang
Xiamen University, Xiamen University, Hong Kong Baptist University, Xiamen University, Hong Kong Baptist University, Hong Kong Baptist University, Xiamen University
Abstract:
Federated learning (FL) has shown remarkable success in cooperatively training deep models, while typically struggling with noisy labels. Advanced works propose to tackle label noise by a reweighting strategy with a strong assumption, i.e., mild label noise. However, it may be violated in many real-world FL scenarios because of highly contaminated clients, resulting in extreme noise ratios, e.g., >90%. To tackle extremely noisy clients, we study the robustness of the re-weighting strategy, showing a pessimistic conclusion: minimizing the weight of clients trained over noisy data outperforms re-weighting strategies. To leverage models trained on noisy clients, we propose a novel approach, called negative distillation (FedNed). FedNed first identifies noisy clients and employs rather than discards the noisy clients in a knowledge distillation manner. In particular, clients identified as noisy ones are required to train models using noisy labels and pseudo-labels obtained by global models. The model trained on noisy labels serves as a ‘bad teacher’ in knowledge distillation, aiming to decrease the risk of providing incorrect information. Meanwhile, the model trained on pseudo-labels is involved in model aggregation if not identified as a noisy client. Consequently, through pseudo-labeling, FedNed gradually increases the trustworthiness of models trained on noisy clients, while leveraging all clients for model aggregation through negative distillation. To verify the efficacy of FedNed, we conduct extensive experiments under various settings, demonstrating that FedNed can consistently outperform baselines and achieve state-of-the-art performance.



Paperid:1574
Authors:Yiding Lu, Yijie Lin, Mouxing Yang, Dezhong Peng, Peng Hu, Xi Peng
College of Computer Science, Sichuan Univerisity, College of Computer Science, Sichuan Univerisity, College of Computer Science, Sichuan Univerisity, College of Computer Science, Sichuan Univerisity, College of Computer Science, Sichuan Univerisity, College of Computer Science, Sichuan Univerisity
Abstract:
Recently, some robust contrastive multi-view clustering (MvC) methods have been proposed, which construct data pairs from neighborhoods to alleviate the false negative issue, i.e., some intra-cluster samples are wrongly treated as negative pairs. Although promising performance has been achieved by these methods, the false negative issue is still far from addressed and the false positive issue emerges because all in- and out-of-neighborhood samples are simply treated as positive and negative, respectively. To address the issues, we propose a novel robust method, dubbed decoupled contrastive multi-view clustering with high-order random walks (DIVIDE). In brief, DIVIDE leverages random walks to progressively identify data pairs in a global instead of local manner. As a result, DIVIDE could identify in-neighborhood negatives and out-of-neighborhood positives. Moreover, DIVIDE embraces a novel MvC architecture to perform inter- and intra-view contrastive learning in different embedding spaces, thus boosting clustering performance and embracing the robustness against missing views. To verify the efficacy of DIVIDE, we carry out extensive experiments on four benchmark datasets comparing with nine state-of-the-art MvC methods in both complete and incomplete MvC settings. The code is released on https://github.com/XLearning-SCU/2024-AAAI-DIVIDE.



Paperid:1575
Authors:Yongfan Lu, Bingdong Li, Aimin Zhou
East China Normal University, East China Normal University, East China Normal University
Abstract:
Optimizing multiple conflicting black-box objectives simultaneously is a prevalent occurrence in many real-world applications, such as neural architecture search and machine learning. These problems are known as expensive multi-objective optimization problems (EMOPs) when the function evaluations are computationally or financially costly. Multi-objective Bayesian optimization (MOBO) offers an efficient approach to discovering a set of Pareto optimal solutions. However, the data deficiency issue caused by limited function evaluations has posed a great challenge to current optimization methods. Moreover, most current methods tend to prioritize the quality of candidate solutions, while ignoring the quantity of promising samples. In order to tackle these issues, our paper proposes a novel multi-objective Bayesian optimization algorithm with a data augmentation strategy that provides ample high-quality samples for Pareto set learning (PSL). Specifically, it utilizes Generative Adversarial Networks (GANs) to enrich data and a dominance prediction model to screen out high-quality samples, mitigating the predicament of limited function evaluations in EMOPs. Additionally, we adopt the regularity model to expensive multi-objective Bayesian optimization for PSL. Experimental results on both synthetic and real-world problems demonstrate that our algorithm outperforms several state-of-the-art and classical algorithms.



Paperid:1576
Authors:Zhuqiang Lu, Kun Hu, Chaoyue Wang, Lei Bai, Zhiyong Wang
The University of Sydney, The University of Sydney, JD.com, Shanghai AI Laboratory, The University of Sydney
Abstract:
A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been an increasing interest in synthesising 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, for providing immersive experiences in various scenarios such as virtual reality. Yet, existing methods typically fall short in synthesizing intricate visual details or ensuring that the generated images align consistently with user-provided prompts. In this study, an autoregressive omni-aware generative network (AOG-Net) is proposed for 360-degree image generation by outpainting an incomplete 360-degree image progressively with NFoV and text guidances jointly or individually. This autoregressive scheme not only allows for deriving finer-grained and text-consistent patterns by dynamically generating and adjusting the process but also offers users greater flexibility to edit their conditions throughout the generation process. A global-local conditioning mechanism is devised to comprehensively formulate the outpainting guidance in each autoregressive step. Text guidances, omni-visual cues, NFoV inputs and omni-geometry are encoded and further formulated with cross-attention based transformers into a global stream and a local stream into a conditioned generative backbone model. As AOG-Net is compatible with leveraging large-scale models for the conditional encoder and the generative prior, it enables the generation to use extensive open-vocabulary text guidances. Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method. Our code is available at https://github.com/zhuqiangLu/AOG-NET-360.



Paperid:1577
Authors:Nicholas Lui, Bryan Chia, William Berrios, Candace Ross, Douwe Kiela
Stanford University, Stanford University, Contextual AI, Meta AI, Stanford University Contextual AI
Abstract:
Computer vision models have been known to encode harmful biases, leading to the potentially unfair treatment of historically marginalized groups, such as people of color. However, there remains a lack of datasets balanced along demographic traits that can be used to evaluate the downstream fairness of these models. In this work, we demonstrate that diffusion models can be leveraged to create such a dataset. We first use a diffusion model to generate a large set of images depicting various occupations. Subsequently, each image is edited using inpainting to generate multiple variants, where each variant refers to a different perceived race. Using this dataset, we benchmark several vision-language models on a multi-class occupation classification task. We find that images generated with non-Caucasian labels have a significantly higher occupation misclassification rate than images generated with Caucasian labels, and that several misclassifications are suggestive of racial biases. We measure a model’s downstream fairness by computing the standard deviation in the probability of predicting the true occupation label across the different identity groups. Using this fairness metric, we find significant disparities between the evaluated vision-and-language models. We hope that our work demonstrates the potential value of diffusion methods for fairness evaluations.
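The fairness metric described above (standard deviation of the probability of predicting the true occupation label across identity groups) can be sketched as follows; averaging per-image probabilities within each group before taking the standard deviation is an assumption on our part:

```python
import numpy as np

def downstream_fairness_std(probs_by_group):
    """Standard deviation, across identity groups, of the average probability
    assigned to the true occupation label. Lower values indicate more uniform
    treatment of the groups. probs_by_group maps group name -> array of
    P(true label) for that group's images."""
    group_means = [float(np.mean(p)) for p in probs_by_group.values()]
    return float(np.std(group_means))
```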



Paperid:1578
Authors:Chengcheng Ma, Ismail Elezi, Jiankang Deng, Weiming Dong, Changsheng Xu
Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, Huawei Noah's Ark Lab, London, UK, Huawei Noah's Ark Lab, London, UK, Institute of Automation, Chinese Academy of Sciences, Beijing, China, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Abstract:
We address the challenging problem of Long-Tailed Semi-Supervised Learning (LTSSL) where labeled data exhibit imbalanced class distribution and unlabeled data follow an unknown distribution. Unlike in balanced SSL, the generated pseudo-labels are skewed towards head classes, intensifying the training bias. Such a phenomenon is even amplified as more unlabeled data will be mislabeled as head classes when the class distributions of the labeled and unlabeled datasets are mismatched. To solve this problem, we propose a novel method named ComPlementary Experts (CPE). Specifically, we train multiple experts to model various class distributions, each of them yielding high-quality pseudo-labels within one form of class distribution. Besides, we introduce Classwise Batch Normalization for CPE to avoid performance degradation caused by feature distribution mismatch between head and non-head classes. CPE achieves state-of-the-art performances on CIFAR-10-LT, CIFAR-100-LT, and STL-10-LT dataset benchmarks. For instance, on CIFAR-10-LT, CPE improves test accuracy by over 2.22% compared to baselines. Code is available at https://github.com/machengcheng2016/CPE-LTSSL.



Paperid:1579
Authors:Jianfei Ma
Northwestern Polytechnical University
Abstract:
Temporal difference learning (TD) is a foundational concept in reinforcement learning (RL), aimed at efficiently assessing a policy's value function. TD(λ), a potent variant, incorporates a memory trace to distribute the prediction error into the historical context. However, this approach often neglects the significance of historical states and the relative importance of propagating the TD error, influenced by challenges such as visitation imbalance or outcome noise. To address this, we propose a novel TD algorithm named discerning TD learning (DTD), which allows flexible emphasis functions—predetermined or adapted during training—to allocate efforts effectively across states. We establish the convergence properties of our method within a specific class of emphasis functions and showcase its promising potential for adaptation to deep RL contexts. Empirical results underscore that employing a judicious emphasis function not only improves value estimation but also expedites learning across diverse scenarios.
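A minimal sketch of a tabular TD(λ)-style update in which a state-dependent emphasis function weights how strongly each state's eligibility trace absorbs the TD error; the function and parameter names are illustrative and the paper's actual DTD formulation may differ:

```python
import numpy as np

def discerning_td_update(v, trace, s, s_next, r, emphasis,
                         alpha=0.1, gamma=0.99, lam=0.9):
    """One tabular TD(lambda)-style update where emphasis(s) scales how much
    state s's eligibility trace accumulates before the TD error is applied.
    Illustrative sketch; v and trace are numpy arrays over states."""
    delta = r + gamma * v[s_next] - v[s]   # TD error
    trace = gamma * lam * trace            # decay all traces
    trace[s] += emphasis(s)                # emphasis-weighted accumulation
    v = v + alpha * delta * trace          # distribute the error along the trace
    return v, trace
```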



Paperid:1580
Authors:Lianbo Ma, Yuee Zhou, Jianlun Ma, Guo Yu, Qing Li
Northeastern University, Northeastern University, Northeastern University, Nanjing Tech University, Peng Cheng Laboratory
Abstract:
Weight quantization is an effective technique to compress deep neural networks for their deployment on edge devices with limited resources. Traditional loss-aware quantization methods commonly use the quantized gradient to replace the full-precision gradient. However, we discover that the gradient error will lead to an unexpected zig-zagging-like issue in the gradient descent learning procedures, where the gradient directions rapidly oscillate or zig-zag, and this issue seriously slows down model convergence. Accordingly, this paper proposes a one-step-forward-and-backtrack approach for loss-aware quantization that obtains a more accurate and stable gradient direction to counteract this issue. During the gradient descent learning, a one-step forward search is designed to find the trial gradient of the next step, which is adopted to adjust the gradient of the current step towards the direction of fast convergence. After that, we backtrack the current step to update the full-precision and quantized weights through the current-step gradient and the trial gradient. A series of theoretical analyses and experiments on benchmark deep models have demonstrated the effectiveness and competitiveness of the proposed method, and our method especially outperforms others in convergence performance.
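A rough sketch of a one-step-forward-and-backtrack update as described above, assuming a simple mixing of the current-step and trial gradients; `grad_fn`, `quantize`, and the `mix` coefficient are placeholders rather than the paper's exact scheme:

```python
def forward_backtrack_step(w, grad_fn, quantize, lr=0.01, mix=0.5):
    """Take a trial step with the current (quantized-weight) gradient,
    re-evaluate the gradient at the trial point, mix the two gradients to
    damp zig-zagging, then backtrack and apply the mixed update to the
    full-precision weights. Illustrative sketch only."""
    g = grad_fn(quantize(w))               # gradient evaluated at quantized weights
    w_trial = w - lr * g                   # one step forward
    g_trial = grad_fn(quantize(w_trial))   # trial gradient of the next step
    g_mixed = (1 - mix) * g + mix * g_trial
    w_new = w - lr * g_mixed               # backtrack: re-update from the current step
    return w_new, quantize(w_new)
```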



Paperid:1581
Authors:Xiang Ma, Xuemei Li, Lexin Fang, Tianlong Zhao, Caiming Zhang
School of Software, Shandong University, Jinan 250101, China, School of Software, Shandong University, Jinan 250101, China, School of Software, Shandong University, Jinan 250101, China, School of Software, Shandong University, Jinan 250101, China, School of Software, Shandong University, Jinan 250101, China Shandong Provincial Laboratory of Future Intelligence and Financial Engineering, Yantai 264005, China
Abstract:
Time series forecasting is a crucial task in various domains. Caused by factors such as trends, seasonality, or irregular fluctuations, time series often exhibit non-stationarity. It obstructs stable feature propagation through deep layers, disrupts feature distributions, and complicates learning data distribution changes. As a result, many existing models struggle to capture the underlying patterns, leading to degraded forecasting performance. In this study, we tackle the challenge of non-stationarity in time series forecasting with our proposed framework called U-Mixer. By combining Unet and Mixer, U-Mixer effectively captures local temporal dependencies between different patches and channels separately to avoid the influence of distribution variations among channels, and merges low- and high-level features to obtain comprehensive data representations. The key contribution is a novel stationarity correction method, explicitly restoring data distribution by constraining the difference in stationarity between the data before and after model processing to restore the non-stationarity information, while ensuring the temporal dependencies are preserved. Through extensive experiments on various real-world time series datasets, U-Mixer demonstrates its effectiveness and robustness, and achieves 14.5% and 7.7% improvements over state-of-the-art (SOTA) methods.



Paperid:1582
Authors:Yingfan Ma, Xiaoyuan Luo, Kexue Fu, Manning Wang
Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China, Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China, Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Jinan, China, Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, Shanghai 200032, China Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention, Shanghai 200032, China
Abstract:
Pathological images play a vital role in clinical cancer diagnosis. Computer-aided diagnosis utilized on digital Whole Slide Images (WSIs) has been widely studied. The major challenge of using deep learning models for WSI analysis is the huge size of WSI images, and existing methods struggle between end-to-end learning and proper modeling of contextual information. Most state-of-the-art methods utilize a two-stage strategy, in which they use a pre-trained model to extract features of small patches cut from a WSI and then input these features into a classification model. These methods cannot perform end-to-end learning and consider contextual information at the same time. To solve this problem, we propose a framework that models a WSI as a pathologist's observing video and utilizes Transformer to process video clips with a divide-and-conquer strategy, which helps achieve both context-awareness and end-to-end learning. Extensive experiments on three public WSI datasets show that our proposed method outperforms existing SOTA methods in both WSI classification and positive region detection.



Paperid:1583
Authors:Yuting Ma, Yuanzhi Yao, Xiaohua Xu
University of Science and Technology of China, Hefei University of Technology, University of Science and Technology of China
Abstract:
Federated learning (FL) has attracted growing attention since it allows for privacy-preserving collaborative training on decentralized clients without explicitly uploading sensitive data to the central server. However, recent works have revealed that it still has the risk of exposing private data to adversaries. In this paper, we conduct reconstruction attacks and enhance inference attacks on various datasets to better understand that sharing trained classification model parameters to a central server is the main problem of privacy leakage in FL. To tackle this problem, a privacy-preserving image distribution sharing scheme with GAN (PPIDSG) is proposed, which consists of a block scrambling-based encryption algorithm, an image distribution sharing method, and local classification training. Specifically, our method can capture the distribution of a target image domain which is transformed by the block encryption algorithm, and upload generator parameters to avoid classifier sharing with negligible influence on model performance. Furthermore, we apply a feature extractor to motivate model utility and train it separately from the classifier. The extensive experimental results and security analyses demonstrate the superiority of our proposed scheme compared to other state-of-the-art defense methods. The code is available at https://github.com/ytingma/PPIDSG.



Paperid:1584
Authors:Louis Mahon, Thomas Lukasiewicz
University of Edinburgh, University of Oxford, Vienna University of Technology, University of Oxford
Abstract:
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster. Successful existing models have employed various techniques to avoid this problem, most of which require data augmentation or which aim to make the average soft assignment across the dataset the same for each cluster. We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments. Using a Bayesian framework, we derive an intuitive optimization objective that can be straightforwardly included in the training of the encoder network. Tested on four image datasets, it consistently avoids collapse more robustly than other methods and leads to more accurate clustering. We also conduct further experiments and analyses justifying our choice to regularize the hard cluster assignments. Code is available at https://github.com/Lou1sM/online_hard_clustering.



Paperid:1585
Authors:Jayesh Malaviya, Anirban Dasgupta, Rachit Chhaya
IIT, Gandhinagar, IIT, Gandhinagar, DA-IICT, Gandhinagar
Abstract:
While coresets have been growing in terms of their application, barring a few exceptions, they have mostly been limited to unsupervised settings. We consider supervised classification problems and non-decomposable evaluation measures in such settings. We show that stratified uniform sampling-based coresets have excellent empirical performance that is backed by theoretical guarantees. We focus on the F1 score and Matthews Correlation Coefficient, two widely used non-decomposable objective functions that are non-trivial to optimize for, and show that uniform coresets attain a lower bound for coreset size, and have good empirical performance, comparable with "smarter" coreset construction strategies.
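A minimal sketch of a stratified uniform sampling coreset of the kind discussed above, assuming per-class proportional allocation and inverse-sampling-rate weights; this is an illustration, not the paper's construction:

```python
import numpy as np

def stratified_uniform_coreset(y, size, rng=None):
    """Sample indices uniformly within each class, allocating points in
    proportion to class frequency, and return reweighting factors so the
    weighted coreset approximates full-data class counts. Illustrative sketch."""
    rng = np.random.default_rng() if rng is None else rng
    idx, weights = [], []
    classes, counts = np.unique(y, return_counts=True)
    for c, n_c in zip(classes, counts):
        k = min(n_c, max(1, int(round(size * n_c / len(y)))))
        chosen = rng.choice(np.where(y == c)[0], size=k, replace=False)
        idx.extend(chosen.tolist())
        weights.extend([n_c / k] * k)   # each sampled point stands in for n_c / k points
    return np.array(idx), np.array(weights)
```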



Paperid:1586
Authors:Mikołaj Małkiński, Jacek Mańdziuk
Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland Faculty of Computer Science, AGH University of Krakow, Krakow, Poland
Abstract:
Abstract Visual Reasoning (AVR) comprises a wide selection of various problems similar to those used in human IQ tests. Recent years have brought dynamic progress in solving particular AVR tasks, however, in the contemporary literature AVR problems are largely dealt with in isolation, leading to highly specialized task-specific methods. With the aim of developing universal learning systems in the AVR domain, we propose the unified model for solving Single-Choice visual Reasoning tasks (SCAR), capable of solving various single-choice AVR tasks, without making any a priori assumptions about the task structure, in particular the number and location of panels. The proposed model relies on a novel Structure-Aware dynamic Layer (SAL), which adapts its weights to the structure of the considered AVR problem. Experiments conducted on Raven's Progressive Matrices, Visual Analogy Problems, and Odd One Out problems show that SCAR (SAL-based models, in general) effectively solves diverse AVR tasks, and its performance is on par with the state-of-the-art task-specific baselines. What is more, SCAR demonstrates effective knowledge reuse in multi-task and transfer learning settings. To our knowledge, this work is the first successful attempt to construct a general single-choice AVR solver relying on self-configurable architecture and unified solving method. With this work we aim to stimulate and foster progress on task-independent research paths in the AVR domain, with the long-term goal of development of a general AVR solver.



Paperid:1587
Authors:Francesca Mandel, Ian Barnett
University of Pennsylvania, University of Pennsylvania
Abstract:
Neural networks are powerful predictive models, but they provide little insight into the nature of relationships between predictors and outcomes. Although numerous methods have been proposed to quantify the relative contributions of input features, statistical inference and hypothesis testing of feature associations remain largely unexplored. We propose a permutation-based approach to testing that uses the partial derivatives of the network output with respect to specific inputs to assess both the significance of input features and whether significant features are linearly associated with the network output. These tests, which can be flexibly applied to a variety of network architectures, enhance the explanatory power of neural networks, and combined with powerful predictive capability, extend the applicability of these models.
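A minimal sketch of a permutation-based test built on partial derivatives, in the spirit of the approach above; the choice of statistic (mean squared derivative) and the permutation scheme are assumptions for illustration:

```python
import numpy as np

def permutation_feature_test(grad_fn, X, feature, n_perm=1000, rng=None):
    """Permutation test for one input feature of a trained network.
    grad_fn(X) is assumed to return the partial derivatives of the network
    output w.r.t. each input column; the statistic is the mean squared
    derivative for `feature`, and the null distribution is built by
    permuting that feature's column. Illustrative sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    stat = float(np.mean(grad_fn(X)[:, feature] ** 2))
    null = np.empty(n_perm)
    for i in range(n_perm):
        Xp = X.copy()
        Xp[:, feature] = rng.permutation(Xp[:, feature])
        null[i] = np.mean(grad_fn(Xp)[:, feature] ** 2)
    p_value = (1 + np.sum(null >= stat)) / (n_perm + 1)
    return stat, float(p_value)
```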



Paperid:1588
Authors:Davide Maran, Pierriccardo Olivieri, Francesco Emanuele Stradi, Giuseppe Urso, Nicola Gatti, Marcello Restelli
Politecnico di Milano, Politecnico di Milano, Politecnico di Milano, Politecnico di Milano, Politecnico di Milano, Politecnico di Milano
Abstract:
In this paper, we investigate the optimal online configuration of episodic Markov decision processes when the space of the possible configurations is continuous. Specifically, we study the interaction between a learner (referred to as the configurator) and an agent with a fixed, unknown policy, when the learner aims to minimize her losses by choosing transition functions in an online fashion. The losses may be unrelated to the agent's rewards. This problem applies to many real-world scenarios where the learner seeks to manipulate the Markov decision process to her advantage. We study both deterministic and stochastic settings, where the losses are either fixed or sampled from an unknown probability distribution. We design two algorithms whose peculiarity is to rely on occupancy measures to explore with optimism the continuous space of transition functions, achieving constant regret in deterministic settings and sublinear regret in stochastic settings, respectively. Moreover, we prove that the regret bound is tight with respect to any constant factor in deterministic settings. Finally, we compare the empirical performance of our algorithms with a baseline in synthetic experiments.



Paperid:1589
Authors:Sascha Marton, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt
University of Mannheim, University of Rostock, University of Mannheim, University of Mannheim
Abstract:
Decision Trees (DTs) are commonly used for many machine learning tasks due to their high degree of interpretability. However, learning a DT from data is a difficult optimization problem, as it is non-convex and non-differentiable. Therefore, common approaches learn DTs using a greedy growth algorithm that minimizes the impurity locally at each internal node. Unfortunately, this greedy procedure can lead to inaccurate trees. In this paper, we present a novel approach for learning hard, axis-aligned DTs with gradient descent. The proposed method uses backpropagation with a straight-through operator on a dense DT representation, to jointly optimize all tree parameters. Our approach outperforms existing methods on binary classification benchmarks and achieves competitive results for multi-class tasks. The implementation is available under: https://github.com/s-marton/GradTree
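For intuition, a minimal sketch of a hard axis-aligned split evaluated with a straight-through gradient, assuming PyTorch and a sigmoid surrogate; this is illustrative and not the exact GradTree operator:

```python
import torch

def hard_split_straight_through(x, feature_logits, threshold):
    """Route samples with a hard 0/1 axis-aligned split in the forward pass
    while letting gradients flow through a sigmoid surrogate (straight-through).
    x: (n, d) inputs, feature_logits: (d,) soft feature-selection logits,
    threshold: scalar split threshold. Illustrative sketch only."""
    score = x @ torch.softmax(feature_logits, dim=0) - threshold  # soft feature pick
    soft = torch.sigmoid(score)
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()  # forward value == hard, gradient == d(soft)
```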



Paperid:1590
Authors:Jeremy McMahan, Young Wu, Xiaojin Zhu, Qiaomin Xie
University of Wisconsin-Madison, University of Wisconsin-Madison, University of Wisconsin-Madison, University of Wisconsin-Madison
Abstract:
To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.



Paperid:1591
Authors:Xiangming Meng, Yoshiyuki Kabashima
Zhejiang University, The University of Tokyo
Abstract:
In practical compressed sensing (CS), the obtained measurements typically necessitate quantization to a limited number of bits prior to transmission or storage. This nonlinear quantization process poses significant recovery challenges, particularly with extremely coarse quantization such as 1-bit. Recently, an efficient algorithm called QCS-SGM was proposed for quantized CS (QCS) which utilizes score-based generative models (SGM) as an implicit prior. Due to the adeptness of SGM in capturing the intricate structures of natural signals, QCS-SGM substantially outperforms previous QCS methods. However, QCS-SGM is constrained to (approximately) row-orthogonal sensing matrices as the computation of the likelihood score becomes intractable otherwise. To address this limitation, we introduce an advanced variant of QCS-SGM, termed QCS-SGM+, capable of handling general matrices effectively. The key idea is a Bayesian inference perspective on the likelihood score computation, wherein expectation propagation is employed for its approximate computation. Extensive experiments are conducted, demonstrating the substantial superiority of QCS-SGM+ over QCS-SGM for general sensing matrices beyond mere row-orthogonality.



Paperid:1592
Authors:Nicolas Michel, Giovanni Chierchia, Romain Negrel, Jean-François Bercher
Univ Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France, Univ Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France, Univ Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France, Univ Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France
Abstract:
We use the maximum a posteriori estimation principle for learning representations distributed on the unit sphere. We propose to use the angular Gaussian distribution, which corresponds to a Gaussian projected on the unit sphere, and derive the associated loss function. We also consider the von Mises-Fisher distribution, which is the conditional of a Gaussian on the unit sphere. The learned representations are pushed toward fixed directions, which are the prior means of the Gaussians, allowing for a learning strategy that is resilient to data drift. This makes it suitable for online continual learning, which is the problem of training neural networks on a continuous data stream, where multiple classification tasks are presented sequentially so that data from past tasks are no longer accessible, and data from the current task can be seen only once. To address this challenging scenario, we propose a memory-based representation learning technique equipped with our new loss functions. Our approach does not require negative data or knowledge of task boundaries and performs well with smaller batch sizes while being computationally efficient. We demonstrate with extensive experiments that the proposed method outperforms the current state-of-the-art methods on both standard evaluation scenarios and realistic scenarios with blurry task boundaries. For reproducibility, we use the same training pipeline for every compared method and share the code at https://github.com/Nicolas1203/ocl-fd.
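A minimal sketch of a loss that pushes unit-norm embeddings toward fixed prior class directions, here written as a simplified von Mises-Fisher-style objective; the paper's exact angular Gaussian derivation differs:

```python
import numpy as np

def fixed_direction_loss(z, labels, prior_means, kappa=10.0):
    """Penalize the cosine distance between each unit-normalized embedding and
    the fixed prior direction of its class (a simplified vMF-style penalty).
    z: (n, d) embeddings, prior_means: (num_classes, d) fixed directions."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    mu = prior_means / np.linalg.norm(prior_means, axis=1, keepdims=True)
    cos_sim = np.sum(z * mu[labels], axis=1)      # cosine to the class direction
    return float(np.mean(kappa * (1.0 - cos_sim)))
```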



Paperid:1593
Authors:Umberto Michieli, Mete Ozay
Samsung Research UK, Samsung Research UK
Abstract:
Continual Learning (CL) aims to learn a sequence of problems (i.e., tasks and domains) by transferring knowledge acquired on previous problems, whilst avoiding forgetting of past ones. Different from previous approaches which focused on CL for one NLP task or domain in a specific use case, in this paper, we address a more general CL setting to learn from a sequence of problems in a unique framework. Our method, HOP, permits to hop across tasks and domains by addressing the CL problem along three directions: (i) we employ a set of adapters to generalize a large pre-trained model to unseen problems, (ii) we compute high-order moments over the distribution of embedded representations to distinguish independent and correlated statistics across different tasks and domains, (iii) we process this enriched information with auxiliary heads specialized for each end problem. Extensive experimental campaign on 4 NLP applications, 5 benchmarks and 2 CL setups demonstrates the effectiveness of our HOP.



Paperid:1594
Authors:Zeping Min, Jinfeng Bai, Chengfei Li
Peking University, TAL Education Group, TAL Education Group
Abstract:
Semi-supervised learning algorithms that use pseudo-labeling have become increasingly popular for improving model performance by utilizing both labeled and unlabeled data. In this paper, we offer a fresh perspective on the selection of pseudo-labels, inspired by theoretical insights. We suggest that pseudo-labels with a high degree of local variance are more prone to inaccuracies. Based on this premise, we introduce the Local Variance Match (LVM) method, which aims to optimize the selection of pseudo-labels in semi-supervised learning (SSL) tasks. Our methodology is validated through a series of experiments on widely-used image classification datasets, such as CIFAR-10, CIFAR-100, and SVHN, spanning various labeled data quantity scenarios. The empirical findings show that the LVM method substantially outpaces current SSL techniques, achieving state-of-the-art results in many of these scenarios. For instance, we observed an error rate of 5.41% on CIFAR-10 with a single label for each class, 35.87% on CIFAR-100 when using four labels per class, and 1.94% on SVHN with four labels for each class. Notably, the standout error rate of 5.41% is less than 1% shy of the performance in a fully-supervised learning environment. In experiments on ImageNet with 100k labeled data, the LVM also reached state-of-the-art outcomes. Additionally, the efficacy of the LVM method is further validated by its stellar performance in speech recognition experiments.
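A rough sketch of a local-variance filter for pseudo-labels in the spirit of the description above; the neighborhood definition, the variance statistic, and the threshold `tau` are illustrative assumptions, not the paper's exact LVM rule:

```python
import numpy as np

def select_low_variance_pseudo_labels(probs, neighbors, tau=0.05):
    """Keep a pseudo-label only when the predicted class probabilities of the
    sample's neighbors have low variance (a proxy for local consistency).
    probs: (n, num_classes) predictions, neighbors[i]: index array of i's
    nearest neighbors. Illustrative sketch only."""
    labels = probs.argmax(axis=1)
    keep = []
    for i, nbrs in enumerate(neighbors):
        local_var = float(probs[nbrs].var(axis=0).mean())
        if local_var < tau:
            keep.append(i)
    keep = np.array(keep, dtype=int)
    return keep, labels[keep]
```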



Paperid:1595
Authors:Coenraad Mouton, Marthinus Wilhelmus Theunissen, Marelie H Davel
Faculty of Engineering, North-West University, South Africa Centre for Artificial Intelligence Research, South Africa South African National Space Agency, Faculty of Engineering, North-West University, South Africa Centre for Artificial Intelligence Research, South Africa, Faculty of Engineering, North-West University, South Africa Centre for Artificial Intelligence Research, South Africa National Institute for Theoretical and Computational Sciences, South Africa
Abstract:
Understanding generalization in deep neural networks is an active area of research. A promising avenue of exploration has been that of margin measurements: the shortest distance to the decision boundary for a given sample or its representation internal to the network. While margins have been shown to be correlated with the generalization ability of a model when measured at its hidden representations (hidden margins), no such link between large margins and generalization has been established for input margins. We show that while input margins are not generally predictive of generalization, they can be if the search space is appropriately constrained. We develop such a measure based on input margins, which we refer to as 'constrained margins'. The predictive power of this new measure is demonstrated on the 'Predicting Generalization in Deep Learning' (PGDL) dataset and contrasted with hidden representation margins. We find that constrained margins achieve highly competitive scores and outperform other margin measurements in general. This provides a novel insight on the relationship between generalization and classification margins, and highlights the importance of considering the data manifold for investigations of generalization in DNNs.



Paperid:1596
Authors:Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer, Eyke Hüllermeier
LMU Munich, MCML Munich, D-80539 Munich, Germany, Bielefeld University, CITEC, D-33619 Bielefeld, Germany, Bielefeld University, CITEC, D-33619 Bielefeld, Germany, LMU Munich, MCML Munich, D-80539 Munich, Germany
Abstract:
While shallow decision trees may be interpretable, larger ensemble models like gradient-boosted trees, which often set the state of the art in machine learning problems involving tabular data, still remain black box models. As a remedy, the Shapley value (SV) is a well-known concept in explainable artificial intelligence (XAI) research for quantifying additive feature attributions of predictions. The model-specific TreeSHAP methodology solves the exponential complexity for retrieving exact SVs from tree-based models. Expanding beyond individual feature attribution, Shapley interactions reveal the impact of intricate feature interactions of any order. In this work, we present TreeSHAP-IQ, an efficient method to compute any-order additive Shapley interactions for predictions of tree-based models. TreeSHAP-IQ is supported by a mathematical framework that exploits polynomial arithmetic to compute the interaction scores in a single recursive traversal of the tree, akin to Linear TreeSHAP. We apply TreeSHAP-IQ on state-of-the-art tree ensembles and explore interactions on well-established benchmark datasets.



Paperid:1597
Authors:Lokesh Nagalapatti, Akshay Iyer, Abir De, Sunita Sarawagi
IIT Bombay, IIT Bombay, IIT Bombay, IIT Bombay
Abstract:
We address the Individualized continuous treatment effect (ICTE) estimation problem where we predict the effect of any continuous valued treatment on an individual using observational data. The main challenge in this estimation task is the potential confounding of treatment assignment with the individual's covariates in the training data, whereas during inference ICTE requires prediction on independently sampled treatments. In contrast to prior work that relied on regularizers or unstable GAN training, we advocate the direct approach of augmenting training individuals with independently sampled treatments and inferred counterfactual outcomes. We infer counterfactual outcomes using a two-pronged strategy: a Gradient Interpolation for close-to-observed treatments, and a Gaussian Process based Kernel Smoothing which allows us to down-weigh high-variance inferences. We evaluate our method on five benchmarks and show that our method outperforms six state-of-the-art methods on the counterfactual estimation error. We analyze the superior performance of our method by showing that (1) our inferred counterfactual responses are more accurate, and (2) adding them to the training data reduces the distributional distance between the confounded training distribution and the test distribution where treatment is independent of covariates. Our proposed method is model-agnostic and we show that it improves ICTE accuracy of several existing models.



Paperid:1598
Authors:Ruiqian Nai, Zixin Wen, Ji Li, Yuanzhi Li, Yang Gao
Tsinghua University, Beijing, China Shanghai Artificial Intelligence Laboratory, Shanghai, China Shanghai Qi Zhi Institute, Shanghai, China, Carnegie Mellon University, Pittsburgh, PA, USA, Tsinghua University, Beijing, China, Carnegie Mellon University, Pittsburgh, PA, USA, Tsinghua University, Beijing, China Shanghai Artificial Intelligence Laboratory, Shanghai, China Shanghai Qi Zhi Institute, Shanghai, China
Abstract:
In representation learning, a disentangled representation is highly desirable as it encodes generative factors of data in a separable and compact pattern. Researchers have advocated leveraging disentangled representations to complete downstream tasks with encouraging empirical evidence. This paper further investigates the necessity of disentangled representation in downstream applications. Specifically, we show that dimension-wise disentangled representations are unnecessary on a fundamental downstream task, abstract visual reasoning. We provide extensive empirical evidence against the necessity of disentanglement, covering multiple datasets, representation learning methods, and downstream network architectures. Furthermore, our findings suggest that the informativeness of representations is a better indicator of downstream performance than disentanglement. Finally, the positive correlation between informativeness and disentanglement explains the claimed usefulness of disentangled representations in previous works. The source code is available at https://github.com/Richard-coder-Nai/disentanglement-lib-necessity.git



Paperid:1599
Authors:Shintaro Nakamura, Masashi Sugiyama
The University of Tokyo RIKEN AIP, RIKEN AIP The University of Tokyo
Abstract:
We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given stochastic arms, and the reward of each arm follows an unknown distribution. In each time step, the player pulls a single arm and observes its reward. The player's goal is to identify the optimal action from a finite-sized real-valued action set with as few arm pulls as possible. Previous methods in the R-CPE-MAB require enumerating all of the feasible actions of the combinatorial optimization problem one is considering. In general, since the size of the action set grows exponentially large with respect to the number of arms, this is almost practically impossible when the number of arms is large. We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large with respect to the number of arms. We also introduce a novel problem-dependent sample complexity lower bound of the R-CPE-MAB problem, and show that the GenTS-Explore algorithm achieves the optimal sample complexity up to a problem-dependent constant factor.
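To make the oracle-based idea concrete, here is a generic Thompson-sampling-style exploration step that avoids enumerating the action set by delegating to a combinatorial solver. It is a simplified illustration under Gaussian posteriors, not the GenTS-Explore algorithm itself, and `solve_oracle` is a hypothetical placeholder.

```python
# Hedged sketch of one exploration round with a combinatorial optimization oracle.
import numpy as np

def ts_explore_step(post_mean, post_var, solve_oracle, rng=None):
    """post_mean, post_var: per-arm Gaussian posterior parameters (1-D arrays);
    solve_oracle: maps a sampled reward vector to the best feasible action
    (e.g., a shortest-path or knapsack solver), avoiding explicit enumeration."""
    rng = rng or np.random.default_rng()
    theta = rng.normal(post_mean, np.sqrt(post_var))  # one posterior sample per arm
    return solve_oracle(theta)                        # best action under this sample
```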



Paperid:1600
Authors:Md Nasim, Yexiang Xue
Purdue University, Purdue University
Abstract:
Accelerating the learning of Partial Differential Equations (PDEs) from experimental data will speed up the pace of scientific discovery. Previous randomized algorithms exploit sparsity in PDE updates for acceleration. However, such methods are applicable to a limited class of decomposable PDEs, which have sparse features in the value domain. We propose Reel, which accelerates the learning of PDEs via random projection and has much broader applicability. Reel exploits the sparsity by decomposing dense updates into sparse ones in both the value and frequency domains. This decomposition enables efficient learning when the source of the updates consists of gradually changing terms across large areas (sparse in the frequency domain) in addition to a few rapid updates concentrated in a small set of “interfacial” regions (sparse in the value domain). Random projection is then applied to compress the sparse signals for learning. To expand the model applicability, Taylor series expansion is used in Reel to approximate the nonlinear PDE updates with polynomials in the decomposable form. Theoretically, we derive a constant factor approximation between the projected loss function and the original one with a polylogarithmic number of projected dimensions. Experimentally, we provide empirical evidence that our proposed Reel can lead to faster learning of PDE models (70-98% reduction in training time when the data is compressed to 1% of its original size) with quality comparable to the non-compressed models.
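For intuition, the compression step in isolation looks roughly like a Gaussian Johnson-Lindenstrauss projection; the sketch below is an assumption-laden illustration and omits Reel's value/frequency-domain decomposition and the Taylor-series treatment of nonlinear updates.

```python
# Hedged sketch: compress (sparse-ish) PDE update signals with a random projection.
import numpy as np

def random_project(updates, k, seed=0):
    """updates: (n_samples, d) matrix of update signals; returns an (n_samples, k) sketch."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 1.0 / np.sqrt(k), size=(updates.shape[1], k))
    return updates @ P
```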



Paperid:1601
Authors:Adil Nawaz, Guopeng Chen, Muhammad Umair Raza, Zahid Iqbal, Jianqiang Li, Victor C.M. Leung, Jie Chen
Shenzhen University, Shenzhen University, Shenzhen University, Shenzhen University, Shenzhen University, Shenzhen University, Shenzhen University
Abstract:
Distributed sparse Gaussian process (dGP) models provide an ability to achieve accurate predictive performance using data from multiple devices in a time-efficient and scalable manner. The distributed computation of the model, however, risks exposing privately owned data to public manipulation. In this paper we propose a secure solution for dGP regression models using multi-key homomorphic encryption. Experimental results show that with a little sacrifice in terms of time complexity, we achieve a secure dGP model without deteriorating the predictive performance compared to traditional non-secure dGP models. We also present a practical implementation of the proposed model using several Nvidia Jetson Nano Developer Kit modules to simulate a real-world scenario. Thus, the secure dGP model addresses the data security issues of dGP and provides a secure and trustworthy solution for multiple devices to use privately owned data for model computation in a distributed environment, while retaining the speed, scalability and robustness of dGP.



Paperid:1602
Authors:David D. Nguyen, David Liebowitz, Salil S. Kanhere, Surya Nepal
UNSW Sydney CSIRO Data61 Cybersecurity CRC, Penten UNSW Sydney, UNSW Sydney Cybersecurity CRC, CSIRO Data61 Cybersecurity CRC
Abstract:
In many real-world applications, from robotics to pedestrian trajectory prediction, there is a need to predict multiple real-valued outputs to represent several potential scenarios. Current deep learning techniques to address multiple-output problems are based on two main methodologies: (1) mixture density networks, which suffer from poor stability at high dimensions, or (2) multiple choice learning (MCL), an approach that uses M single-output functions, each only producing a point estimate hypothesis. This paper presents a Mixture of Multiple-Output functions (MoM) approach using a novel variant of dropout, Multiple Hypothesis Dropout. Unlike traditional MCL-based approaches, each multiple-output function not only estimates the mean but also the variance for its hypothesis. This is achieved through a novel stochastic winner-take-all loss which allows each multiple-output function to estimate variance through the spread of its subnetwork predictions. Experiments on supervised learning problems illustrate that our approach outperforms existing solutions for reconstructing multimodal output distributions. Additional studies on unsupervised learning problems show that estimating the parameters of latent posterior distributions within a discrete autoencoder significantly improves codebook efficiency, sample quality, precision and recall.
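For context, a plain (non-stochastic) winner-take-all loss over M hypothesis heads is sketched below: only the head closest to the target receives gradient for that sample. The paper's stochastic variant with Multiple Hypothesis Dropout builds on this idea; the tensor layout here is an assumption.

```python
# Hedged sketch of a winner-take-all loss for multiple-output heads (PyTorch).
import torch

def wta_loss(hypotheses, target):
    """hypotheses: (M, batch, dim) predictions from M heads; target: (batch, dim)."""
    errors = ((hypotheses - target.unsqueeze(0)) ** 2).mean(dim=-1)  # (M, batch)
    best = errors.min(dim=0).values                                  # winning head per sample
    return best.mean()
```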



Paperid:1603
Authors:Viet Nguyen, Giang Vu, Tung Nguyen Thanh, Khoat Than, Toan Tran
VinAI Research, Vietnam, Hanoi University of Science and Technology, Vietnam Viettel Group, Vietnam, Hanoi University of Science and Technology, Vietnam Viettel Group, Vietnam, Hanoi University of Science and Technology, Vietnam, VinAI Research, Vietnam
Abstract:
Denoising Probabilistic Models (DPMs) represent an emerging domain of generative models that excel in generating diverse and high-quality images. However, most current training methods for DPMs often neglect the correlation between timesteps, limiting the model's performance in generating images effectively. Notably, we theoretically point out that this issue can be caused by the cumulative estimation gap between the predicted and the actual trajectory. To minimize that gap, we propose a novel sequence-aware loss that aims to reduce the estimation gap to enhance the sampling quality. Furthermore, we theoretically show that our proposed loss function is a tighter upper bound of the estimation loss in comparison with the conventional loss in DPMs. Experimental results on several benchmark datasets including CIFAR10, CelebA, and CelebA-HQ consistently show a remarkable improvement of our proposed method regarding the image generation quality measured by FID and Inception Score compared to several DPM baselines. Our code and pre-trained checkpoints are available at https://github.com/VinAIResearch/SA-DPM.



Paperid:1604
Authors:Buqing Nie, Jingtian Ji, Yangqing Fu, Yue Gao
MoE Key Lab of Artificial Intelligence and AI Institute, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence and AI Institute, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence and AI Institute, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence and AI Institute, Shanghai Jiao Tong University
Abstract:
Deep Reinforcement Learning (DRL) has achieved remarkable advances in sequential decision tasks. However, recent works have revealed that DRL agents are susceptible to slight perturbations in observations. This vulnerability raises concerns regarding the effectiveness and robustness of deploying such agents in real-world applications. In this work, we propose a novel robust reinforcement learning method called SortRL, which improves the robustness of DRL policies against observation perturbations from the perspective of the network architecture. We employ a novel architecture for the policy network that incorporates global $l_\infty$ Lipschitz continuity and provide a convenient method to enhance policy robustness based on the output margin. Besides, a training framework is designed for SortRL, which solves given tasks while maintaining robustness against $l_\infty$ bounded perturbations on the observations. Several experiments are conducted to evaluate the effectiveness of our method, including classic control tasks and video games. The results demonstrate that SortRL achieves state-of-the-art robustness performance against different perturbation strengths.



Paperid:1605
Authors:Feiping Nie, Zhezheng Hao, Rong Wang
Northwestern Polytechnical University, Northwestern Polytechnical University, Northwestern Polytechnical University
Abstract:
Support Vector Machine (SVM) stands out as a prominent machine learning technique widely applied in practical pattern recognition tasks. It achieves binary classification by maximizing the "margin", which represents the minimum distance between instances and the decision boundary. Although many efforts have been dedicated to extending SVM to the multi-class case through strategies such as one-versus-one and one-versus-the-rest, satisfactory solutions remain to be developed. In this paper, we propose a novel method for multi-class SVM that incorporates pairwise class loss considerations and maximizes the minimum margin. Adhering to this concept, we embrace a new formulation that imparts heightened flexibility to multi-class SVM. Furthermore, the correlations between the proposed method and multiple forms of multi-class SVM are analyzed. The proposed regularizer, akin to the concept of "margin", can serve as a seamless enhancement over the softmax in deep learning, providing guidance for network parameter learning. Empirical evaluations demonstrate the effectiveness and superiority of our proposed method over existing multi-classification methods. The complete version is available at https://arxiv.org/pdf/2312.06578.pdf. Code is available at https://github.com/zz-haooo/M3SVM.



Paperid:1606
Authors:Motoki Omura, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada
The University of Tokyo, The University of Tokyo RIKEN, The University of Tokyo RIKEN, The University of Tokyo RIKEN
Abstract:
In deep reinforcement learning, estimating the value function to evaluate the quality of states and actions is essential. The value function is often trained using the least squares method, which implicitly assumes a Gaussian error distribution. However, a recent study suggested that the error distribution for training the value function is often skewed because of the properties of the Bellman operator, and violates the implicit assumption of normal error distribution in the least squares method. To address this, we proposed a method called Symmetric Q-learning, in which synthetic noise generated from a zero-mean distribution is added to the target values to generate a Gaussian error distribution. We evaluated the proposed method on continuous control benchmark tasks in MuJoCo. It improved the sample efficiency of a state-of-the-art reinforcement learning method by reducing the skewness of the error distribution.
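The core mechanism can be sketched in a few lines: zero-mean synthetic noise is added to the TD target so that the residuals seen by least-squares value regression become closer to Gaussian. How the noise distribution is chosen follows the paper; the fixed `noise_scale` below is purely illustrative.

```python
# Hedged sketch: symmetrize the error distribution by perturbing the TD target.
import torch

def symmetric_td_target(reward, gamma, next_q, noise_scale=0.1):
    noise = noise_scale * torch.randn_like(next_q)   # zero-mean synthetic noise
    return reward + gamma * next_q + noise
```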



Paperid:1607
Authors:Tom Overman, Garrett Blum, Diego Klabjan
Northwestern University, Northwestern University, Northwestern University
Abstract:
Very few methods for hybrid federated learning, where clients only hold subsets of both features and samples, exist. Yet, this scenario is very important in practical settings. We provide a fast, robust algorithm for hybrid federated learning that hinges on Fenchel Duality. We prove the convergence of the algorithm to the same solution as if the model was trained centrally in a variety of practical regimes. Furthermore, we provide experimental results that demonstrate the performance improvements of the algorithm over a commonly used method in federated learning, FedAvg, and an existing hybrid FL algorithm, HyFEM. We also provide privacy considerations and necessary steps to protect client data.



Paperid:1608
Authors:Ryota Ozaki, Kazuki Ishikawa, Youhei Kanzaki, Shion Takeno, Ichiro Takeuchi, Masayuki Karasuyama
Nagoya Institute of Technology, Nagoya Institute of Technology, Nagoya Institute of Technology, RIKEN AIP, Nagoya University RIKEN AIP, Nagoya Institute of Technology
Abstract:
There are many real-world black-box optimization problems that need to optimize multiple criteria simultaneously. However, in a multi-objective optimization (MOO) problem, identifying the whole Pareto front requires a prohibitive search cost, while in many practical scenarios, the decision maker (DM) only needs a specific solution among the set of the Pareto optimal solutions. We propose a Bayesian optimization (BO) approach to identifying the most preferred solution in the MOO with expensive objective functions, in which a Bayesian preference model of the DM is adaptively estimated in an interactive manner based on two types of supervision called the pairwise preference and improvement request. To explore the most preferred solution, we define an acquisition function in which the uncertainty in both the objective function and the DM preference is incorporated. Further, to minimize the interaction cost with the DM, we also propose an active learning strategy for the preference estimation. We empirically demonstrate the effectiveness of our proposed method through the benchmark function optimization and the hyper-parameter optimization problems for machine learning models.



Paperid:1609
Authors:Chenglu Pan, Jiarong Xu, Yue Yu, Ziqi Yang, Qingbiao Wu, Chunping Wang, Lei Chen, Yang Yang
Zhejiang University Fudan University, Fudan University, Fudan University, Zhejiang University ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, FinVolution Group, FinVolution Group, Zhejiang University
Abstract:
Graph federated learning (FL) has emerged as a pivotal paradigm enabling multiple agents to collaboratively train a graph model while preserving local data privacy. Yet, current efforts overlook a key issue: agents are self-interested and would hesitate to share data without fair and satisfactory incentives. This paper is the first endeavor to address this issue by studying the incentive mechanism for graph federated learning. We identify a unique phenomenon in graph federated learning: the presence of agents posing potential harm to the federation and agents contributing with delays. This stands in contrast to previous FL incentive mechanisms that assume all agents contribute positively and in a timely manner. In view of this, this paper presents a novel incentive mechanism tailored for fair graph federated learning, integrating incentives derived from both model gradient and payoff. To achieve this, we first introduce an agent valuation function aimed at quantifying agent contributions through the introduction of two criteria: gradient alignment and graph diversity. Moreover, due to the high heterogeneity in graph federated learning, striking a balance between accuracy and fairness becomes particularly crucial. We introduce motif prototypes to enhance accuracy, communicated between the server and agents, enhancing global model aggregation and aiding agents in local model optimization. Extensive experiments show that our model achieves the best trade-off between accuracy and the fairness of model gradient, as well as superior payoff fairness.



Paperid:1610
Authors:Liming Pan, Cheng Shi, Ivan Dokmanic
University of Science and Technology of China Nanjing Normal University, University of Basel, University of Basel
Abstract:
Relational inference aims to identify interactions between parts of a dynamical system from the observed dynamics. Current state-of-the-art methods fit the dynamics with a graph neural network (GNN) on a learnable graph. They use one-step message-passing GNNs, intuitively the right choice since non-locality of multi-step or spectral GNNs may confuse direct and indirect interactions. But the effective interaction graph depends on the sampling rate and it is rarely localized to direct neighbors, leading to poor local optima for the one-step model. In this work, we propose a graph dynamics prior (GDP) for relational inference. GDP constructively uses error amplification in non-local polynomial filters to steer the solution to the ground-truth graph. To deal with non-uniqueness, GDP simultaneously fits a "shallow" one-step model and a polynomial multi-step model with shared graph topology. Experiments show that GDP reconstructs graphs far more accurately than earlier methods, with remarkable robustness to under-sampling. Since appropriate sampling rates for unknown dynamical systems are not known a priori, this robustness makes GDP suitable for real applications in scientific machine learning. Reproducible code is available at https://github.com/DaDaCheng/GDP.



Paperid:1611
Authors:Zherong Pan, Xifeng Gao, Kui Wu
Lightspeed Studios, Lightspeed Studios, Lightspeed Studios
Abstract:
Predicting the state evolution of ultra high-dimensional, time-reversible fluid dynamic systems is a crucial but computationally expensive task. Existing physics-informed neural networks either incur high inference cost or cannot preserve the time-reversible nature of the underlying dynamics system. We propose a model-based approach to identify low-dimensional, time-reversible, nonlinear fluid dynamic systems. Our method utilizes the symplectic structure of reduced Eulerian fluid and uses stochastic Riemannian optimization to obtain low-dimensional bases that minimize the expected trajectory-wise dimension-reduction error over a given distribution of initial conditions. We show that such minimization is well-defined since the reduced trajectories are differentiable with respect to the subspace bases over the entire Grassmannian manifold, under proper choices of timestep sizes and numerical integrators. Finally, we propose a loss function measuring the trajectory-wise discrepancy between the original and reduced models. By tensor precomputation, we show that gradient information of such a loss function can be evaluated efficiently over a long trajectory without time-integrating the high-dimensional dynamic system. Through evaluations on a range of simulation benchmarks, we show that our method reduces the discrepancy by 50-90 percent over conventional reduced models and we outperform PINNs by exactly preserving the time reversibility.



Paperid:1612
Authors:Zibin Pan, Chi Li, Fangchen Yu, Shuyi Wang, Haijin Wang, Xiaoying Tang, Junhua Zhao
The Chinese University of Hong Kong, Shenzhen The Shenzhen Institute of Artificial Intelligence and Robotics for Society, The Chinese University of Hong Kong, Shenzhen Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, The Chinese University of Hong Kong, Shenzhen The Shenzhen Institute of Artificial Intelligence and Robotics for Society, The Chinese University of Hong Kong, Shenzhen The Shenzhen Institute of Artificial Intelligence and Robotics for Society, The Chinese University of Hong Kong, Shenzhen The Shenzhen Institute of Artificial Intelligence and Robotics for Society The Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong, Shenzhen The Shenzhen Institute of Artificial Intelligence and Robotics for Society
Abstract:
Fairness has become an important concern in Federated Learning (FL). An unfair model that performs well for some clients while performing poorly for others can reduce the willingness of clients to participate. In this work, we identify a direct cause of unfairness in FL: the use of an unfair direction to update the global model, which favors some clients while conflicting with other clients’ gradients at the model and layer levels. To address these issues, we propose a layer-wise fair Federated Learning algorithm (FedLF). Firstly, we formulate a multi-objective optimization problem with an effective fair-driven objective for FL. A layer-wise fair direction is then calculated to mitigate the model- and layer-level gradient conflicts and reduce the improvement bias. We further provide a theoretical analysis of how FedLF can improve fairness and guarantee convergence. Extensive experiments on different learning tasks and models demonstrate that FedLF outperforms the SOTA FL algorithms in terms of accuracy and fairness. The source code is available at https://github.com/zibinpan/FedLF.
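A related (but simpler) gradient-conflict remedy, in the spirit of PCGrad rather than FedLF's layer-wise fair direction, is to project out the component of a client update that conflicts with a reference direction; the sketch below is illustrative only.

```python
# Hedged sketch: remove the conflicting component of a client update (PCGrad-style).
import torch

def deconflict(update, reference):
    dot = torch.dot(update.flatten(), reference.flatten())
    if dot < 0:  # conflict: negative inner product with the reference direction
        update = update - dot / (reference.norm() ** 2 + 1e-12) * reference
    return update
```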



Paperid:1613
Authors:Andreas Papachristodoulou, Christos Kyrkou, Stelios Timotheou, Theocharis Theocharides
KIOS Research and Innovation Center of Excellence, Department of Electrical and Computer Engineering, University of Cyprus, KIOS Research and Innovation Center of Excellence, Department of Electrical and Computer Engineering, University of Cyprus, KIOS Research and Innovation Center of Excellence, Department of Electrical and Computer Engineering, University of Cyprus, KIOS Research and Innovation Center of Excellence, Department of Electrical and Computer Engineering, University of Cyprus
Abstract:
The Forward-Forward (FF) algorithm has been recently proposed to alleviate the issues of backpropagation (BP) commonly used to train deep neural networks. However, its current formulation exhibits limitations such as the generation of negative data, slower convergence, and inadequate performance on complex tasks. In this paper, we take the main ideas of FF and improve them by leveraging channel-wise competitive learning in the context of convolutional neural networks for image classification tasks. A layer-wise loss function is introduced that promotes competitive learning and eliminates the need for negative data construction. To enhance both the learning of compositional features and feature space partitioning, a channel-wise feature separator and extractor block is proposed that complements the competitive learning process. Our method outperforms recent FF-based models on image classification tasks, achieving testing errors of 0.58%, 7.69%, 21.89%, and 48.77% on MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100 respectively. Our approach bridges the performance gap between FF learning and BP methods, indicating the potential of our proposed approach to learn useful representations in a layer-wise modular fashion, enabling more efficient and flexible learning. Our source code and supplementary material are available at https://github.com/andreaspapac/CwComp.
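For readers unfamiliar with FF, the original layer objective (Hinton's formulation, shown for context only) trains each layer to give high "goodness", the sum of squared activations, on positive data and low goodness on negative data; the channel-wise competitive loss proposed above replaces the negative-data term.

```python
# Hedged sketch of the original Forward-Forward layer loss (not the CwComp loss).
import torch
import torch.nn.functional as F

def ff_layer_loss(act_pos, act_neg, threshold=2.0):
    good_pos = (act_pos ** 2).sum(dim=1)   # goodness of positive samples
    good_neg = (act_neg ** 2).sum(dim=1)   # goodness of negative samples
    return (F.softplus(threshold - good_pos).mean()
            + F.softplus(good_neg - threshold).mean())
```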



Paperid:1614
Authors:Mincheol Park, Dongjin Kim, Cheonjun Park, Yuna Park, Gyeong Eun Gong, Won Woo Ro, Suhyun Kim
Yonsei University Korea Institute of Science and Technology, Korea University Korea Institute of Science and Technology, Yonsei University, Korea Institute of Science and Technology, Hyundai MOBIS, Yonsei University, Korea Institute of Science and Technology
Abstract:
Channel pruning is widely accepted to accelerate modern convolutional neural networks (CNNs). The resulting pruned model benefits from its immediate deployment on general-purpose software and hardware resources. However, its large pruning granularity, specifically at the unit of a convolution filter, often leads to undesirable accuracy drops due to the inflexibility of deciding how and where to introduce sparsity to the CNNs. In this paper, we propose REPrune, a novel channel pruning technique that emulates kernel pruning, fully exploiting the finer but structured granularity. REPrune identifies similar kernels within each channel using agglomerative clustering. Then, it selects filters that maximize the incorporation of kernel representatives while optimizing the maximum cluster coverage problem. By integrating with a simultaneous training-pruning paradigm, REPrune promotes efficient, progressive pruning throughout training CNNs, avoiding the conventional train-prune-finetune sequence. Experimental results highlight that REPrune performs better in computer vision tasks than existing methods, effectively achieving a balance between acceleration ratio and performance retention.
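The kernel-grouping step can be pictured with off-the-shelf agglomerative clustering, as in the hedged sketch below; the filter-selection step that maximizes cluster coverage, which is the heart of REPrune, is not reproduced here.

```python
# Hedged sketch: cluster the kernels of one input channel and keep cluster means
# as representatives (illustrative pre-processing, not the full REPrune procedure).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_kernels(channel_kernels, n_clusters):
    """channel_kernels: (num_filters, k*k) flattened kernels for one input channel."""
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(channel_kernels)
    reps = np.stack([channel_kernels[labels == c].mean(axis=0) for c in range(n_clusters)])
    return labels, reps
```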



Paperid:1615
Authors:Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang
Arizona State University, University of Maryland Baltimore County, Arizona State University, Arizona State University
Abstract:
The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high-definition and realistic image generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models in learning and synthesizing novel visual concepts (a.k.a. personalized T2I), we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts and 33K composite text prompts. Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in target images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning the concepts and preserving the compositionality which existing approaches struggle to overcome. The data, code, and interactive demo are available at: https://conceptbed.github.io/



Paperid:1616
Authors:Sagar Patel, Sangeetha Abdu Jyothi, Nina Narodytska
University of California, Irvine, University of California, Irvine VMware Research, VMware Research
Abstract:
We present CrystalBox, a novel, model-agnostic, post-hoc explainability framework for Deep Reinforcement Learning (DRL) controllers in the large family of input-driven environments which includes computer systems. We combine the natural decomposability of reward functions in input-driven environments with the explanatory power of decomposed returns. We propose an efficient algorithm to generate future-based explanations across both discrete and continuous control environments. Using applications such as adaptive bitrate streaming and congestion control, we demonstrate CrystalBox's capability to generate high-fidelity explanations. We further illustrate its higher utility across three practical use cases: contrastive explanations, network observability, and guided reward design, as opposed to prior explainability techniques that identify salient features.



Paperid:1617
Authors:Hongbin Pei, Taile Chen, Chen A, Huiqi Deng, Jing Tao, Pinghui Wang, Xiaohong Guan
MOE KLINNS Lab, Xi'an Jiaotong University, China, MOE KLINNS Lab, Xi'an Jiaotong University, China, MOE KLINNS Lab, Xi'an Jiaotong University, China, Shanghai Jiao Tong University, China, MOE KLINNS Lab, Xi'an Jiaotong University, China, MOE KLINNS Lab, Xi'an Jiaotong University, China, MOE KLINNS Lab, Xi'an Jiaotong University, China
Abstract:
Molecular representation learning has emerged as a game-changer at the intersection of AI and chemistry, with great potential in applications such as drug design and materials discovery. A substantial obstacle in successfully applying molecular representation learning is the difficulty of effectively and completely characterizing and learning molecular geometry, which has not been well addressed to date. To overcome this challenge, we propose a novel framework that features a novel geometric graph, termed HAGO-Graph, and a specifically designed geometric graph learning model, HAGO-Net. In the framework, the foundation is HAGO-Graph, which enables a complete characterization of molecular geometry in a hierarchical manner. Specifically, we leverage the concept of n-body in physics to characterize geometric patterns at multiple spatial scales. We then specifically design a message passing scheme, HAGO-MPS, and implement the scheme as a geometric graph neural network, HAGO-Net, to effectively learn the representation of HAGO-Graph by horizontal and vertical aggregation. We further prove that DHAGO-Net, the derivative function of HAGO-Net, is an equivariant model. The proposed models are validated by extensive comparisons on four challenging benchmarks. Notably, the models exhibited state-of-the-art performance in molecular chirality identification and property prediction, achieving the best results on five properties of the QM9 dataset. The models also achieved competitive results on the molecular dynamics prediction task.



Paperid:1618
Authors:Cheng Peng, Ke Chen, Lidan Shou, Gang Chen
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, The State Key Laboratory of Blockchain and Data Security, Zhejiang University, The State Key Laboratory of Blockchain and Data Security, Zhejiang University, The State Key Laboratory of Blockchain and Data Security, Zhejiang University
Abstract:
Multimodal multi-label emotion recognition (MMER) aims to identify relevant emotions from multiple modalities. The challenge of MMER is how to effectively capture discriminative features for multiple labels from heterogeneous data. Recent studies are mainly devoted to exploring various fusion strategies to integrate multi-modal information into a unified representation for all labels. However, such a learning scheme not only overlooks the specificity of each modality but also fails to capture individual discriminative features for different labels. Moreover, dependencies of labels and modalities cannot be effectively modeled. To address these issues, this paper presents ContrAstive feature Reconstruction and AggregaTion (CARAT) for the MMER task. Specifically, we devise a reconstruction-based fusion mechanism to better model fine-grained modality-to-label dependencies by contrastively learning modal-separated and label-specific features. To further exploit the modality complementarity, we introduce a shuffle-based aggregation strategy to enrich co-occurrence collaboration among labels. Experiments on two benchmark datasets CMU-MOSEI and M3ED demonstrate the effectiveness of CARAT over state-of-the-art methods. Code is available at https://github.com/chengzju/CARAT.



Paperid:1619
Authors:Cheng Peng, Ke Chen, Lidan Shou, Gang Chen
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, The State Key Laboratory of Blockchain and Data Security, Zhejiang University, The State Key Laboratory of Blockchain and Data Security, Zhejiang University, The State Key Laboratory of Blockchain and Data Security, Zhejiang University
Abstract:
Multi-label few-shot aspect category detection (FS-ACD) is a challenging sentiment analysis task, which aims to learn a multi-label learning paradigm with limited training data. The difficulty of this task is how to use limited data to generalize effective discriminative representations for different categories. Nowadays, all advanced FS-ACD works utilize the prototypical network to learn label prototypes to represent different aspects. However, such point-based estimation methods are inherently noise-susceptible and bias-vulnerable. To this end, this paper proposes a novel Variational Hybrid-Attention Framework (VHAF) for the FS-ACD task. Specifically, to alleviate the data noise, we adopt a hybrid-attention mechanism to generate more discriminative aspect-specific embeddings. Then, based on these embeddings, we introduce the variational distribution inference to obtain the aspect-specific distribution as a more robust aspect representation, which can eliminate the scarce data bias for better inference. Moreover, we further leverage an adaptive threshold estimation to help VHAF better identify multiple relevant aspects. Extensive experiments on three datasets demonstrate the effectiveness of our VHAF over other state-of-the-art methods. Code is available at https://github.com/chengzju/VHAF.



Paperid:1620
Authors:Shaohui Peng, Xing Hu, Qi Yi, Rui Zhang, Jiaming Guo, Di Huang, Zikang Tian, Ruizhi Chen, Zidong Du, Qi Guo, Yunji Chen, Ling Li
The Institute of Software, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, University of Science and Technology of China, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, ISCAS,China, Institute of Computing Technology of the Chinese Academy of Sciences, Institute of Computing technology, Institute of Computing Technology, Chinese Academy of Sciences, The Institute of Software, Chinese Academy of Sciences
Abstract:
Large language models (LLMs) show their powerful automatic reasoning and planning capability with a wealth of semantic knowledge about the human world. However, the grounding problem still hinders the applications of LLMs in the real-world environment. Existing studies try to fine-tune the LLM or utilize pre-defined behavior APIs to bridge the LLMs and the environment, which not only costs huge human effort to customize for every single task but also weakens the generality of LLMs. To autonomously ground the LLM onto the environment, we propose the Hypothesis, Verification, and Induction (HYVIN) framework to automatically and progressively ground the LLM with self-driven skill learning. HYVIN first employs the LLM to propose the hypothesis of sub-goals to achieve tasks and then verifies the feasibility of the hypothesis via interacting with the underlying environment. Once verified, HYVIN can then learn generalized skills with the guidance of these successfully grounded sub-goals. These skills can be further utilized to accomplish more complex tasks that fail to pass the verification phase. Evaluated on the well-known instruction-following task suite BabyAI, HYVIN achieves performance comparable to imitation learning methods that cost millions of demonstrations on the most challenging tasks, proving the effectiveness of learned skills and showing the feasibility and efficiency of our framework.



Paperid:1621
Authors:Maximilian Pflueger, David Tena Cucala, Egor V. Kostylev
University of Oxford, University of Oxford, University of Oslo
Abstract:
The success of Graph Neural Networks (GNNs) in practice has motivated extensive research on their theoretical properties. This includes recent results that characterise node classifiers expressible by GNNs in terms of first-order logic. Most of the analysis, however, has been focused on GNNs with a fixed number of message-passing iterations (i.e., layers), which cannot realise many simple classifiers such as reachability of a node with a given label. In this paper, we start to fill this gap and study the foundations of GNNs that can perform more than a fixed number of message-passing iterations. We first formalise two generalisations of the basic GNNs: recurrent GNNs (RecGNNs), which repeatedly apply message-passing iterations until the node classifications become stable, and graph-size GNNs (GSGNNs), which exploit a built-in function of the input graph size to decide the number of message-passing iterations. We then formally prove that GNN classifiers are strictly less expressive than RecGNN ones, and RecGNN classifiers are strictly less expressive than GSGNN ones. To get this result, we identify novel semantic characterisations of the three formalisms in terms of suitable variants of bisimulation, which we believe have their own value for our understanding of GNNs. Finally, we prove syntactic logical characterisations of RecGNNs and GSGNNs analogous to the logical characterisation of plain GNNs, where we connect the two formalisms to monadic monotone fixpoint logic, a generalisation of first-order logic that supports recursion.



Paperid:1622
Authors:Vincent Pisztora, Jia Li
Pennsylvania State University, Pennsylvania State University
Abstract:
In this paper, we propose a method for the optimal allocation of observations between an intrinsically explainable glass-box model and a black-box model. An optimal allocation is defined as one that, for any given explainability level (i.e., the proportion of observations for which the explainable model is the prediction function), maximizes the performance of the ensemble on the underlying task and maximizes the performance of the explainable model on the observations allocated to it, subject to the maximal ensemble performance condition. The proposed method is shown to produce such explainability-optimal allocations on a benchmark suite of tabular datasets across a variety of explainable and black-box model types. These learned allocations are found to consistently maintain ensemble performance at very high explainability levels (explaining 74% of observations on average), and in some cases even outperform both the component explainable and black-box models while improving explainability.



Paperid:1623
Authors:Drago Plecko, Elias Bareinboim
Columbia University, Columbia University
Abstract:
Since the rise of fair machine learning as a critical field of inquiry, many different notions on how to quantify and measure discrimination have been proposed in the literature. Some of these notions, however, were shown to be mutually incompatible. Such findings make it appear that numerous different kinds of fairness exist, thereby making a consensus on the appropriate measure of fairness harder to reach, hindering the applications of these tools in practice. In this paper, we investigate one of these key impossibility results that relates the notions of statistical and predictive parity. Specifically, we derive a new causal decomposition formula for the fairness measures associated with predictive parity, and obtain a novel insight into how this criterion is related to statistical parity through the legal doctrines of disparate treatment, disparate impact, and the notion of business necessity. Our results show that through a more careful causal analysis, the notions of statistical and predictive parity are not really mutually exclusive, but complementary and spanning a spectrum of fairness notions through the concept of business necessity. Finally, we demonstrate the importance of our findings on a real-world example.



Paperid:1624
Authors:Jingyu Pu, Chenhang Cui, Xinyue Chen, Yazhou Ren, Xiaorong Pu, Zhifeng Hao, Philip S. Yu, Lifang He
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, Shantou University, University of Illinois Chicago, Lehigh University
Abstract:
In recent years, incomplete multi-view clustering (IMVC), which studies the challenging multi-view clustering problem on missing views, has received growing research interest. Previous IMVC methods suffer from the following issues: (1) inaccurate imputation for missing data, which leads to suboptimal clustering performance, and (2) most existing IMVC models merely consider the explicit presence of graph structure in data, ignoring the fact that latent graphs of different views also provide valuable information for the clustering task. To overcome such challenges, we present a novel method, termed Adaptive feature imputation with latent graph for incomplete multi-view clustering (AGDIMC). Specifically, it captures the embedded features of each view by incorporating view-specific deep encoders. Then, we construct partial latent graphs on complete data, which can consolidate the intrinsic relationships within each view while preserving the topological information. With the aim of estimating the missing sample based on the available information, we utilize an adaptive imputation layer to impute the embedded feature of missing data by using cross-view soft cluster assignments and global cluster centroids. As the imputation progresses, the portion of complete data increases, contributing to enhancing the discriminative information contained in global pseudo-labels. Meanwhile, to alleviate the negative impact caused by inferior imputed samples and the discrepancy of cluster structures, we further design an adaptive imputation strategy based on the global pseudo-label and the local cluster assignment. Experimental results on multiple real-world datasets demonstrate the effectiveness of our method over existing approaches.



Paperid:1625
Authors:Hao Qian, Hongting Zhou, Qian Zhao, Hao Chen, Hongxiang Yao, Jingwei Wang, Ziqi Liu, Fei Yu, Zhiqiang Zhang, Jun Zhou
Ant Group, Hangzhou, China, Ant Group, Hangzhou, China, Ant Group, Hangzhou, China, Ant Group, Hangzhou, China, Alibaba Group, Hangzhou, China, Ant Group, Hangzhou, China, Ant Group, Hangzhou, China, Ant Group, Hangzhou, China, Ant Group, Hangzhou, China, Ant Group, Hangzhou, China
Abstract:
The stock market is a crucial component of the financial system, but predicting the movement of stock prices is challenging due to the dynamic and intricate relations arising from various aspects such as economic indicators, financial reports, global news, and investor sentiment. Traditional sequential methods and graph-based models have been applied in stock movement prediction, but they have limitations in capturing the multifaceted and temporal influences in stock price movements. To address these challenges, the Multi-relational Dynamic Graph Neural Network (MDGNN) framework is proposed, which utilizes a discrete dynamic graph to comprehensively capture multifaceted relations among stocks and their evolution over time. The representation generated from the graph offers a complete perspective on the interrelationships among stocks and associated entities. Additionally, the power of the Transformer structure is leveraged to encode the temporal evolution of multiplex relations, providing a dynamic and effective approach to predicting stock investment. Further, our proposed MDGNN framework achieves the best performance in public datasets compared with the state-of-the-art stock investment methods.



Paperid:1626
Authors:Wei Qian, Chenxu Zhao, Yangyi Li, Fenglong Ma, Chao Zhang, Mengdi Huai
Iowa State University, Iowa State University, Iowa State University, Pennsylvania State University, Georgia Institute of Technology, Iowa Sate University
Abstract:
Despite the recent progress in deep neural networks (DNNs), it remains challenging to explain the predictions made by DNNs. Existing explanation methods for DNNs mainly focus on post-hoc explanations, where another explanatory model is employed to provide explanations. The fact that post-hoc methods can fail to reveal the actual original reasoning process of DNNs raises the need to build DNNs with built-in interpretability. Motivated by this, many self-explaining neural networks have been proposed to generate not only accurate predictions but also clear and intuitive insights into why a particular decision was made. However, existing self-explaining networks are limited in providing distribution-free uncertainty quantification for the two simultaneously generated prediction outcomes (i.e., a sample's final prediction and its corresponding explanations for interpreting that prediction). Importantly, they also fail to establish a connection between the confidence values assigned to the generated explanations in the interpretation layer and those allocated to the final predictions in the ultimate prediction layer. To tackle the aforementioned challenges, in this paper, we design a novel uncertainty modeling framework for self-explaining networks, which not only demonstrates strong distribution-free uncertainty modeling performance for the generated explanations in the interpretation layer but also excels in producing efficient and effective prediction sets for the final predictions based on the informative high-level basis explanations. We perform the theoretical analysis for the proposed framework. Extensive experimental evaluation demonstrates the effectiveness of the proposed uncertainty framework.
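As a point of reference, distribution-free prediction sets of the kind discussed here are commonly built with split conformal prediction; the generic sketch below illustrates that family of tools and is not the authors' framework for self-explaining networks.

```python
# Hedged sketch: split-conformal prediction set from nonconformity scores.
import numpy as np

def conformal_prediction_set(cal_scores, class_scores, alpha=0.1):
    """cal_scores: nonconformity scores on a calibration set (e.g., 1 - p_true);
    class_scores: per-class nonconformity scores for one test input."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))      # conformal quantile index
    qhat = np.sort(cal_scores)[min(k, n) - 1]
    return np.where(class_scores <= qhat)[0]     # labels kept in the prediction set
```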



Paperid:1627
Authors:Xiaowei Qian, Bingheng Li, Zhao Kang
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Multi-relational clustering is a challenging task due to the fact that diverse semantic information conveyed in multi-layer graphs is difficult to extract and fuse. Recent methods integrate topology structure and node attribute information through graph filtering. However, they often use a low-pass filter without fully considering the correlation among multiple graphs. To overcome this drawback, we propose to learn a graph filter motivated by the theoretical analysis of Barlow Twins. We find that input with a negative semi-definite inner product provides a lower bound for Barlow Twins loss, which prevents it from reaching a better solution. We thus learn a filter that yields an upper bound for Barlow Twins. Afterward, we design a simple clustering architecture and demonstrate its state-of-the-art performance on four benchmark datasets. The source code is available at https://github.com/XweiQ/BTGF.
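For reference, the standard Barlow Twins objective that this analysis builds on is sketched below (it drives the cross-correlation matrix of two embedding views toward the identity); the learned graph filter that yields the upper bound is the paper's contribution and is not shown.

```python
# Hedged sketch of the standard Barlow Twins loss (PyTorch).
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)   # normalize each view over the batch
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / z1.shape[0]                 # cross-correlation matrix
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return on_diag + lam * off_diag
```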



Paperid:1628
Authors:Molei Qin, Shuo Sun, Wentao Zhang, Haochong Xia, Xinrun Wang, Bo An
Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University
Abstract:
High-frequency trading (HFT) uses computer algorithms to make trading decisions in short time scales (e.g., second-level), and is widely used in the Cryptocurrency (Crypto) market (e.g., Bitcoin). Reinforcement learning (RL) in financial research has shown stellar performance on many quantitative trading tasks. However, most methods focus on low-frequency trading, e.g., day-level, which cannot be directly applied to HFT because of two challenges. First, RL for HFT involves dealing with extremely long trajectories (e.g., 2.4 million steps per month), which is hard to optimize and evaluate. Second, the dramatic price fluctuations and market trend changes of Crypto make existing algorithms fail to maintain satisfactory performance. To tackle these challenges, we propose an Efficient hieArchical Reinforcement learNing method for High Frequency Trading (EarnHFT), a novel three-stage hierarchical RL framework for HFT. In stage I, we compute a Q-teacher, i.e., the optimal action value based on dynamic programming, for enhancing the performance and training efficiency of second-level RL agents. In stage II, we construct a pool of diverse RL agents for different market trends, distinguished by return rates, where hundreds of RL agents are trained with different preferences of return rates and only a tiny fraction of them will be selected into the pool based on their profitability. In stage III, we train a minute-level router which dynamically picks a second-level agent from the pool to achieve stable performance across different markets. Through extensive experiments in various market trends on Crypto markets in a high-fidelity simulation trading environment, we demonstrate that EarnHFT significantly outperforms 6 state-of-the-art baselines on 6 popular financial criteria, exceeding the runner-up by 30% in profitability.



Paperid:1629
Authors:Zhen Qin, Feiyi Chen, Chen Zhi, Xueqiang Yan, Shuiguang Deng
College of Computer Science and Technology, Zhejiang University, Hangzhou, China, College of Computer Science and Technology, Zhejiang University, Hangzhou, China, School of Software Technology, Zhejiang University, Ningbo, China, Huawei Technologies Co. Ltd., Shanghai, China, College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Abstract:
Existing approaches defend against backdoor attacks in federated learning (FL) mainly through a) mitigating the impact of infected models, or b) excluding infected models. The former negatively impacts model accuracy, while the latter usually relies on globally clear boundaries between benign and infected model updates. However, in reality, model updates can easily become mixed and scattered throughout due to the diverse distributions of local data. This work focuses on excluding infected models in FL. Unlike previous perspectives from a global view, we propose Snowball, a novel anti-backdoor FL framework through bidirectional elections from an individual perspective, inspired by one principle deduced by us and two principles in FL and deep learning. It is characterized by a) bottom-up election, where each candidate model update votes for several peer ones such that a few model updates are elected as selectees for aggregation; and b) top-down election, where selectees progressively enlarge themselves through picking up from the candidates. We compare Snowball with state-of-the-art defenses to backdoor attacks in FL on five real-world datasets, demonstrating its superior resistance to backdoor attacks and slight impact on the accuracy of the global model.



Paperid:1630
Authors:Jiahao Qiu, Hui Yuan, Jinghong Zhang, Wentao Chen, Huazheng Wang, Mengdi Wang
Princeton University, Princeton University, University of California San Diego, MLAB Biosciences Inc, Oregon State University, Princeton University
Abstract:
While modern biotechnologies allow synthesizing new proteins and function measurements at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein. Protein engineering is typically conducted through an iterative process of adding mutations to the wild-type or lead sequences, recombination of mutations, and running new rounds of screening. To enhance the efficiency of such a process, we propose a tree search-based bandit learning method, which expands a tree starting from the initial sequence with the guidance of a bandit machine learning model. Under simplified assumptions and a Gaussian Process prior, we provide theoretical analysis and a Bayesian regret bound, demonstrating that the method can efficiently discover a near-optimal design. The full algorithm is compatible with a suite of randomized tree search heuristics, machine learning models, pre-trained embeddings, and bandit techniques. We test various instances of the algorithm across benchmark protein datasets using simulated screens. Experimental results demonstrate that the algorithm is sample-efficient, diversity-promoting, and able to find top designs using reasonably small mutation counts.



Paperid:1631
Authors:Liping Qiu, Qin Zhang, Xiaojun Chen, Shaotian Cai
Shenzhen University, Shenzhen University, Shenzhen University, Shenzhen University
Abstract:
Recently, the cross-modal pretraining model has been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pretraining model could produce poor-quality pseudo-labels and degrade clustering performance. To solve the aforementioned issue, we propose a novel Multi-level Cross-modal Alignment method to improve the alignments in a cross-modal pretraining model for downstream tasks, by building a smaller but better semantic space and aligning the images and texts in three levels, i.e., instance-level, prototype-level, and semantic-level. Theoretical results show that our proposed method converges, and suggest effective means to reduce the expected clustering risk of our method. Experimental results on five benchmark datasets clearly show the superiority of our new method.



Paperid:1632
Authors:Pengyu Qiu, Yuwen Pu, Yongchao Liu, Wenyan Liu, Yun Yue, Xiaowei Zhu, Lichun Li, Jinbao Li, Shouling Ji
Zhejiang University Ant Group, Zhejiang university, Ant Group, Ant Group Zhejiang University, Ant Group, Ant Group, Ant Group, Qilu University of Technology, Zhejiang University
Abstract:
Vertical Federated Learning (VFL) is a solution increasingly used by companies with the same user group but differing features, enabling them to collaboratively train a machine learning model. VFL ensures that clients exchange intermediate results extracted by their local models, without sharing raw data. However, in practice, VFL encounters several challenges, such as computational and communication overhead, privacy leakage risk, and adversarial attack. Our study reveals that the usage of floating-point (FP) numbers is a common factor causing these issues, as they can be redundant and contain too much information. To address this, we propose a new architecture called rounding layer, which converts intermediate results to integers. Our theoretical analysis and empirical results demonstrate the benefits of the rounding layer in reducing computation and memory overhead, providing privacy protection, preserving model performance, and mitigating adversarial attacks. We hope this paper inspires further research into novel architectures to address practical issues in VFL.
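The idea of a rounding layer can be prototyped with a straight-through estimator, as in the hedged sketch below; the exact construction and its security analysis in the paper may differ.

```python
# Hedged sketch: quantize intermediate activations to integers in the forward pass
# while letting gradients pass through unchanged (straight-through estimator).
import torch

class RoundingLayer(torch.nn.Module):
    def forward(self, x):
        return x + (torch.round(x) - x).detach()
```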



Paperid:1633
Authors:Yuning Qiu, Guoxu Zhou, Andong Wang, Zhenhao Huang, Qibin Zhao
School of Automation, Guangdong University of Technology, Guangzhou, 510006, China RIKEN Center for Advanced Intelligence Project, Tokyo, 1030027, Japan, School of Automation, Guangdong University of Technology, Guangzhou, 510006, China Key Laboratory of Intelligent Detection and The Internet of Things in Manufacturing, Ministry of Education, Guangzhou, 510006, China, RIKEN Center for Advanced Intelligence Project, Tokyo, 1030027, Japan, School of Automation, Guangdong University of Technology, Guangzhou, 510006, China, RIKEN Center for Advanced Intelligence Project, Tokyo, 1030027, Japan School of Automation, Guangdong University of Technology, Guangzhou, 510006, China
Abstract:
Conventional Outlier Robust Tensor Decomposition (ORTD) approaches generally represent sparse outlier corruption within a specific mode. However, such an assumption, which may hold for matrices, proves inadequate when applied to high-order tensors. In the tensor domain, the outliers are prone to be corrupted in multiple modes simultaneously. Addressing this limitation, this study proposes a novel ORTD approach by recovering low-rank tensors contaminated by outliers spanning multiple modes. In particular, we conceptualize outliers within high-order tensors as latent tensor group sparsity by decomposing the corrupted tensor into a sum of multiple latent components, where each latent component is exclusive to outliers within a particular direction. Thus, it can effectively mitigate the outlier corruptions prevalent in high-order tensors across multiple modes. To theoretically guarantee recovery performance, we rigorously analyze a non-asymptotic upper bound of the estimation error for the proposed ORTD approach. In the optimization process, we develop an efficient alternating direction method of multipliers (ADMM) algorithm. Empirical validation of the approach's efficacy is undertaken through comprehensive experimentation.



Paperid:1634
Authors:Zirou Qiu, Abhijin Adiga, Madhav V. Marathe, S. S. Ravi, Daniel J. Rosenkrantz, Richard E. Stearns, Anil Vullikanti
Computer Science Dept., University of Virginia Biocomplexity Institute, University of Virginia, Biocomplexity Institute, University of Virginia, Computer Science Dept., University of Virginia Biocomplexity Institute, University of Virginia, Biocomplexity Institute, University of Virginia Computer Science Dept., University at Albany – State University of New York., Biocomplexity Institute, University of Virginia Computer Science Dept., University at Albany – State University of New York., Biocomplexity Institute, University of Virginia Computer Science Dept., University at Albany – State University of New York., Computer Science Dept., University of Virginia Biocomplexity Institute, University of Virginia
Abstract:
Discrete dynamical systems are commonly used to model the spread of contagions on real-world networks. Under the PAC framework, existing research has studied the problem of learning the behavior of a system, assuming that the underlying network is known. In this work, we focus on a more challenging setting: to learn both the behavior and the underlying topology of a black-box system. We show that, in general, this learning problem is computationally intractable. On the positive side, we present efficient learning methods under the PAC model when the underlying graph of the dynamical system belongs to certain classes. Further, we examine a relaxed setting where the topology of an unknown system is partially observed. For this case, we develop an efficient PAC learner to infer the system and establish the sample complexity. Lastly, we present a formal analysis of the expressive power of the hypothesis class of dynamical systems where both the topology and behavior are unknown, using the well-known Natarajan dimension formalism. Our results provide a theoretical foundation for learning both the topology and behavior of discrete dynamical systems.



Paperid:1635
Authors:Jiahui Qu, Yuanbo Yang, Wenqian Dong, Yufei Yang
State Key Laboratory of Integrated Service Network, Xidian University, Xi’an 710071, China, State Key Laboratory of Integrated Service Network, Xidian University, Xi’an 710071, China, State Key Laboratory of Integrated Service Network, Xidian University, Xi’an 710071, China, State Key Laboratory of Integrated Service Network, Xidian University, Xi’an 710071, China
Abstract:
Recent research on the joint classification of multimodal remote sensing data has achieved great success. However, due to the limitations imposed by imaging conditions, the case of missing modalities often occurs in practice. Most previous researchers regard the classification in case of different missing modalities as independent tasks. They train a specific classification model for each fixed missing modality by extracting multimodal joint representation, which cannot handle the classification of arbitrary (including multiple and random) missing modalities. In this work, we propose a local diffusion shared-specific autoencoder (LDS2AE), which solves the classification of arbitrary missing modalities with a single model. The LDS2AE captures the data distribution of different modalities to learn multimodal shared feature for classification by designing a novel local diffusion autoencoder which consists of a modality-shared encoder and several modality-specific decoders. The modality-shared encoder is designed to extract multimodal shared feature by employing the same parameters to map multimodal data into a shared subspace. The modality-specific decoders use the multimodal shared feature to reconstruct the image of each modality, which facilitates the shared feature to learn unique information of different modalities. In addition, we incorporate masked training to the diffusion autoencoder to achieve local diffusion, which significantly reduces the training cost of the model. The approach is tested on widely-used multimodal remote sensing datasets, demonstrating the effectiveness of the proposed LDS2AE in addressing the classification of arbitrary missing modalities. The code is available at https://github.com/Jiahuiqu/LDS2AE.



Paperid:1636
Authors:Xiaofan Que, Qi Yu
Rochester Institute of Technology, Rochester Institute of Technology
Abstract:
Few-shot learning (FSL) is essential in many practical applications. However, the limited training examples make the models more vulnerable to label noise, which can lead to poor generalization capability. To address this critical challenge, we propose a curriculum meta-learning model that employs a novel dual-level class-example sampling strategy to create a robust curriculum for adaptive task distribution formulation and robust model training. The dual-level framework proposes a heuristic class sampling criterion that measures pairwise class boundary complexity to form a class curriculum; it uses effective example sampling through an under-trained proxy model to form an example curriculum. By utilizing both class-level and example-level information, our approach is more robust to handle limited training data and noisy labels that commonly occur in few-shot learning tasks. The model has efficient convergence behavior, which is verified through rigorous convergence analysis. Additionally, we establish a novel error bound through a hierarchical PAC-Bayesian analysis for curriculum meta-learning under noise. We conduct extensive experiments that demonstrate the effectiveness of our framework in outperforming existing noisy few-shot learning methods under various few-shot classification benchmarks. Our code is available at https://github.com/ritmininglab/DCML.



Paperid:1637
Authors:Victor Quétu, Enzo Tartaglione
LTCI, Télécom Paris, Institut Polytechnique de Paris, France, LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Abstract:
Neoteric works have shown that modern deep learning models can exhibit a sparse double descent phenomenon. Indeed, as the sparsity of the model increases, the test performance first worsens since the model is overfitting the training data; then, the overfitting reduces, leading to an improvement in performance, and finally, the model begins to forget critical information, resulting in underfitting. Such behavior prevents the use of traditional early-stopping criteria. In this work, we have three key contributions. First, we propose a learning framework that avoids such a phenomenon and improves generalization. Second, we introduce an entropy measure providing more insights into the insurgence of this phenomenon and enabling the use of traditional stopping criteria. Third, we provide a comprehensive quantitative analysis of contingent factors such as reinitialization methods, model width and depth, and dataset noise. The contributions are supported by empirical evidence in typical setups. Our code is available at https://github.com/VGCQ/DSD2.



Paperid:1638
Authors:Reilly Raab, Ross Boczar, Maryam Fazel, Yang Liu
University of California, Santa Cruz, University of Washington, University of Washington, University of California, Santa Cruz
Abstract:
Leading approaches to algorithmic fairness and policy-induced distribution shift are often misaligned with long-term objectives in sequential settings. We aim to correct these shortcomings by ensuring that both the objective and fairness constraints account for policy-induced distribution shift. First, we motivate this problem using an example in which individuals subject to algorithmic predictions modulate their willingness to participate with the policy maker. Fairness in this example is measured by the variance of group participation rates. Next, we develop a method for solving the resulting constrained, non-linear optimization problem and prove that this method converges to a fair, locally optimal policy given first-order information. Finally, we experimentally validate our claims in a semi-synthetic setting.



Paperid:1639
Authors:Sai Niranjan Ramachandran, Rudrabha Mukhopadhyay, Madhav Agarwal, C.V. Jawahar, Vinay Namboodiri
Indian Institute Of Science, Bangalore, International Institute of Information Technology, Hyderabad, International Institute of Information Technology, Hyderabad, International Institute of Information Technology, Hyderabad, University of Bath
Abstract:
This work tackles the important task of understanding out-of-distribution behavior in two prominent types of generative models, i.e., GANs and Diffusion models. Understanding this behavior is crucial in understanding their broader utility and risks as these systems are increasingly deployed in our daily lives. Our first contribution is demonstrating that diffusion spaces outperform GANs' latent spaces in inverting high-quality OOD images. We also provide a theoretical analysis attributing this to the lack of prior holes in diffusion spaces. Our second significant contribution is to provide a theoretical hypothesis that diffusion spaces can be projected onto a bounded hypersphere, enabling image manipulation through geodesic traversal between inverted images. Our analysis shows that different geodesics share common attributes for the same manipulation, which we leverage to perform various image manipulations. We conduct thorough empirical evaluations to support and validate our claims. Finally, our third and final contribution introduces a novel approach to the few-shot sampling for out-of-distribution data by inverting a few images to sample from the cluster formed by the inverted latents. The proposed technique achieves state-of-the-art results for the few-shot generation task in terms of image quality. Our research underscores the promise of diffusion spaces in out-of-distribution imaging and offers avenues for further exploration. Please find more details about our project at http://cvit.iiit.ac.in/research/projects/cvit-projects/diffusionOOD



Paperid:1640
Authors:Nisal Ranasinghe, Damith Senanayake, Sachith Seneviratne, Malin Premaratne, Saman Halgamuge
University of Melbourne, University of Melbourne, University of Melbourne, Monash University, University of Melbourne
Abstract:
Traditional machine learning is generally treated as a blackbox optimization problem and does not typically produce interpretable functions that connect inputs and outputs. However, the ability to discover such interpretable functions is desirable. In this work, we propose GINN-LP, an interpretable neural network to discover the form and coefficients of the underlying equation of a dataset, when the equation is assumed to take the form of a multivariate Laurent Polynomial. This is facilitated by a new type of interpretable neural network block, named the “power-term approximator block”, consisting of logarithmic and exponential activation functions. GINN-LP is end-to-end differentiable, making it possible to use backpropagation for training. We propose a neural network growth strategy that will enable finding the suitable number of terms in the Laurent polynomial that represents the data, along with sparsity regularization to promote the discovery of concise equations. To the best of our knowledge, this is the first model that can discover arbitrary multivariate Laurent polynomial terms without any prior information on the order. Our approach is first evaluated on a subset of data used in SRBench, a benchmark for symbolic regression. We first show that GINN-LP outperforms the state-of-the-art symbolic regression methods on datasets generated using 48 real-world equations in the form of multivariate Laurent polynomials. Next, we propose an ensemble method that combines our method with a high-performing symbolic regression method, enabling us to discover non-Laurent polynomial equations. We achieve state-of-the-art results in equation discovery, showing an absolute improvement of 7.1% over the best contender, by applying this ensemble method to 113 datasets within SRBench with known ground-truth equations.
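The "power-term approximator block" is the key construction here: because exp(w·log x) equals the product of the inputs raised to the powers w, a linear layer sandwiched between log and exp activations realizes a multivariate Laurent monomial with learnable (possibly negative) exponents. The sketch below is a hypothetical, simplified rendering of that idea, not the paper's exact architecture; the epsilon term and the use of absolute values are assumptions to keep the logarithm finite.

```python
import torch
import torch.nn as nn

class PowerTermApproximator(nn.Module):
    """Sketch of a log/exp block: exp(W @ log|x|) yields products of the
    inputs raised to learnable powers, i.e., Laurent monomials."""
    def __init__(self, n_inputs: int, n_terms: int, eps: float = 1e-8):
        super().__init__()
        self.exponents = nn.Linear(n_inputs, n_terms, bias=False)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.exp(self.exponents(torch.log(x.abs() + self.eps)))

class LaurentPolynomial(nn.Module):
    """A learnable linear combination of power terms."""
    def __init__(self, n_inputs: int, n_terms: int):
        super().__init__()
        self.terms = PowerTermApproximator(n_inputs, n_terms)
        self.coeffs = nn.Linear(n_terms, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.coeffs(self.terms(x))

model = LaurentPolynomial(n_inputs=3, n_terms=4)
y = model(torch.rand(16, 3) + 0.1)   # end-to-end differentiable
```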



Paperid:1641
Authors:Muhammad Rashid, Elvio G. Amparore, Enrico Ferrari, Damiano Verda
University of Torino, Computer Science Department, C.so Svizzera 185, 10149 Torino, Italy, University of Torino, Computer Science Department, C.so Svizzera 185, 10149 Torino, Italy, Rulex Innovation Labs, Via Felice Romani 9, 16122 Genova, Italy, Rulex Innovation Labs, Via Felice Romani 9, 16122 Genova, Italy
Abstract:
We investigate the use of a stratified sampling approach for LIME Image, a popular model-agnostic explainable AI method for computer vision tasks, in order to reduce the artifacts generated by typical Monte Carlo sampling. Such artifacts are due to the undersampling of the dependent variable in the synthetic neighborhood around the image being explained, which may result in inadequate explanations due to the impossibility of fitting a linear regressor on the sampled data. We then highlight a connection with the Shapley theory, where similar arguments about undersampling and sample relevance were suggested in the past. We derive all the formulas and adjustment factors required for an unbiased stratified sampling estimator. Experiments show the efficacy of the proposed approach.



Paperid:1642
Authors:Abbavaram Gowtham Reddy, Vineeth N Balasubramanian
Indian Institute of Technology, Hyderabad, Indian Institute of Technology, Hyderabad
Abstract:
Causal effect estimation from observational data is a central problem in causal inference. Methods based on potential outcomes framework solve this problem by exploiting inductive biases and heuristics from causal inference. Each of these methods addresses a specific aspect of causal effect estimation, such as controlling propensity score, enforcing randomization, etc., by designing neural network (NN) architectures and regularizers. In this paper, we propose an adaptive method called Neurosymbolic Causal Effect Estimator (NESTER), a generalized method for causal effect estimation. NESTER integrates the ideas used in existing methods based on multi-head NNs for causal effect estimation into one framework. We design a Domain Specific Language (DSL) tailored for causal effect estimation based on causal inductive biases used in literature. We conduct a theoretical analysis to investigate NESTER's efficacy in estimating causal effects. Our comprehensive empirical results show that NESTER performs better than state-of-the-art methods on benchmark datasets.



Paperid:1643
Authors:Abbavaram Gowtham Reddy, Saketh Bachu, Harsharaj Pathak, Benin Godfrey L, Varshaneya V, Vineeth N Balasubramanian, Satyanarayan Kar
Indian Institute of Technology Hyderabad, India, Indian Institute of Technology Hyderabad, India, Indian Institute of Technology Hyderabad, India, Indian Institute of Technology Hyderabad, India, Honeywell, Bengaluru, India, Indian Institute of Technology Hyderabad, India, Honeywell, Bengaluru, India
Abstract:
Recently, there has been a growing interest in learning and explaining causal effects within Neural Network (NN) models. By virtue of NN architectures, previous approaches consider only direct and total causal effects assuming independence among input variables. We view an NN as a structural causal model (SCM) and extend our focus to include indirect causal effects by introducing feedforward connections among input neurons. We propose an ante-hoc method that captures and maintains direct, indirect, and total causal effects during NN model training. We also propose an algorithm for quantifying learned causal effects in an NN model and efficient approximation strategies for quantifying causal effects in high-dimensional data. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the causal effects learned by our ante-hoc method better approximate the ground truth effects compared to existing methods.



Paperid:1644
Authors:Li Ren, Chen Chen, Liqiang Wang, Kien Hua
University of Central Florida, University of Central Florida, University of Central Florida, University of Central Florida
Abstract:
Deep Metric Learning (DML) plays an important role in modern computer vision research, where we learn a distance metric for a set of image representations. Recent DML techniques utilize the proxy to interact with the corresponding image samples in the embedding space. However, existing proxy-based DML methods focus on learning individual proxy-to-sample distance, while the overall distribution of samples and proxies lacks attention. In this paper, we present a novel proxy-based DML framework that focuses on aligning the sample and proxy distributions to improve the efficiency of proxy-based DML losses. Specifically, we propose the Data-Augmented Domain Adaptation (DADA) method to adapt the domain gap between the group of samples and proxies. To the best of our knowledge, we are the first to leverage domain adaptation to boost the performance of proxy-based DML. We show that our method can be easily plugged into existing proxy-based DML losses. Our experiments on benchmarks, including the popular CUB-200-2011, CARS196, Stanford Online Products, and In-Shop Clothes Retrieval, show that our learning algorithm significantly improves the existing proxy losses and achieves superior results compared to the existing methods. The code and Appendix are available at: https://github.com/Noahsark/DADA



Paperid:1645
Authors:Yinuo Ren, Yiping Lu, Lexing Ying, Grant M. Rotskoff
Stanford University, NYU, Stanford University, Stanford University
Abstract:
Inferring a diffusion equation from discretely observed measurements is a statistical challenge of significant importance in a variety of fields, from single-molecule tracking in biophysical systems to modeling financial instruments. Assuming that the underlying dynamical process obeys a d-dimensional stochastic differential equation of the form dx_t = b(x_t)dt + \Sigma(x_t)dw_t, we propose neural network-based estimators of both the drift b and the spatially-inhomogeneous diffusion tensor D = \Sigma\Sigma^T/2 and provide statistical convergence guarantees when b and D are s-Hölder continuous. Notably, our bound aligns with the minimax optimal rate N^{-\frac{2s}{2s+d}} for nonparametric function estimation even in the presence of correlation within observational data, which necessitates careful handling when establishing fast-rate generalization bounds. Our theoretical results are bolstered by numerical experiments demonstrating accurate inference of spatially-inhomogeneous diffusion tensors.
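As a rough illustration of what "neural network-based estimators of the drift and diffusion tensor" can look like, the sketch below fits b via least squares on increments and D via a quadratic-variation target. The architectures, losses, and step size are assumptions for illustration only and are not the paper's estimators.

```python
import torch
import torch.nn as nn

# Toy setup: observations x_0, ..., x_N of a d-dimensional path at step dt.
d, dt = 2, 1e-2
drift_net = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, d))
diff_net = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, d * d))

def losses(x: torch.Tensor):
    """x has shape (N+1, d); returns drift and diffusion matching losses."""
    x_now, x_next = x[:-1], x[1:]
    inc = x_next - x_now
    b = drift_net(x_now)
    drift_loss = ((inc / dt - b) ** 2).mean()
    # Quadratic-variation target: (dx - b dt)(dx - b dt)^T / (2 dt) ~ D(x).
    resid = inc - b.detach() * dt
    target = resid.unsqueeze(2) * resid.unsqueeze(1) / (2 * dt)
    D = diff_net(x_now).view(-1, d, d)
    diff_loss = ((D - target) ** 2).mean()
    return drift_loss, diff_loss

# Example on a toy Brownian-like path.
x = torch.cumsum(torch.randn(1000, d) * dt ** 0.5, dim=0)
drift_loss, diff_loss = losses(x)
```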



Paperid:1646
Authors:Rob Romijnders, Christos Louizos, Yuki M. Asano, Max Welling
University of Amsterdam, Qualcomm AI research, University of Amsterdam, University of Amsterdam
Abstract:
The pandemic in 2020 and 2021 had enormous economic and societal consequences, and studies show that contact tracing algorithms can be key in the early containment of the virus. While large strides have been made towards more effective contact tracing algorithms, we argue that privacy concerns currently hold deployment back. The essence of a contact tracing algorithm constitutes the communication of a risk score. Yet, it is precisely the communication and release of this score to a user that an adversary can leverage to gauge the private health status of an individual. We pinpoint a realistic attack scenario and propose a contact tracing algorithm with differential privacy guarantees against this attack. The algorithm is tested on the two most widely used agent-based COVID-19 simulators and demonstrates superior performance in a wide range of settings. Especially for realistic test scenarios and while releasing each risk score with epsilon=1 differential privacy, we achieve a two to ten-fold reduction in the infection rate of the virus. To the best of our knowledge, this presents the first contact tracing algorithm with differential privacy guarantees when revealing risk scores for COVID-19.



Paperid:1647
Authors:Harrison Rosenberg, Shimaa Ahmed, Guruprasad Ramesh, Kassem Fawaz, Ramya Korlakai Vinayak
University of Wisconsin-Madison, University of Wisconsin-Madison, University of Wisconsin-Madison, University of Wisconsin-Madison, University of Wisconsin-Madison
Abstract:
Text-to-image diffusion models have achieved widespread popularity due to their unprecedented image generation capability. In particular, their ability to synthesize and modify human faces has spurred research into using generated face images in both training data augmentation and model performance assessments. In this paper, we study the efficacy and shortcomings of generative models in the context of face generation. Utilizing a combination of qualitative and quantitative measures, including embedding-based metrics and user studies, we present a framework to audit the characteristics of generated faces conditioned on a set of social attributes. We applied our framework on faces generated through state-of-the-art text-to-image diffusion models. We identify several limitations of face image generation that include faithfulness to the text prompt, demographic disparities, and distributional shifts. Furthermore, we present an analytical model that provides insights into how training data selection contributes to the performance of generative models. Our survey data and analytics code can be found online at https://github.com/wi-pi/Limitations_of_Face_Generation



Paperid:1648
Authors:Shuvendu Roy, Ali Etemad
Queen's University, Queen's University
Abstract:
We propose UnMixMatch, a semi-supervised learning framework which can learn effective representations from unconstrained unlabelled data in order to scale up performance. Most existing semi-supervised methods rely on the assumption that labelled and unlabelled samples are drawn from the same distribution, which limits the potential for improvement through the use of free-living unlabelled data. Consequently, the generalizability and scalability of semi-supervised learning are often hindered by this assumption. Our method aims to overcome these constraints and effectively utilize unconstrained unlabelled data in semi-supervised learning. UnMixMatch consists of three main components: a supervised learner with hard augmentations that provides strong regularization, a contrastive consistency regularizer to learn underlying representations from the unlabelled data, and a self-supervised loss to enhance the representations that are learnt from the unlabelled data. We perform extensive experiments on 4 commonly used datasets and demonstrate superior performance over existing semi-supervised methods with a performance boost of 4.79%. Extensive ablation and sensitivity studies show the effectiveness and impact of each of the proposed components of our method. The code for our work is publicly available.



Paperid:1649
Authors:Hyun Ryu, Sunjae Yoon, Hee Suk Yoon, Eunseop Yoon, Chang D. Yoo
KAIST, KAIST, KAIST, KAIST, KAIST
Abstract:
Data augmentation is a crucial component in training neural networks to overcome the limitation imposed by data size, and several techniques have been studied for time series. Although these techniques are effective in certain tasks, they have yet to be generalized to time series benchmarks. We find that current data augmentation techniques ruin the core information contained within the frequency domain. To address this issue, we propose a simple strategy to preserve spectral information (SimPSI) in time series data augmentation. SimPSI preserves the spectral information by mixing the original and augmented input spectrum weighted by a preservation map, which indicates the importance score of each frequency. Specifically, our experimental contributions are to build three distinct preservation maps: magnitude spectrum, saliency map, and spectrum-preservative map. We apply SimPSI to various time series data augmentations and evaluate its effectiveness across a wide range of time series benchmarks. Our experimental results support that SimPSI considerably enhances the performance of time series data augmentations by preserving core spectral information. The source code used in the paper is available at https://github.com/Hyun-Ryu/simpsi.
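The central operation is mixing the original and augmented spectra under a per-frequency preservation map. The snippet below is a minimal sketch of that mixing step, assuming a magnitude-spectrum preservation map; the exact map construction and normalization in the paper may differ.

```python
import numpy as np

def simpsi_mix(x_orig, x_aug, preservation_map):
    """Keep the original spectrum where the preservation map marks a frequency
    as important, and take the augmented spectrum elsewhere (assumed form)."""
    X_orig = np.fft.rfft(x_orig)
    X_aug = np.fft.rfft(x_aug)
    p = np.clip(preservation_map, 0.0, 1.0)        # importance per frequency bin
    X_mixed = p * X_orig + (1.0 - p) * X_aug
    return np.fft.irfft(X_mixed, n=len(x_orig))

# Example with a magnitude-spectrum preservation map and a jitter augmentation.
t = np.linspace(0, 1, 256, endpoint=False)
x = np.sin(2 * np.pi * 5 * t)
x_aug = x + 0.3 * np.random.randn(len(x))
mag = np.abs(np.fft.rfft(x))
p_map = mag / (mag.max() + 1e-12)
x_mixed = simpsi_mix(x, x_aug, p_map)
```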



Paperid:1650
Authors:Augusto Santos, Diogo Rente, Rui Seabra, José M. F. Moura
Instituto de Telecomunicações-IT, Lisbon, Portugal, Department of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, PA, USA, Department of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, PA, USA, Department of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, PA, USA
Abstract:
This paper considers learning the hidden causal network of a linear networked dynamical system (NDS) from the time series data at some of its nodes (partial observability). The dynamics of the NDS are driven by colored noise that generates spurious associations across pairs of nodes, rendering the problem much harder. To address the challenge of noise correlation and partial observability, we assign to each pair of nodes a feature vector computed from the time series data of observed nodes. The feature embedding is engineered to yield structural consistency: there exists an affine hyperplane that consistently partitions the set of features, separating the feature vectors corresponding to connected pairs of nodes from those corresponding to disconnected pairs. The causal inference problem is thus addressed via clustering the designed features. We demonstrate with simple baseline supervised methods the competitive performance of the proposed causal inference mechanism under broad connectivity regimes and noise correlation levels, including a real-world network. Further, we devise novel technical guarantees of structural consistency for linear NDS under the considered regime.



Paperid:1651
Authors:Pritam Sarkar, Ali Etemad
Queen's University Vector Institute, Queen's University
Abstract:
We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled videos. XKD is trained with two pseudo objectives. First, masked data reconstruction is performed to learn modality-specific representations from audio and visual streams. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through a teacher-student setup to learn complementary information. We introduce a novel domain alignment strategy to tackle domain discrepancy between audio and visual modalities enabling effective cross-modal knowledge distillation. Additionally, to develop a general-purpose network capable of handling both audio and visual streams, modality-agnostic variants of XKD are introduced, which use the same pretrained backbone for different audio and visual tasks. Our proposed cross-modal knowledge distillation improves video action classification by 8% to 14% on UCF101, HMDB51, and Kinetics400. Additionally, XKD improves multimodal action classification by 5.5% on Kinetics-Sound. XKD shows state-of-the-art performance in sound classification on ESC50, achieving top-1 accuracy of 96.5%.



Paperid:1652
Authors:Johannes Schneider, Mohit Prabhushankar
University of Liechtenstein, Georgia Institute of Technology
Abstract:
The learning dynamics of deep neural networks are not well understood. The information bottleneck (IB) theory proclaimed separate fitting and compression phases, but these phases have since been heavily debated. We comprehensively analyze the learning dynamics by investigating a layer's reconstruction ability of the input and prediction performance based on the evolution of parameters during training. We empirically show the existence of three phases using common datasets and architectures such as ResNet and VGG: (i) near constant reconstruction loss, (ii) decrease, and (iii) increase. We also derive an empirically grounded data model and prove the existence of phases for single-layer networks. Technically, our approach leverages classical complexity analysis. It differs from IB by relying on measuring reconstruction loss rather than information theoretic measures to relate information of intermediate layers and inputs. Our work implies a new best practice for transfer learning: We show empirically that the pre-training of a classifier should stop well before its performance is optimal.
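One simple way to realize "a layer's reconstruction ability of the input" is to fit a small decoder from the layer's activations back to the input and record the resulting loss at each training epoch. The probe below is only an assumed protocol for illustration; the paper's exact probing setup may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def layer_reconstruction_loss(layer_acts, inputs, steps=200, lr=1e-2):
    """Fit a linear decoder mapping a layer's activations back to the
    flattened input and report the final reconstruction MSE (assumed probe)."""
    acts = layer_acts.detach()
    flat_in = inputs.flatten(1).detach()
    decoder = nn.Linear(acts.shape[1], flat_in.shape[1])
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(decoder(acts), flat_in)
        loss.backward()
        opt.step()
    return loss.item()

# Example: probe a hidden layer of a toy MLP on a batch of flattened images.
x = torch.randn(128, 3 * 32 * 32)
hidden = nn.Sequential(nn.Linear(3 * 32 * 32, 256), nn.ReLU())(x)
print(layer_reconstruction_loss(hidden, x))
```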



Paperid:1653
Authors:Caleb Schultz Kisby, Saúl A. Blanco, Lawrence S. Moss
Department of Computer Science, Indiana University, Department of Computer Science, Indiana University, Department of Mathematics, Indiana University
Abstract:
This paper is a contribution to neural network semantics, a foundational framework for neurosymbolic AI. The key insight of this theory is that logical operators can be mapped to operators on neural network states. In this paper, we do this for a neural network learning operator. We map a dynamic operator [φ] to iterated Hebbian learning, a simple learning policy that updates a neural network by repeatedly applying Hebb's learning rule until the net reaches a fixed-point. Our main result is that we can "translate away" [φ]-formulas via reduction axioms. This means that completeness for the logic of iterated Hebbian learning follows from completeness of the base logic. These reduction axioms also provide (1) a human-interpretable description of iterated Hebbian learning as a kind of plausibility upgrade, and (2) an approach to building neural networks with guarantees on what they can learn.
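To make the learning operator concrete, the toy below repeatedly propagates a state and applies Hebb's rule until the state stops changing, i.e., the net reaches a fixed point. The binary units, synchronous updates, and learning rate are assumptions for illustration and do not reflect the paper's formal semantics.

```python
import numpy as np

def hebbian_update(W, state, eta=0.1):
    """One application of Hebb's rule on a binary (+1/-1) state vector."""
    return W + eta * np.outer(state, state)

def iterated_hebbian_learning(W, x, eta=0.1, max_iters=100):
    """Repeat: propagate the state, apply Hebb's rule, stop at a fixed point."""
    state = np.sign(x)
    for _ in range(max_iters):
        new_state = np.sign(W @ state)
        new_state[new_state == 0] = 1
        if np.array_equal(new_state, state):
            break                       # fixed point reached
        W = hebbian_update(W, new_state, eta)
        state = new_state
    return W, state

W0 = np.zeros((8, 8))
x = np.random.choice([-1.0, 1.0], size=8)
W_learned, fixed_point = iterated_hebbian_learning(W0, x)
```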



Paperid:1654
Authors:Sanket Shah, Bryan Wilder, Andrew Perrault, Milind Tambe
Harvard University, Carnegie Mellon University, The Ohio State University, Harvard University
Abstract:
Predict-then-Optimize is a framework for using machine learning to perform decision-making under uncertainty. The central research question it asks is, "How can we use the structure of a decision-making task to tailor ML models for that specific task?" To this end, recent work has proposed learning task-specific loss functions that capture this underlying structure. However, current approaches make restrictive assumptions about the form of these losses and their impact on ML model behavior. These assumptions lead to approaches with high computational cost and, when violated in practice, poor performance. In this paper, we propose solutions to these issues, avoiding the aforementioned assumptions and utilizing the ML model's features to increase the sample efficiency of learning loss functions. We empirically show that our method achieves state-of-the-art results in four domains from the literature, often requiring an order of magnitude fewer samples than comparable methods from past work. Moreover, our approach outperforms the best existing method by nearly 200% when the localness assumption is broken.



Paperid:1655
Authors:Siyuan Shan, Yang Li, Amartya Banerjee, Junier B. Oliva
Department of Computer Science, University of North Carolina at Chapel Hill, Department of Computer Science, University of North Carolina at Chapel Hill, Department of Computer Science, University of North Carolina at Chapel Hill, Department of Computer Science, University of North Carolina at Chapel Hill
Abstract:
Voice conversion (VC) aims at altering a person's voice to make it sound similar to the voice of another person while preserving linguistic content. Existing methods suffer from a dilemma between content intelligibility and speaker similarity; i.e., methods with higher intelligibility usually have a lower speaker similarity, while methods with higher speaker similarity usually require plenty of target speaker voice data to achieve high intelligibility. In this work, we propose a novel method Phoneme Hallucinator that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it adopts a novel model to hallucinate diversified and high-fidelity target speaker phonemes based just on a short target speaker voice (e.g. 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model that requires no text annotations and supports conversion to any unseen speaker. Quantitative and qualitative evaluations show that Phoneme Hallucinator outperforms existing VC methods for both intelligibility and speaker similarity.



Paperid:1656
Authors:Dravyansh Sharma
Carnegie Mellon University
Abstract:
Internal regret is a measure of performance of an online learning algorithm, which measures the change in performance by substituting every occurrence of a given action i by an alternative action j. Algorithms for minimizing internal regret are known for the finite experts setting, including a general reduction to the problem of minimizing external regret for this case. The reduction however crucially depends on the finiteness of the action space. In this work we approach the problem of minimizing internal regret for a continuous action space. For the full information setting, we show how to obtain O(sqrt(T)) internal regret for the class of Lipschitz functions, as well as non-Lipschitz dispersed functions, i.e. the non-Lipschitzness may not concentrate in a small region of the action space. We also consider extensions to partial feedback settings, and again obtain sublinear internal regret. Finally we discuss applications of internal regret minimization over continuous spaces to correlated equilibria in pricing problems and auction design, as well as to data-driven hyperparameter tuning.
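For reference, in the classical finite-experts setting (stated here only to fix notation; the paper's contribution is the extension to continuous action spaces), the internal regret of plays $a_1,\dots,a_T$ with per-round losses $\ell_t$ is

\[
R_{i \to j}(T) = \sum_{t=1}^{T} \mathbf{1}\{a_t = i\}\,\bigl(\ell_t(i) - \ell_t(j)\bigr),
\qquad
R_{\mathrm{int}}(T) = \max_{i \neq j} R_{i \to j}(T),
\]

i.e., the largest improvement obtainable by retroactively replacing every play of some action $i$ with an alternative action $j$.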



Paperid:1657
Authors:Junhao Shen, Hong Qian, Wei Zhang, Aimin Zhou
East China Normal University, East China Normal University, East China Normal University, East China Normal University
Abstract:
Cognitive diagnosis assessment is a fundamental and crucial task for student learning. It models the student-exercise interaction, and discovers the students' proficiency levels on each knowledge attribute. In real-world intelligent education systems, generalization and interpretability of cognitive diagnosis methods are of equal importance. However, most existing methods can hardly make the best of both worlds due to the complicated student-exercise interaction. To this end, this paper proposes a symbolic cognitive diagnosis (SCD) framework to simultaneously enhance generalization and interpretability. The SCD framework incorporates the symbolic tree to explicably represent the complicated student-exercise interaction function, and utilizes gradient-based optimization methods to effectively learn the student and exercise parameters. Meanwhile, the accompanying challenge is that we need to tunnel the discrete symbolic representation and continuous parameter optimization. To address this challenge, we propose to hybridly optimize the representation and parameters in an alternating manner. To fulfill SCD, it alternately learns the symbolic tree by derivative-free genetic programming and learns the student and exercise parameters via gradient-based Adam. The extensive experimental results on various real-world datasets show the superiority of SCD on both generalization and interpretability. The ablation study verifies the efficacy of each ingredient in SCD, and the case study explicitly showcases how the interpretable ability of SCD works.



Paperid:1658
Authors:Zhecheng Sheng, Tianhao Zhang, Chen Jiang, Dongyeop Kang
University of Minnesota, Minneapolis, MN, University of Minnesota, Minneapolis, MN, University of Minnesota, Minneapolis, MN, University of Minnesota, Minneapolis, MN
Abstract:
Measuring the coherence of text is a vital aspect of evaluating the quality of written content. Recent advancements in neural coherence modeling have demonstrated their efficacy in capturing entity coreference and discourse relations, thereby enhancing coherence evaluation. However, many existing methods heavily depend on static embeddings or focus narrowly on nearby context, constraining their capacity to measure the overarching coherence of long texts. In this paper, we posit that coherent texts inherently manifest a sequential and cohesive interplay among sentences, effectively conveying the central theme, purpose, or standpoint. To explore this abstract relationship, we introduce the "BB Score," a novel reference-free metric grounded in Brownian bridge theory for assessing text coherence. Our findings showcase that when synergized with a simple additional classification component, this metric attains a performance level comparable to state-of-the-art techniques on standard artificial discrimination tasks. We also establish in downstream tasks that this metric effectively differentiates between human-written documents and text generated by large language models within specific domains. Furthermore, we illustrate the efficacy of this approach in detecting written styles attributed to various large language models, underscoring its potential for generalizability. In summary, we present a novel Brownian bridge coherence metric capable of measuring both local and global text coherence, while circumventing the need for end-to-end model training. This flexibility allows for its application in various downstream tasks.



Paperid:1659
Authors:Boyu Shi, Shiyu Xia, Xu Yang, Haokun Chen, Zhiqiang Kou, Xin Geng
Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Abstract:
Recently, Stitchable Neural Networks (SN-Net) were proposed to stitch pre-trained networks for quickly building numerous networks with different complexity and performance trade-offs. In this way, the burdens of designing or training the variable-sized networks, which can be used in application scenarios with diverse resource constraints, are alleviated. However, SN-Net still faces a few challenges. 1) Stitching from multiple independently pre-trained anchors introduces high storage resource consumption. 2) SN-Net struggles to build smaller models for low resource constraints. 3) SN-Net uses an unlearned initialization method for stitch layers, limiting the final performance. To overcome these challenges, motivated by the recently proposed Learngene framework, we propose a novel method called Learngene Pool. Briefly, Learngene distills the critical knowledge from a large pre-trained model into a small part (termed as learngene) and then expands this small part into a few variable-sized models. In our proposed method, we distill one pre-trained large model into multiple small models whose network blocks are used as learngene instances to construct the learngene pool. Since only one large model is used, we do not need to store more large models as SN-Net and after distilling, smaller learngene instances can be created to build small models to satisfy low resource constraints. We also insert learnable transformation matrices between the instances to stitch them into variable-sized models to improve the performance of these models. Extensive experiments have been conducted, and the results validate the effectiveness of the proposed Learngene Pool compared with SN-Net.



Paperid:1660
Authors:Jiangming Shi, Shanshan Zheng, Xiangbo Yin, Yang Lu, Yuan Xie, Yanyun Qu
Xiamen University, Xiamen University, Xiamen University, Xiamen University, East China Normal University, Xiamen University
Abstract:
Federated learning (FL) provides a decentralized machine learning paradigm where a server collaborates with a group of clients to learn a global model without accessing the clients' data. User heterogeneity is a significant challenge for FL, which together with the class-distribution imbalance further enhances the difficulty of FL. Great progress has been made in large vision-language models, such as Contrastive Language-Image Pre-training (CLIP), which paves a new way for image classification and object recognition. Inspired by the success of CLIP on few-shot and zero-shot learning, we use CLIP to optimize the federated learning between server and client models under its vision-language supervision. It is promising to mitigate the user heterogeneity and class-distribution imbalance due to the powerful cross-modality representation and rich open-vocabulary prior knowledge. In this paper, we propose the CLIP-guided FL (CLIP2FL) method on heterogeneous and long-tailed data. In CLIP2FL, the knowledge of the off-the-shelf CLIP model is transferred to the client-server models, and a bridge is built between the client and server. Specifically, for client-side learning, knowledge distillation is conducted between client models and CLIP to improve the ability of client-side feature representation. For server-side learning, in order to mitigate the heterogeneity and class-distribution imbalance, we generate federated features to retrain the server model. A prototype contrastive learning with the supervision of the text encoder of CLIP is introduced to generate federated features depending on the client-side gradients, and they are used to retrain a balanced server classifier. Extensive experimental results on several benchmarks demonstrate that CLIP2FL achieves impressive performance and effectively deals with data heterogeneity and long-tail distribution. The code is available at https://github.com/shijiangming1/CLIP2FL.



Paperid:1661
Authors:Lei Shi, Bin Hu, Deng Zhao, Jianshan He, Zhiqiang Zhang, Jun Zhou
Machine Intelligence Department, Ant Group, Machine Intelligence Department, Ant Group, Machine Intelligence Department, Ant Group, Consumer Finance Technology Department, Ant Group, Machine Intelligence Department, Ant Group, Machine Intelligence Department, Ant Group
Abstract:
Link prediction is a fundamental task of graph machine learning, and Graph Neural Network (GNN) based methods have become the mainstream approach due to their good performance. However, the typical practice learns node representations through neighborhood aggregation, lacking awareness of the structural relationships between target nodes. Recently, some methods have attempted to address this issue by node labeling tricks. However, they still rely on the node-centric neighborhood message passing of GNNs, which we believe involves two limitations in terms of information perception and transmission for link prediction. First, it cannot perceive long-range structural information due to the restricted receptive fields. Second, there may be information loss when a node-centric model is applied to a link-centric task. In addition, we empirically find that the neighbor node features could introduce noise for link prediction. To address these issues, we propose a structural information enhanced link prediction framework, which involves removing the neighbor node features while fitting neighborhood graph structures more focused through GNN. Furthermore, we introduce Binary Structural Transformer (BST) to encode the structural relationships between target nodes, complementing the deficiency of GNN. Our approach achieves remarkable results on multiple popular benchmarks, including ranking first on ogbl-ppa, ogbl-citation2 and Pubmed.



Paperid:1662
Authors:Lianghe Shi, Weiwei Liu
Wuhan University, Wuhan University
Abstract:
Curriculum adversarial training empirically finds that gradually increasing the hardness of adversarial examples can further improve the adversarial robustness of the trained model compared to conventional adversarial training. However, theoretical understanding of this strategy remains limited. In an attempt to bridge this gap, we analyze the adversarial training process from an online perspective. Specifically, we treat adversarial examples in different iterations as samples from different adversarial distributions. We then introduce the time series prediction framework and deduce novel generalization error bounds. Our theoretical results not only demonstrate the effectiveness of the conventional adversarial training algorithm but also explain why curriculum adversarial training methods can further improve adversarial generalization. We conduct comprehensive experiments to support our theory.



Paperid:1663
Authors:Liangliang Shi, Zhaoqi Shen, Junchi Yan
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Optimal transport (OT) is attracting increasing attention in machine learning. It aims to transport a source distribution to a target one at minimal cost. In its vanilla form, the source and target distributions are predetermined, which contrasts with real-world cases involving undetermined targets. In this paper, we propose Doubly Bounded Optimal Transport (DB-OT), which assumes that the target distribution is restricted within two boundaries instead of a fixed one, thus giving more freedom for the transport to find solutions. Based on the entropic regularization of DB-OT, three scaling-based algorithms are devised for calculating the optimal solution. We also show that our DB-OT is helpful for barycenter-based clustering, which can avoid the excessive concentration of samples in a single cluster. Then we further develop DB-OT techniques for long-tailed classification which is an emerging and open problem. We first propose a connection between OT and classification, that is, in the classification task, training involves optimizing the Inverse OT to learn the representations, while testing involves optimizing the OT for predictions. With this OT perspective, we first apply DB-OT to improve the loss, and the Balanced Softmax is shown as a special case. Then we apply DB-OT for inference in the testing process. Even with vanilla Softmax trained features, our experiments show that our method can achieve good results with our improved inference scheme in the testing stage.



Paperid:1664
Authors:Hyunjune Shin, Dong-Wan Choi
Inha University, Inha University
Abstract:
Data-free knowledge distillation (DFKD) aims to distill pretrained knowledge to a student model with the help of a generator without using original data. In such data-free scenarios, achieving stable performance of DFKD is essential due to the unavailability of validation data. Unfortunately, this paper has discovered that existing DFKD methods are quite sensitive to different teacher models, occasionally showing catastrophic failures of distillation, even when using well-trained teacher models. Our observation is that the generator in DFKD is not always guaranteed to produce precise yet diverse samples using the existing representative strategy of minimizing both class-prior and adversarial losses. Through our empirical study, we focus on the fact that class-prior not only decreases the diversity of generated samples, but also cannot completely address the problem of generating unexpectedly low-quality samples depending on teacher models. In this paper, we propose the teacher-agnostic data-free knowledge distillation (TA-DFKD) method, with the goal of more robust and stable performance regardless of teacher models. Our basic idea is to assign the teacher model a lenient expert role for evaluating samples, rather than a strict supervisor that enforces its class-prior on the generator. Specifically, we design a sample selection approach that takes only clean samples verified by the teacher model without imposing restrictions on the power of generating diverse samples. Through extensive experiments, we show that our method successfully achieves both robustness and training stability across various teacher models, while outperforming the existing DFKD methods.



Paperid:1665
Authors:Sangwoo Shin, Minjong Yoo, Jeongwoo Lee, Honguk Woo
Department of Computer Science and Engineering Sungkyunkwan University, Department of Computer Science and Engineering Sungkyunkwan University, Department of Computer Science and Engineering Sungkyunkwan University, Department of Computer Science and Engineering Sungkyunkwan University
Abstract:
This work explores the zero-shot adaptation capability of semantic skills, semantically interpretable experts' behavior patterns, in cross-domain settings, where a user input in interleaved multi-modal snippets can prompt a new long-horizon task for different domains. In these cross-domain settings, we present a semantic skill translator framework SemTra which utilizes a set of multi-modal models to extract skills from the snippets, and leverages the reasoning capabilities of a pretrained language model to adapt these extracted skills to the target domain. The framework employs a two-level hierarchy for adaptation: task adaptation and skill adaptation. During task adaptation, seq-to-seq translation by the language model transforms the extracted skills into a semantic skill sequence, which is tailored to fit the cross-domain contexts. Skill adaptation focuses on optimizing each semantic skill for the target domain context, through parametric instantiations that are facilitated by language prompting and contrastive learning-based context inferences. This hierarchical adaptation empowers the framework to not only infer a complex task specification in one-shot from the interleaved multi-modal snippets, but also adapt it to new domains with zero-shot learning abilities. We evaluate our framework with Meta-World, Franka Kitchen, RLBench, and CARLA environments. The results clarify the framework's superiority in performing long-horizon tasks and adapting to different domains, showing its broad applicability in practical use cases, such as cognitive robots interpreting abstract instructions and autonomous vehicles operating under varied configurations.



Paperid:1666
Authors:Debaditya Shome, Pritam Sarkar, Ali Etemad
Queen’s University, Canada, Queen’s University, Canada, Queen’s University, Canada
Abstract:
The high prevalence of cardiovascular diseases (CVDs) calls for accessible and cost-effective continuous cardiac monitoring tools. Despite Electrocardiography (ECG) being the gold standard, continuous monitoring remains a challenge, leading to the exploration of Photoplethysmography (PPG), a promising but more basic alternative available in consumer wearables. This notion has recently spurred interest in translating PPG to ECG signals. In this work, we introduce Region-Disentangled Diffusion Model (RDDM), a novel diffusion model designed to capture the complex temporal dynamics of ECG. Traditional Diffusion models like Denoising Diffusion Probabilistic Models (DDPM) face challenges in capturing such nuances due to the indiscriminate noise addition process across the entire signal. Our proposed RDDM overcomes such limitations by incorporating a novel forward process that selectively adds noise to specific regions of interest (ROI) such as QRS complex in ECG signals, and a reverse process that disentangles the denoising of ROI and non-ROI regions. Quantitative experiments demonstrate that RDDM can generate high-fidelity ECG from PPG in as few as 10 diffusion steps, making it highly effective and computationally efficient. Additionally, to rigorously validate the usefulness of the generated ECG signals, we introduce CardioBench, a comprehensive evaluation benchmark for a variety of cardiac-related tasks including heart rate and blood pressure estimation, stress classification, and the detection of atrial fibrillation and diabetes. Our thorough experiments show that RDDM achieves state-of-the-art performance on CardioBench. To the best of our knowledge, RDDM is the first diffusion model for cross-modal signal-to-signal translation in the bio-signal domain.



Paperid:1667
Authors:Chongjie Si, Zekun Jiang, Xuehui Wang, Yan Wang, Xiaokang Yang, Wei Shen
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, East China Normal University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
In partial label learning (PLL), each instance is associated with a set of candidate labels among which only one is ground-truth. The majority of the existing works focuses on constructing robust classifiers to estimate the labeling confidence of candidate labels in order to identify the correct one. However, these methods usually struggle to rectify mislabeled samples. To help existing PLL methods identify and rectify mislabeled samples, in this paper, we introduce a novel partner classifier and propose a novel "mutual supervision" paradigm. Specifically, we instantiate the partner classifier predicated on the implicit fact that non-candidate labels of a sample should not be assigned to it, which is inherently accurate and has not been fully investigated in PLL. Furthermore, a novel collaborative term is formulated to link the base classifier and the partner one. During each stage of mutual supervision, both classifiers will blur each other's predictions through a blurring mechanism to prevent overconfidence in a specific label. Extensive experiments demonstrate that the performance and disambiguation ability of several well-established stand-alone and deep-learning based PLL approaches can be significantly improved by coupling with this learning paradigm.
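The partner classifier rests on the observation that non-candidate labels are known to be wrong. One simple way to turn that fact into a training signal is to penalize any probability mass placed outside the candidate set, as in the sketch below; this is an assumed loss form for illustration, not the paper's exact partner-classifier objective or its blurring mechanism.

```python
import torch
import torch.nn.functional as F

def non_candidate_loss(logits, candidate_mask, eps=1e-12):
    """Penalize probability mass on labels outside the candidate set
    (assumed form of non-candidate supervision)."""
    probs = F.softmax(logits, dim=1)
    non_candidate_mass = (probs * (1.0 - candidate_mask)).sum(dim=1)
    return -torch.log(1.0 - non_candidate_mass + eps).mean()

# Example: 2 samples, 5 classes, candidate sets {0, 2} and {1, 3, 4}.
logits = torch.randn(2, 5, requires_grad=True)
candidate_mask = torch.tensor([[1., 0., 1., 0., 0.],
                               [0., 1., 0., 1., 1.]])
loss = non_candidate_loss(logits, candidate_mask)
loss.backward()
```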



Paperid:1668
Authors:Tareq Si Salem, Gözde Özcan, Iasonas Nikolaou, Evimaria Terzi, Stratis Ioannidis
Northeastern University, Northeastern University, Boston University, Boston University, Northeastern University
Abstract:
We study monotone submodular maximization under general matroid constraints in the online setting. We prove that online optimization of a large class of submodular functions, namely, threshold potential functions, reduces to online convex optimization (OCO). This is precisely because functions in this class admit a concave relaxation; as a result, OCO policies, coupled with an appropriate rounding scheme, can be used to achieve sublinear regret in the combinatorial setting. We also show that our reduction extends to many different versions of the online learning problem, including the dynamic regret, bandit, and optimistic-learning settings.



Paperid:1669
Authors:Kun Song, Ruben Solozabal, Hao Li, Martin Takáč, Lu Ren, Fakhri Karray
Mohamed bin Zayed University of Artificial Intelligence, Mohamed bin Zayed University of Artificial Intelligence, Anhui University, Mohamed bin Zayed University of Artificial Intelligence, Anhui University, Mohamed bin Zayed University of Artificial Intelligence
Abstract:
In this paper, we find that the training of Normalizing Flows (NFs) is easily affected by outliers and by a small number (or high dimensionality) of training samples. To solve this problem, we propose a Kullback–Leibler (KL) divergence regularization on the Jacobian matrix of NFs. We prove that such regularization is equivalent to adding a set of samples whose covariance matrix is the identity matrix to the training set. Thus, it reduces the negative influence of the outliers and the small sample number on the estimation of the covariance matrix, simultaneously. Therefore, our regularization makes the training of NFs robust. Ultimately, we evaluate the performance of NFs on out-of-distribution (OoD) detection tasks. The excellent results obtained demonstrate the effectiveness of the proposed regularization term. For example, with the help of the proposed regularization, the OoD detection score increases by up to 30% compared with the model trained without the regularization.



Paperid:1670
Authors:Xiang Song, Yuhang He, Songlin Dong, Yihong Gong
School of Software Engineering, Xi'an Jiaotong University, College of Artificial Intelligence, Xi'an Jiaotong University, College of Artificial Intelligence, Xi'an Jiaotong University, College of Artificial Intelligence, Xi'an Jiaotong University
Abstract:
Domain incremental object detection (DIOD) aims to gradually learn a unified object detection model from a dataset stream composed of different domains, achieving good performance in all encountered domains. The most critical obstacle to this goal is the catastrophic forgetting problem, where the performance of the model improves rapidly in new domains but deteriorates sharply in old ones after a few sessions. To address this problem, we propose a non-exemplar DIOD method named learning domain bias (LDB), which learns domain bias independently at each new session, avoiding saving examples from old domains. Concretely, a base model is first obtained through training during session 1. Then, LDB freezes the weights of the base model and trains individual domain bias for each new incoming domain, adapting the base model to the distribution of new domains. At test time, since the domain ID is unknown, we propose a domain selector based on nearest mean classifier (NMC), which selects the most appropriate domain bias for a test image. Extensive experimental evaluations on two series of datasets demonstrate the effectiveness of the proposed LDB method in achieving high accuracy on new and old domain datasets. The code is available at https://github.com/SONGX1997/LDB.



Paperid:1671
Authors:Bharat Srikishan, Anika Tabassum, Srikanth Allu, Ramakrishnan Kannan, Nikhil Muralidhar
Stevens Institute of Technology, Oak Ridge National Laboratory, Oak Ridge National Laboratory, Oak Ridge National Laboratory, Stevens Institute of Technology
Abstract:
Deep learning architectures have achieved state-of-the-art (SOTA) performance on computer vision tasks such as object detection and image segmentation. This may be attributed to the use of over-parameterized, monolithic deep learning architectures executed on large datasets. Although such large architectures lead to increased accuracy, this is usually accompanied by a larger increase in computation and memory requirements during inference. While this is a non-issue in traditional machine learning (ML) pipelines, the recent confluence of machine learning and fields like the Internet of Things (IoT) has rendered such large architectures infeasible for execution in low-resource settings. For some datasets, large monolithic pipelines may be overkill for simpler inputs. To address this problem, previous efforts have proposed decision cascades where inputs are passed through models of increasing complexity until desired performance is achieved. However, we argue that cascaded prediction leads to sub-optimal throughput and increased computational cost due to wasteful intermediate computations. To address this, we propose PaSeR (Parsimonious Segmentation with Reinforcement Learning), a non-cascading, cost-aware learning pipeline as an efficient alternative to cascaded decision architectures. Through experimental evaluation on both real-world and standard datasets, we demonstrate that PaSeR achieves better accuracy while minimizing computational cost relative to cascaded models. Further, we introduce a new metric, IoU/GigaFlop, to evaluate the balance between cost and performance. On the real-world task of battery material phase segmentation, PaSeR yields a minimum performance improvement of 174% on the IoU/GigaFlop metric with respect to baselines. We also demonstrate PaSeR's adaptability to complementary models trained on a noisy MNIST dataset, where it achieved a minimum performance improvement on IoU/GigaFlop of 13.4% over SOTA models. Code and data are available at github.com/scailab/paser.
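
The IoU/GigaFlop metric named above can be read as segmentation quality normalized by inference cost; a minimal sketch under that reading (the numbers below are illustrative, not results from the paper):

```python
def iou_per_gigaflop(iou, flops):
    """IoU divided by inference cost expressed in GigaFLOPs."""
    return iou / (flops / 1e9)

# A cheap model with slightly lower IoU can score far higher on this metric.
print(iou_per_gigaflop(0.72, 40e9))   # 0.018 IoU per GigaFlop
print(iou_per_gigaflop(0.75, 400e9))  # 0.001875 IoU per GigaFlop
```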



Paperid:1672
Authors:Uri Stern, Daniel Shwartz, Daphna Weinshall
Hebrew University, Hebrew University, Hebrew University
Abstract:
Deep neural networks have become the method of choice for solving many classification tasks, largely because they can fit very complex functions defined over raw data. The downside of such powerful learners is the danger of overfit. In this paper, we introduce a novel ensemble classifier for deep networks that effectively overcomes overfitting by combining models generated at specific intermediate epochs during training. Our method allows for the incorporation of useful knowledge obtained by the models during the overfitting phase without deterioration of the general performance, which is usually missed when early stopping is used. To motivate this approach, we begin with the theoretical analysis of a regression model, whose prediction, that the variance among classifiers increases when overfit occurs, is demonstrated empirically in deep networks in common use. Guided by these results, we construct a new ensemble-based prediction method, where the prediction is determined by the class that attains the most consensual prediction throughout the training epochs. Using multiple image and text classification datasets, we show that when regular ensembles suffer from overfit, our method eliminates the harmful reduction in generalization due to overfit, and often even surpasses the performance obtained by early stopping. Our method is easy to implement and can be integrated with any training scheme and architecture, without additional prior knowledge beyond the training set. It is thus a practical and useful tool to overcome overfit.
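
A minimal sketch of the consensus rule described above, under the simplifying assumption that the "most consensual prediction" is the class predicted most often by checkpoints saved at different epochs (the paper's exact rule may weight epochs differently):

```python
import numpy as np

def consensus_predict(per_epoch_logits):
    """per_epoch_logits: (n_epochs, n_samples, n_classes) array of checkpoint outputs."""
    per_epoch_preds = per_epoch_logits.argmax(axis=-1)          # (n_epochs, n_samples)
    n_classes = per_epoch_logits.shape[-1]
    # Count, per sample, how many checkpoints voted for each class.
    votes = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes), 0, per_epoch_preds
    )                                                            # (n_classes, n_samples)
    return votes.argmax(axis=0)                                  # (n_samples,)
```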



Paperid:1673
Authors:Cong Su, Guoxian Yu, Jun Wang, Hui Li, Qingzhong Li, Han Yu
Shandong University, Shandong University, Shandong University, Shandong University, Shandong University, Nanyang Technological University (NTU)
Abstract:
Federated learning (FL) has emerged as a promising collaborative and secure paradigm for training a model from decentralized data without compromising privacy. Group fairness and client fairness are two dimensions of fairness that are important for FL. Standard FL can result in disproportionate disadvantages for certain clients, and it still faces the challenge of treating different groups equitably in a population. The problem of privately training fair FL models without compromising the generalization capability of disadvantaged clients remains open. In this paper, we propose a method, called mFairFL, to address this problem and achieve group fairness and client fairness simultaneously. mFairFL leverages differential multipliers to construct an optimization objective for empirical risk minimization with fairness constraints. Before aggregating locally trained models, it first detects conflicts among their gradients, and then iteratively curates the direction and magnitude of gradients to mitigate these conflicts. Theoretical analysis proves that mFairFL facilitates fairness in model development. The experimental evaluations based on three benchmark datasets show significant advantages of mFairFL compared to seven state-of-the-art baselines.
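
One common way to realize the conflict detection and curation step sketched above is to treat a negative inner product between two client gradients as a conflict and project one gradient onto the normal plane of the other (a PCGrad-style rule). mFairFL's exact curation procedure may differ; the snippet below only illustrates the general idea, with hypothetical names.

```python
import numpy as np

def resolve_conflicts(grads):
    """grads: list of 1-D client gradient vectors; returns curated gradients."""
    curated = [g.copy() for g in grads]
    for i, gi in enumerate(curated):
        for j, gj in enumerate(grads):
            if i == j:
                continue
            dot = gi @ gj
            if dot < 0:  # negative inner product: the two updates conflict
                gi -= dot / (np.linalg.norm(gj) ** 2 + 1e-12) * gj
    return curated
```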



Paperid:1674
Authors:Houcheng Su, Weihao Luo, Daixian Liu, Mengzhu Wang, Jing Tang, Junyang Chen, Cong Wang, Zhenghan Chen
University of Macau, Donghua University, Sichuan Agricultural University, Hebei University of Technology, Hebei University of Technology, Shenzhen University, The Hong Kong Polytechnic University, Microsoft
Abstract:
Domain Generalization (DG) aims to improve the generalization ability of models trained on a specific group of source domains, enabling them to perform well on new, unseen target domains. Recent studies have shown that methods that converge to smooth optima can enhance the generalization performance of supervised learning tasks such as classification. In this study, we examine the impact of smoothness-enhancing formulations on domain adversarial training, which combines task loss and adversarial loss objectives. Our approach leverages the fact that converging to a smooth minimum with respect to task loss can stabilize the task loss and lead to better performance on unseen domains. Furthermore, we recognize that the distribution of objects in the real world often follows a long-tailed class distribution, resulting in a mismatch between machine learning models and our expectations of their performance on all classes of datasets with long-tailed class distributions. To address this issue, we consider the domain generalization problem from the perspective of the long-tail distribution and propose using the maximum square loss to balance different classes, which can improve model generalizability. Our method's effectiveness is demonstrated through comparisons with state-of-the-art methods on various domain generalization datasets. Code: https://github.com/bamboosir920/SAMALTDG.
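
For reference, a hedged sketch of a maximum-square loss on softmax probabilities, as used in prior domain-adaptation work; the class-balancing variant in this paper may differ in its weighting. Minimizing the negative mean of squared probabilities encourages confident predictions while growing more slowly than entropy minimization on already-confident (often head-class) samples.

```python
import torch

def maximum_square_loss(logits):
    """logits: (batch, num_classes) unnormalized scores."""
    probs = torch.softmax(logits, dim=1)
    return -0.5 * (probs ** 2).sum(dim=1).mean()
```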



Paperid:1675
Authors:Jiayang Su, Junbo Ma, Songyang Tong, Enze Xu, Minghan Chen
Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541000, China, School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China Lishui Institute of Hangzhou Dianzi University, Hangzhou Dianzi University, Hangzhou 310018, China, Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University, Guilin 541000, China, Department of Computer Science, Wake Forest University, NC 27109, USA, Department of Computer Science, Wake Forest University, NC 27109, USA
Abstract:
In biochemical modeling, some foundational systems can exhibit sudden and profound behavioral shifts, such as the cellular signaling pathway models, in which the physiological responses promptly react to environmental changes, resulting in steep changes in their dynamic model trajectories. These steep changes are one of the major challenges in biochemical modeling governed by nonlinear differential equations. One promising way to tackle this challenge is converting the input data from the time domain to the frequency domain through Fourier Neural Operators, which enhances the ability to analyze data periodicity and regularity. However, the effectiveness of these Fourier-based methods diminishes in scenarios with complex abrupt switches. To address this limitation, an innovative Multiscale Attention Wavelet Neural Operator (MAWNO) method is proposed in this paper, which comprehensively combines the attention mechanism with the versatile wavelet transforms to effectively capture these abrupt switches. Specifically, the wavelet transform scrutinizes data across multiple scales to extract the characteristics of abrupt signals into wavelet coefficients, while the self-attention mechanism is adeptly introduced to enhance the wavelet coefficients in high-frequency signals that can better characterize the abrupt switches. Experimental results substantiate MAWNO’s superiority in terms of accuracy on three classical biochemical models featuring periodic and steep trajectories. The code is available at https://github.com/SUDERS/MAWNO.



Paperid:1676
Authors:Junhao Su, Zhenghan Chen, Chenghao He, Dongzhi Guan, Changpeng Cai, Tongxi Zhou, Jiashen Wei, Wenhua Tian, Zhihuai Xie
Southeast University, Peking University, East China University of Science and Technology, Southeast University, Southeast University, Institute of Automation Chinese Academy of Sciences, Fudan University, Southeast University, Tsinghua University
Abstract:
Lane detection is the cornerstone of autonomous driving. Although existing methods have achieved promising results, there are still limitations in addressing challenging scenarios such as abnormal weather, occlusion, and curves. These low-visibility scenarios usually require relying on the broad information of the entire scene, provided by global semantics and local texture information, to predict the precise position and shape of the lane lines. In this paper, we propose a Global Semantic Enhancement Network for lane detection, which involves a complete set of systems for feature extraction and global feature transmission. Traditional methods for global feature extraction usually require deep stacks of convolution layers. However, this approach of obtaining global features solely through a larger receptive field not only fails to capture precise global features but also leads to an overly deep model, which results in slow inference speed. To address these challenges, we propose a novel operation called the Global feature Extraction Module (GEM). Additionally, we introduce the Top Layer Auxiliary Module (TLAM) as a channel for feature distillation, which facilitates a bottom-up transmission of global features. Furthermore, we introduce two novel loss functions: the Angle Loss, which accounts for the angle between predicted and ground-truth lanes, and the Generalized Line IoU Loss, which handles scenarios where significant deviations occur between predicted lanes and the ground truth under harsh conditions. The experimental results reveal that the proposed method exhibits remarkable superiority over current state-of-the-art techniques for lane detection. Our code is available at https://github.com/crystal250/GSENet.



Paperid:1677
Authors:Shangchao Su, Mingzhao Yang, Bin Li, Xiangyang Xue
Fudan University, Fudan University, Fudan University, Fudan University
Abstract:
Federated learning (FL) enables multiple clients to collaboratively train a global model without disclosing their data. Previous research often requires training the complete set of model parameters. However, the emergence of powerful pretrained models makes it possible to achieve higher performance with fewer learnable parameters in FL. In this paper, we propose a federated adaptive prompt tuning algorithm, FedAPT, for multi-domain collaborative image classification with powerful foundation models, like CLIP. Compared with direct federated prompt tuning, our core idea is to adaptively unlock specific domain knowledge for each test sample in order to provide them with personalized prompts. To implement this idea, we design an adaptive prompt tuning module, which consists of a meta prompt, an adaptive network, and some keys. The server randomly generates a set of keys and assigns a unique key to each client. Then all clients cooperatively train the global adaptive network and meta prompt with the local datasets and the frozen keys. Ultimately, the global aggregation model can assign a personalized prompt to CLIP based on the domain features of each test sample. We perform extensive experiments on two multi-domain image classification datasets across two different settings -- supervised and unsupervised. The results show that FedAPT can achieve better performance with less than 10% of the number of parameters of the fully trained model, and the global model can perform well in diverse client domains simultaneously.



Paperid:1678
Authors:Yongyi Su, Xun Xu, Kui Jia
South China University of Technology, Institute for Infocomm Research, A*STAR South China University of Technology, School of Data Science, The Chinese University of Hong Kong, Shenzhen
Abstract:
Test-Time Adaptation aims to adapt a source domain model to testing data at inference stage, with success demonstrated in adapting to unseen corruptions. However, these attempts may fail under more challenging real-world scenarios. Existing works mainly consider real-world test-time adaptation under non-i.i.d. data stream and continual domain shift. In this work, we first complement the existing real-world TTA protocol with a globally class imbalanced testing set. We demonstrate that combining all settings together poses new challenges to existing methods. We argue the failure of state-of-the-art methods is first caused by indiscriminately adapting normalization layers to imbalanced testing data. To remedy this shortcoming, we propose a balanced batchnorm layer to swap out the regular batchnorm at inference stage. The new batchnorm layer is capable of adapting without biasing towards majority classes. We are further inspired by the success of self-training (ST) in learning from unlabeled data and adapt ST for test-time adaptation. However, ST alone is prone to over-adaptation, which is responsible for the poor performance under continual domain shift. Hence, we propose to improve self-training under continual domain shift by regularizing model updates with an anchored loss. The final TTA model, termed TRIBE, is built upon a tri-net architecture with balanced batchnorm layers. We evaluate TRIBE on four datasets representing real-world TTA settings. TRIBE consistently achieves the state-of-the-art performance across multiple evaluation protocols. The code is available at https://github.com/Gorilla-Lab-SCUT/TRIBE.



Paperid:1679
Authors:Zixian Su, Jingwei Guo, Kai Yao, Xi Yang, Qiufeng Wang, Kaizhu Huang
School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China Faculty of Science and Engineering, University of Liverpool, Liverpool, the United Kingdom, School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China Faculty of Science and Engineering, University of Liverpool, Liverpool, the United Kingdom, School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China Faculty of Science and Engineering, University of Liverpool, Liverpool, the United Kingdom, School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China, School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China, Data Science Research Center, Duke Kunshan University, Kunshan, China
Abstract:
While recent test-time adaptations exhibit efficacy by adjusting batch normalization to narrow domain disparities, their effectiveness diminishes with realistic mini-batches due to inaccurate target estimation. As previous attempts merely introduce source statistics to mitigate this issue, the fundamental problem of inaccurate target estimation still persists, leaving the intrinsic test-time domain shifts unresolved. This paper delves into the problem of mini-batch degradation. By unraveling batch normalization, we discover that the inexact target statistics largely stem from the substantially reduced class diversity in batch. Drawing upon this insight, we introduce a straightforward tool, Test-time Exponential Moving Average (TEMA), to bridge the class diversity gap between training and testing batches. Importantly, our TEMA adaptively extends the scope of typical methods beyond the current batch to incorporate a diverse set of class information, which in turn yields a more accurate target estimation. Built upon this foundation, we further design a novel layer-wise rectification strategy to consistently promote test-time performance. Our proposed method enjoys a unique advantage as it requires neither training nor tuning parameters, offering a truly hassle-free solution. It significantly enhances model robustness against shifted domains and maintains resilience in diverse real-world scenarios with various batch sizes, achieving state-of-the-art performance on several major benchmarks. Code is available at https://github.com/kiwi12138/RealisticTTA.
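
A minimal sketch of the general idea of test-time exponential moving averages over batch statistics: rather than normalizing with the (possibly class-poor) current batch alone, accumulate mean and variance across test batches. The class name, momentum value, and update form are illustrative assumptions, not TEMA's exact formulation.

```python
import torch

class TestTimeEMAStats:
    def __init__(self, num_features, momentum=0.1):
        self.mean = torch.zeros(num_features)
        self.var = torch.ones(num_features)
        self.momentum = momentum

    def update_and_normalize(self, x, eps=1e-5):
        """x: (batch, num_features) activations from one test batch."""
        batch_mean, batch_var = x.mean(dim=0), x.var(dim=0, unbiased=False)
        self.mean = (1 - self.momentum) * self.mean + self.momentum * batch_mean
        self.var = (1 - self.momentum) * self.var + self.momentum * batch_var
        return (x - self.mean) / torch.sqrt(self.var + eps)
```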



Paperid:1680
Authors:Chenyu Sun, Hangwei Qian, Chunyan Miao
Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University (NTU), Singapore School of Computer Science and Engineering, NTU, Singapore Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore, Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore, Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University (NTU), Singapore School of Computer Science and Engineering, NTU, Singapore Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore
Abstract:
Offline reinforcement learning (RL) aims to learn an effective policy from a pre-collected dataset. Most existing works focus on developing sophisticated learning algorithms, with less emphasis on improving the data collection process. Moreover, it is even more challenging to extend beyond the single-task setting and collect a task-agnostic dataset that allows an agent to perform multiple downstream tasks. In this paper, we propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to expand feature space using adaptive temporal distances for task-agnostic data collection and ultimately improve learning efficiency and capabilities for multi-task offline RL. To achieve this, CUDC estimates the probability of the k-step future states being reachable from the current states, and adapts how many steps into the future the dynamics model should predict. With this adaptive reachability mechanism in place, the feature representation can be diversified, and the agent can navigate itself to collect higher-quality data with curiosity. Empirically, CUDC surpasses existing unsupervised methods in efficiency and learning performance in various downstream offline RL tasks of the DeepMind control suite.



Paperid:1681
Authors:Chuxiong Sun, Zehua Zang, Jiabao Li, Jiangmeng Li, Xiao Xu, Rui Wang, Changwen Zheng
Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences State Key Laboratory of Intelligent Game, Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences State Key Laboratory of Intelligent Game, State Key Laboratory of Intelligent Game, Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences State Key Laboratory of Intelligent Game University of Chinese Academy of Sciences, Science and Technology on Integrated Information System Laboratory, Institute of Software Chinese Academy of Sciences University of Chinese Academy of Sciences
Abstract:
Communication stands as a potent mechanism to harmonize the behaviors of multiple agents. However, existing work primarily concentrates on broadcast communication, which not only lacks practicality, but also leads to information redundancy. This surplus, one-fits-all information could adversely impact communication efficiency. Furthermore, existing works often resort to basic mechanisms to integrate observed and received information, impairing the learning process. To tackle these difficulties, we propose Targeted and Trusted Multi-Agent Communication (T2MAC), a straightforward yet effective method that enables agents to learn selective engagement and evidence-driven integration. With T2MAC, agents have the capability to craft individualized messages, pinpoint ideal communication windows, and engage with reliable partners, thereby refining communication efficiency. Following the reception of messages, the agents integrate information observed and received from different sources at an evidence level. This process enables agents to collectively use evidence garnered from multiple perspectives, fostering trusted and cooperative behaviors. We evaluate our method on a diverse set of cooperative multi-agent tasks, with varying difficulties, involving different scales and ranging from Hallway, MPE to SMAC. The experiments indicate that the proposed model not only surpasses the state-of-the-art methods in terms of cooperative performance and communication efficiency, but also exhibits impressive generalization.



Paperid:1682
Authors:Jianhui Sun, Xidong Wu, Heng Huang, Aidong Zhang
Computer Science, University of Virginia, VA, USA, Electrical and Computer Engineering, University of Pittsburgh, PA, USA, Computer Science, University of Maryland College Park, MD, USA, Computer Science, University of Virginia, VA, USA
Abstract:
Federated Averaging (FedAvg) is known to experience convergence issues when encountering significant client system heterogeneity and data heterogeneity. Server momentum has been proposed as an effective mitigation. However, existing server momentum works are restrictive in the momentum formulation, do not properly schedule hyperparameters, and focus only on system-homogeneous settings, which leaves the role of server momentum an underexplored problem. In this paper, we propose a general framework for server momentum, that (a) covers a large class of momentum schemes that are unexplored in federated learning (FL), (b) enables a popular stagewise hyperparameter scheduler, (c) allows heterogeneous and asynchronous local computing. We provide rigorous convergence analysis for the proposed framework. To the best of our knowledge, this is the first work that thoroughly analyzes the performance of server momentum with a hyperparameter scheduler and system heterogeneity. Extensive experiments validate the effectiveness of our proposed framework. Due to the page limit, we leave all proofs to the full version: https://arxiv.org/abs/2312.12670.
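
For orientation, a hedged sketch of one simple instance of server momentum (a heavy-ball style update on top of FedAvg aggregation); the paper's framework covers a much larger family of momentum schemes, schedulers, and asynchronous local computation than this single example.

```python
import torch

def server_momentum_step(global_params, client_deltas, buffer, lr=1.0, beta=0.9):
    """client_deltas: list of per-client parameter updates (same shape as global_params)."""
    avg_delta = torch.stack(client_deltas).mean(dim=0)   # FedAvg-style aggregation
    buffer = beta * buffer + avg_delta                    # server momentum buffer
    new_params = global_params + lr * buffer              # apply the momentum-smoothed update
    return new_params, buffer
```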



Paperid:1683
Authors:Jun Sun, Xinxin Zhang, Shoukang Han, Yu-Ping Ruan, Taihao Li
Institute of Artificial Intelligence, Zhejiang Lab, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Institute of Artificial Intelligence, Zhejiang Lab, Institute of Artificial Intelligence, Zhejiang Lab, Institute of Artificial Intelligence, Zhejiang Lab
Abstract:
Multimodal learning is susceptible to modality missing, which poses a major obstacle for its practical applications and, thus, invigorates increasing research interest. In this paper, we investigate two challenging problems: 1) when modality missing exists in the training data, how to exploit the incomplete samples while guaranteeing that they are properly supervised? 2) when the missing rates of different modalities vary, causing or exacerbating the imbalance among modalities, how to address the imbalance and ensure all modalities are well-trained? To tackle these two challenges, we first introduce the variational information bottleneck (VIB) method for the cross-modal representation learning of missing modalities, which capitalizes on the available modalities and the labels as supervision. Then, accounting for the imbalanced missing rates, we define relative advantage to quantify the advantage of each modality over others. Accordingly, a bi-level optimization problem is formulated to adaptively regulate the supervision of all modalities during training. As a whole, the proposed approach features Relative advantage aware Cross-modal representation learning (abbreviated as RedCore) for missing modalities with imbalanced missing rates. Extensive empirical results demonstrate that RedCore outperforms competing models in that it exhibits superior robustness against either large or imbalanced missing rates. The code is available at: https://github.com/sunjunaimer/RedCore.



Paperid:1684
Authors:Yuan Sun, Jian Dai, Zhenwen Ren, Yingke Chen, Dezhong Peng, Peng Hu
Sichuan University National Innovation Center for UHD Video Technology, Tsinghua University, Southwest University of Science and Technology, Northumbria University, Sichuan University National Innovation Center for UHD Video Technology, Sichuan University
Abstract:
Cross-modal hashing (CMH) is an efficient technique to retrieve relevant data across different modalities, such as images, texts, and videos, which has attracted more and more attention due to its low storage cost and fast query speed. Although existing CMH methods have made remarkable progress, almost all of them treat all samples of varying difficulty levels without discrimination, thus leaving them vulnerable to noise or outliers. Based on this observation, we reveal and study dual difficulty levels implied in cross-modal hashing learning, i.e., instance-level and feature-level difficulty. To address this problem, we propose a novel Dual Self-Paced Cross-Modal Hashing (DSCMH) that mimics human cognitive learning to learn hashing from "easy" to "hard" at both instance and feature levels, thereby embracing robustness against noise/outliers. Specifically, our DSCMH assigns weights to each instance and feature to measure their difficulty or reliability, and then uses these weights to automatically filter out the noisy and irrelevant data points in the original space. By gradually increasing the weights during training, our method can focus on more instances and features from "easy" to "hard" in training, thus mitigating the adverse effects of noise or outliers. Extensive experiments are conducted on three widely-used benchmark datasets to demonstrate the effectiveness and robustness of the proposed DSCMH over 12 state-of-the-art CMH methods.
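
A hedged sketch of classic hard self-paced weighting, the basic mechanism behind "easy"-to-"hard" scheduling: items whose current loss is below an age parameter lambda receive weight 1, the rest 0, and lambda grows over training so harder items are admitted gradually. DSCMH applies this idea at both the instance and the feature level and may use a softer weighting scheme.

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Hard self-paced weights: admit items whose loss is below the age parameter lam."""
    return (losses < lam).astype(float)

losses = np.array([0.2, 0.9, 1.7, 3.0])
for lam in (0.5, 1.0, 2.0, 4.0):           # schedule lambda from "easy" to "hard"
    print(lam, self_paced_weights(losses, lam))
```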



Paperid:1685
Authors:Yuewen Sun, Erli Wang, Biwei Huang, Chaochao Lu, Lu Feng, Changyin Sun, Kun Zhang
Mohamed bin Zayed University of Artificial Intelligence Carnegie Mellon University, NEC Labs, China, University of California San Diego, Shanghai AI Laboratory, NEC Labs, China, Anhui University, Mohamed bin Zayed University of Artificial Intelligence Carnegie Mellon University
Abstract:
Data augmentation plays a crucial role in improving the data efficiency of reinforcement learning (RL). However, the generation of high-quality augmented data remains a significant challenge. To overcome this, we introduce ACAMDA (Adversarial Causal Modeling for Data Augmentation), a novel framework that integrates two causality-based tasks: causal structure recovery and counterfactual estimation. The unique aspect of ACAMDA lies in its ability to recover temporal causal relationships from limited non-expert datasets. The identification of the sequential cause-and-effect allows the creation of realistic yet unobserved scenarios. We utilize this characteristic to generate guided counterfactual datasets, which, in turn, substantially reduces the need for extensive data collection. By simulating various state-action pairs under hypothetical actions, ACAMDA enriches the training dataset for diverse and heterogeneous conditions. Our experimental evaluation shows that ACAMDA outperforms existing methods, particularly when applied to novel and unseen domains.



Paperid:1686
Authors:David Sychrovský, Michal Šustr, Elnaz Davoodi, Michael Bowling, Marc Lanctot, Martin Schmid
Charles University Czech Technical University, Czech Technical University EquiLibre Technologies, DeepMind, University of Alberta, DeepMind, Charles University EquiLibre Technologies
Abstract:
The literature on game-theoretic equilibrium finding predominantly focuses on single games or their repeated play. Nevertheless, numerous real-world scenarios feature playing a game sampled from a distribution of similar, but not identical games, such as playing poker with different public cards or trading correlated assets on the stock market. As these similar games feature similar equilibria, we investigate a way to accelerate equilibrium finding on such a distribution. We present a novel ``learning not to regret'' framework, enabling us to meta-learn a regret minimizer tailored to a specific distribution. Our key contribution, Neural Predictive Regret Matching, is uniquely meta-learned to converge rapidly for the chosen distribution of games, while having regret minimization guarantees on any game. We validated our algorithms' faster convergence on a distribution of river poker games. Our experiments show that the meta-learned algorithms outpace their non-meta-learned counterparts, achieving more than tenfold improvements.
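
For background, a minimal sketch of plain regret matching, the classic regret minimizer that the meta-learned variant above builds on: each action is played with probability proportional to its positive cumulative regret (falling back to uniform when no regret is positive).

```python
import numpy as np

def regret_matching_strategy(cumulative_regret):
    """Map a vector of cumulative regrets to a mixed strategy."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total <= 0:                                 # no positive regret: play uniformly
        return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))
    return positive / total

print(regret_matching_strategy(np.array([2.0, -1.0, 1.0])))  # [2/3, 0, 1/3]
```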



Paperid:1687
Authors:Shoichiro Takeda, Yasunori Akagi, Naoki Marumo, Kenta Niwa
NTT Corporation, NTT Corporation, NTT Corporation, NTT Corporation
Abstract:
We propose novel fast algorithms for optimal transport (OT) utilizing a cyclic symmetry structure of input data. Such OT with cyclic symmetry appears universally in various real-world examples: image processing, urban planning, and graph processing. Our main idea is to reduce OT to a small optimization problem that has significantly fewer variables by utilizing cyclic symmetry and various optimization techniques. On the basis of this reduction, our algorithms solve the small optimization problem instead of the original OT. As a result, our algorithms obtain the optimal solution and the objective function value of the original OT faster than solving the original OT directly. In this paper, our focus is on two crucial OT formulations: the linear programming OT (LOT) and the strongly convex-regularized OT, which includes the well-known entropy-regularized OT (EROT). Experiments show the effectiveness of our algorithms for LOT and EROT in synthetic/real-world data that has a strict/approximate cyclic symmetry structure. Through theoretical and experimental results, this paper successfully introduces the concept of symmetry into the OT research field for the first time.
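
As background only, a standard Sinkhorn iteration for the entropy-regularized OT (EROT) formulation named above; the paper's speed-up comes from shrinking the problem via cyclic symmetry before solving, which this plain sketch does not include.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.1, n_iters=500):
    """a, b: source/target marginals; cost: (n, m) cost matrix; eps: entropy regularization."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]        # approximate transport plan
    return plan, (plan * cost).sum()          # plan and transport cost
```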



Paperid:1688
Authors:Cheng Tan, Zhangyang Gao, Lirong Wu, Jun Xia, Jiangbin Zheng, Xihong Yang, Yue Liu, Bozhen Hu, Stan Z. Li
Zhejiang University AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University AI Lab, Research Center for Industries of the Future, Westlake University, Zhejiang University AI Lab, Research Center for Industries of the Future, Westlake University, College of Computer, National University of Defense Technology, College of Computer, National University of Defense Technology, Zhejiang University AI Lab, Research Center for Industries of the Future, Westlake University, AI Lab, Research Center for Industries of the Future, Westlake University
Abstract:
Antibodies are crucial proteins produced by the immune system in response to foreign substances or antigens. The specificity of an antibody is determined by its complementarity-determining regions (CDRs), which are located in the variable domains of the antibody chains and form the antigen-binding site. Previous studies have utilized complex techniques to generate CDRs, but they suffer from inadequate geometric modeling. Moreover, the common iterative refinement strategies lead to an inefficient inference. In this paper, we propose a simple yet effective model that can co-design 1D sequences and 3D structures of CDRs in a one-shot manner. To achieve this, we decouple the antibody CDR design problem into two stages: (i) geometric modeling of protein complex structures and (ii) sequence-structure co-learning. We develop a novel macromolecular structure invariant embedding, typically for protein complexes, that captures both intra- and inter-component interactions among the backbone atoms, including Cα, N, C, and O atoms, to achieve comprehensive geometric modeling. Then, we introduce a simple cross-gate MLP for sequence-structure co-learning, allowing sequence and structure representations to implicitly refine each other. This enables our model to design desired sequences and structures in a one-shot manner. Extensive experiments are conducted to evaluate our results at both the sequence and structure level, which demonstrate that our model achieves superior performance compared to the state-of-the-art antibody CDR design methods.



Paperid:1689
Authors:Shanli Tan, Hao Cheng, Xiaohu Wu, Han Yu, Tiantian He, Yew Soon Ong, Chongjun Wang, Xiaofeng Tao
Beijing University of Posts and Telecommunications, Nanjing University, Beijing University of Posts and Telecommunications, Nanyang Technological University (NTU), Agency for Science, Technology and Research (A*STAR), Nanyang Technological University, Singapore A*STAR, Nanjing University, Beijing University of Posts and Telecommunications
Abstract:
Federated learning (FL) provides a privacy-preserving approach for collaborative training of machine learning models. Given the potential data heterogeneity, it is crucial to select appropriate collaborators for each FL participant (FL-PT) based on data complementarity. Recent studies have addressed this challenge. Similarly, it is imperative to consider the inter-individual relationships among FL-PTs where some FL-PTs engage in competition. Although FL literature has acknowledged the significance of this scenario, practical methods for establishing FL ecosystems remain largely unexplored. In this paper, we extend a principle from the balance theory, namely “the friend of my enemy is my enemy”, to ensure the absence of conflicting interests within an FL ecosystem. The extended principle and the resulting problem are formulated via graph theory and integer linear programming. A polynomial-time algorithm is proposed to determine the collaborators of each FL-PT. The solution guarantees high scalability, allowing even competing FL-PTs to smoothly join the ecosystem without conflict of interest. The proposed framework jointly considers competition and data heterogeneity. Extensive experiments on real-world and synthetic data demonstrate its efficacy compared to five alternative approaches, and its ability to establish efficient collaboration networks among FL-PTs.



Paperid:1690
Authors:Wei Tan, Ngoc Dang Nguyen, Lan Du, Wray Buntine
Monash University, Monash University, Monash University, VinUniversity
Abstract:
Within the scope of natural language processing, the domain of multi-label text classification is uniquely challenging due to its expansive and uneven label distribution. The complexity deepens due to the demand for an extensive set of annotated data for training an advanced deep learning model, especially in specialized fields where the labeling task can be labor-intensive and often requires domain-specific knowledge. Addressing these challenges, our study introduces a novel deep active learning strategy, capitalizing on the Beta family of proper scoring rules within the Expected Loss Reduction framework. It computes the expected increase in scores using the Beta Scoring Rules, which are then transformed into sample vector representations. These vector representations guide the diverse selection of informative samples, directly linking this process to the model's expected proper score. Comprehensive evaluations across both synthetic and real datasets reveal our method's capability to often outperform established acquisition techniques in multi-label text classification, presenting encouraging outcomes across various architectural and dataset scenarios.



Paperid:1691
Authors:Xin Tan, Ce Zhao, Chengliang Liu, Jie Wen, Zhanyan Tang
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen
Abstract:
Recently, multi-view multi-label classification (MvMLC) has received a significant amount of research interest and many methods have been proposed based on the assumptions of view completion and label completion. However, in real-world scenarios, multi-view multi-label data tends to be incomplete due to various uncertainties involved in data collection and manual annotation. As a result, the conventional MvMLC methods fail. In this paper, we propose a new two-stage MvMLC network to solve this incomplete MvMLC issue with partial missing views and missing labels. Different from the existing works, our method attempts to leverage the diverse information from the partially missing data based on the information theory. Specifically, our method aims to minimize task-irrelevant information while maximizing task-relevant information through the principles of information bottleneck theory and mutual information extraction. The first stage of our network involves training view-specific classifiers to concentrate the task-relevant information. Subsequently, in the second stage, the hidden states of these classifiers serve as input for an alignment model, an autoencoder-based mutual information extraction framework, and a weighted fusion classifier to make the final prediction. Extensive experiments performed on five datasets validate that our method outperforms other state-of-the-art methods. Code is available at https://github.com/KevinTan10/TSIEN.



Paperid:1692
Authors:Yuze Tan, Hecheng Cai, Shudong Huang, Shuping Wei, Fan Yang, Jiancheng Lv
Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Nuclear Power Institute of China, Sichuan University Sichuan IotDT Technology Co., Ltd., Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education
Abstract:
The significance of multi-view learning in effectively mitigating the intricate intricacies entrenched within heterogeneous data has garnered substantial attention in recent years. Notwithstanding the favorable achievements showcased by recent strides in this area, a confluence of noteworthy challenges endures. To be specific, a majority of extant methodologies unceremoniously assign weights to data points view-wise. This ineluctably disregards the intrinsic reality that disparate views confer diverse contributions to each individual sample, consequently neglecting the rich wellspring of sample-level structural insights harbored within the dataset. In this paper, we propose an effective Augmented Lagrangian MethOd for fiNe-graineD (ALMOND) multi-view optimization. This innovative approach scrutinizes the interplay among multiple views at the granularity of individual samples, thereby fostering the enhanced preservation of local structural coherence. The Augmented Lagrangian Method (ALM) is elaborately incorporated into our framework, which enables us to achieve an optimal solution without involving an inexplicable intermediate variable as previous methods do. Empirical experiments on multi-view clustering tasks across heterogeneous datasets serve to incontrovertibly showcase the effectiveness of our proposed methodology, corroborating its preeminence over incumbent state-of-the-art alternatives.



Paperid:1693
Authors:Boshi Tang, Zhiyong Wu, Xixin Wu, Qiaochu Huang, Jun Chen, Shun Lei, Helen Meng
Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University, The Chinese University of Hong Kong, Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University, Shenzhen International Graduate School, Tsinghua University, The Chinese University of Hong Kong
Abstract:
Graph neural networks (GNNs) have exhibited impressive performance in modeling graph data as exemplified in various applications. Recently, the GNN calibration problem has attracted increasing attention, especially in cost-sensitive scenarios. Previous work has gained empirical insights into the issue and devised effective approaches for it, but theoretical support still falls short. In this work, we shed light on the relationship between GNN calibration and nodewise similarity via theoretical analysis. A novel calibration framework, named SimCalib, is accordingly proposed to consider similarity between nodes at global and local levels. At the global level, the Mahalanobis distance between the current node and class prototypes is integrated to implicitly consider similarity between the current node and all nodes in the same class. At the local level, the similarity of node representation movement dynamics, quantified by nodewise homophily and relative degree, is considered. Informed by the use of nodewise movement patterns in analyzing nodewise behavior on the over-smoothing problem, we empirically present a possible relationship between over-smoothing and the GNN calibration problem. Experimentally, we discover a correlation between nodewise similarity and model calibration improvement, in alignment with our theoretical results. Additionally, we conduct extensive experiments investigating different design factors and demonstrate the effectiveness of our proposed SimCalib framework for GNN calibration by achieving state-of-the-art performance on 14 out of 16 benchmarks.
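
A hedged sketch of the global-level quantity named above, the Mahalanobis distance between a node embedding and each class prototype under a shared covariance estimate; how SimCalib feeds these distances into the calibration model is not shown, and the names are illustrative.

```python
import torch

def mahalanobis_to_prototypes(z, prototypes, cov):
    """z: (d,) node embedding; prototypes: (C, d) class means; cov: (d, d) shared covariance."""
    cov_inv = torch.linalg.inv(cov + 1e-6 * torch.eye(cov.shape[0]))
    diffs = prototypes - z                                       # (C, d)
    quad = torch.einsum('cd,de,ce->c', diffs, cov_inv, diffs)    # per-class quadratic form
    return torch.sqrt(quad)                                      # (C,) distances
```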



Paperid:1694
Authors:Qiaoyue Tang, Frederick Shpilevskiy, Mathias Lécuyer
University of British Columbia, University of British Columbia, University of British Columbia
Abstract:
The Adam optimizer is a popular choice in contemporary deep learning due to its strong empirical performance. However, we observe that in privacy-sensitive scenarios, the traditional use of Differential Privacy (DP) with the Adam optimizer leads to suboptimal performance on several tasks. We find that this performance degradation is due to a DP bias in Adam's second moment estimator, introduced by the addition of independent noise in the gradient computation to enforce DP guarantees. This DP bias leads to a different scaling for low-variance parameter updates, which is inconsistent with the behavior of non-private Adam and with Adam's sign-descent interpretation. We propose the DP-AdamBC optimization algorithm, which corrects for the bias in the second moment estimation and retrieves the expected behavior of Adam. Empirically, DP-AdamBC significantly improves the optimization performance of DP-Adam by up to 3.5% in final accuracy in image, text, and graph node classification tasks.
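
A minimal sketch of the core correction idea: the second-moment estimate of a DP-noised gradient is inflated by the known variance of the injected Gaussian noise, so that variance can be subtracted before the Adam update. The constants and the exact correction formula used by DP-AdamBC are assumptions for illustration, not reproduced from the paper.

```python
import torch

def corrected_second_moment(v_hat, noise_multiplier, clip_norm, batch_size, floor=1e-12):
    """Subtract the per-parameter variance of the DP noise from Adam's second-moment estimate."""
    noise_var = (noise_multiplier * clip_norm / batch_size) ** 2
    return torch.clamp(v_hat - noise_var, min=floor)  # keep the estimate strictly positive
```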



Paperid:1695
Authors:Shaojie Tang, Jing Yuan
University of Texas at Dallas, University of North Texas
Abstract:
In this paper, we study a fundamental problem in submodular optimization known as sequential submodular maximization. The primary objective of this problem is to select and rank a sequence of items to optimize a group of submodular functions. The existing research on this problem has predominantly concentrated on the monotone setting, assuming that the submodular functions are nondecreasing. However, in various real-world scenarios, like diversity-aware recommendation systems, adding items to an existing set might negatively impact the overall utility. In response, we propose to study this problem with non-monotone submodular functions and develop approximation algorithms for both flexible and fixed length constraints, as well as a special case with identical utility functions. The empirical evaluations further validate the effectiveness of our proposed algorithms in the domain of video recommendations.



Paperid:1696
Authors:Zhenchao Tang, Jiehui Huang, Guanxing Chen, Calvin Yu-Chian Chen
School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University AI for Science (AI4S) - Preferred Program, Peking University Shenzhen Graduate School School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School Department of Medical Research, China Medical University Hospital Department of Bioinformatics and Medical Engineering, Asia University
Abstract:
Motivation: Advances in single-cell measurement techniques provide rich multimodal data, which helps us to explore the life state of cells more deeply. However, multimodal integration, that is, learning joint embeddings from multimodal data, remains an open challenge. The difficulty in integrating unpaired single-cell multimodal data is that different modalities have different feature spaces, which easily leads to information loss in joint embedding. Moreover, few existing methods have fully exploited and fused the information in single-cell multimodal data. Result: In this study, we propose CoVEL, a deep learning method for unsupervised integration of single-cell multimodal data. CoVEL learns single-cell representations from a comprehensive view, including regulatory relationships between modalities, fine-grained representations of cells, and relationships between different cells. The comprehensive view embedding enables CoVEL to remove the gap between modalities while protecting biological heterogeneity. Experimental results on multiple public datasets show that CoVEL is accurate and robust for single-cell multimodal integration. Data availability: https://github.com/shapsider/scintegration.



Paperid:1697
Authors:Zhiwei Tang, Yanmeng Wang, Tsung-Hui Chang
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China Shenzhen Research Institute of Big Data, Shenzhen, China, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China Shenzhen Research Institute of Big Data, Shenzhen, China, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China Shenzhen Research Institute of Big Data, Shenzhen, China
Abstract:
Federated Learning (FL) is a promising privacy-preserving distributed learning paradigm but suffers from high communication cost when training large-scale machine learning models. Sign-based methods, such as SignSGD, have been proposed as a biased gradient compression technique for reducing the communication cost. However, sign-based algorithms could diverge under heterogeneous data, which thus motivated the development of advanced techniques, such as the error-feedback method and stochastic sign-based compression, to fix this issue. Nevertheless, these methods still suffer from slower convergence rates, and none of them allows multiple local SGD updates like FedAvg. In this paper, we propose a novel noisy perturbation scheme with a general symmetric noise distribution for sign-based compression, which not only allows one to flexibly control the bias-variance tradeoff for the compressed gradient, but also provides a unified viewpoint to existing stochastic sign-based methods. More importantly, the proposed scheme enables the development of the very first sign-based FedAvg algorithm (z-SignFedAvg) to accelerate the convergence. Theoretically, we show that z-SignFedAvg achieves a faster convergence rate than existing sign-based methods and, under the uniformly distributed noise, can enjoy the same convergence rate as its uncompressed counterpart. Extensive experiments are conducted to demonstrate that the z-SignFedAvg can achieve competitive empirical performance on real datasets and outperforms existing schemes.
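
A minimal sketch of noisy sign compression as described above: perturb the local update with symmetric noise before taking the sign, so the noise scale controls the bias-variance trade-off of the compressed message. The uniform distribution is one of the symmetric choices mentioned in the abstract; the scale value is the user's to set.

```python
import torch

def noisy_sign(update, noise_scale):
    """Compress a local model update to signs after adding symmetric uniform noise."""
    noise = (torch.rand_like(update) - 0.5) * 2 * noise_scale   # Uniform(-scale, scale)
    return torch.sign(update + noise)
```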



Paperid:1698
Authors:Lue Tao, Yu-Xuan Huang, Wang-Zhou Dai, Yuan Jiang
National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Intelligence Science and Technology, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China
Abstract:
Neurosymbolic hybrid systems are promising for integrating machine learning and symbolic reasoning, where perception models are facilitated with information inferred from a symbolic knowledge base through logical reasoning. Despite empirical evidence showing the ability of hybrid systems to learn accurate perception models, the theoretical understanding of learnability is still lacking. Hence, it remains unclear why a hybrid system succeeds for a specific task and when it may fail given a different knowledge base. In this paper, we introduce a novel way of characterising supervision signals from a knowledge base, and establish a criterion for determining the knowledge’s efficacy in facilitating successful learning. This, for the first time, allows us to address the two questions above by inspecting the knowledge base under investigation. Our analysis suggests that many knowledge bases satisfy the criterion, thus enabling effective learning, while some fail to satisfy it, indicating potential failures. Comprehensive experiments confirm the utility of our criterion on benchmark tasks.



Paperid:1699
Authors:Zerui Tao, Toshihisa Tanaka, Qibin Zhao
Tokyo University of Agriculture and Technology RIKEN AIP, Tokyo University of Agriculture and Technology RIKEN AIP, RIKEN AIP Tokyo University of Agriculture and Technology
Abstract:
In numerous applications, binary reactions or event counts are observed and stored within high-order tensors. Tensor decompositions (TDs) serve as a powerful tool to handle such high-dimensional and sparse data. However, many traditional TDs are explicitly or implicitly designed based on the Gaussian distribution, which is unsuitable for discrete data. Moreover, most TDs rely on predefined multi-linear structures, such as CP and Tucker formats. Therefore, they may not be effective enough to handle complex real-world datasets. To address these issues, we propose ENTED, an Efficient Nonparametric TEnsor Decomposition for binary and count tensors. Specifically, we first employ a nonparametric Gaussian process (GP) to replace traditional multi-linear structures. Next, we utilize the Pólya-Gamma augmentation which provides a unified framework to establish conjugate models for binary and count distributions. Finally, to address the computational issue of GPs, we enhance the model by incorporating sparse orthogonal variational inference of inducing points, which offers a more effective covariance approximation within GPs and stochastic natural gradient updates for nonparametric models. We evaluate our model on several real-world tensor completion tasks, considering binary and count datasets. The results manifest both better performance and computational advantages of the proposed model.



Paperid:1700
Authors:Yuki Tatsunami, Masato Taki
Rikkyo University AnyTech Co., Ltd., Rikkyo University
Abstract:
Multi-head self-attention (MHSA)-equipped models have achieved notable performance in computer vision. Their computational complexity is quadratic in the number of pixels of the input feature maps, resulting in slow processing, especially when dealing with high-resolution images. New types of token-mixer have been proposed as alternatives to MHSA to circumvent this problem: an FFT-based token-mixer involves global operations similar to MHSA but with lower computational complexity. However, despite its attractive properties, the FFT-based token-mixer has not been carefully examined in terms of its compatibility with the rapidly evolving MetaFormer architecture. Here, we propose a novel token-mixer called Dynamic Filter and novel image recognition models, DFFormer and CDFFormer, to close the gaps above. The results of image classification and downstream tasks, analysis, and visualization show that our models are helpful. Notably, their throughput and memory efficiency when dealing with high-resolution image recognition is remarkable. Our results indicate that Dynamic Filter is one of the token-mixer options that should be seriously considered. The code is available at https://github.com/okojoalg/dfformer
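
For context, a sketch of a basic FFT-based token mixer with a static global filter: transform the spatial dimensions to the frequency domain, multiply by a learnable complex filter, and transform back. This is only background for the discussion above; the proposed Dynamic Filter additionally makes the filter input-dependent, which is not shown here, and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class GlobalFilterMixer(nn.Module):
    def __init__(self, height, width, channels):
        super().__init__()
        # One learnable complex weight per (channel, frequency) bin of the rFFT grid.
        self.filter = nn.Parameter(
            torch.randn(channels, height, width // 2 + 1, dtype=torch.cfloat) * 0.02
        )

    def forward(self, x):                                   # x: (batch, channels, height, width)
        freq = torch.fft.rfft2(x, dim=(-2, -1))             # to frequency domain
        freq = freq * self.filter                           # global mixing via elementwise filter
        return torch.fft.irfft2(freq, s=x.shape[-2:], dim=(-2, -1))
```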



Paperid:1701
Authors:Samuel Teuber, Bernhard Beckert
Karlsruhe Institute of Technology, Karlsruhe Institute of Technology
Abstract:
This work presents insights gained by investigating the relationship between algorithmic fairness and the concept of secure information flow. The problem of enforcing secure information flow is well-studied in the context of information security: If secret information may "flow" through an algorithm or program in such a way that it can influence the program’s output, then that is considered insecure information flow as attackers could potentially observe (parts of) the secret. There is a strong correspondence between secure information flow and algorithmic fairness: if protected attributes such as race, gender, or age are treated as secret program inputs, then secure information flow means that these "secret" attributes cannot influence the result of a program. While most research in algorithmic fairness evaluation concentrates on studying the impact of algorithms (often treating the algorithm as a black-box), the concepts derived from information flow can be used both for the analysis of disparate treatment as well as disparate impact w.r.t. a structural causal model. In this paper, we examine the relationship between quantitative as well as qualitative information-flow properties and fairness. Moreover, based on this duality, we derive a new quantitative notion of fairness called fairness spread, which can be easily analyzed using quantitative information flow and which strongly relates to counterfactual fairness. We demonstrate that off-the-shelf tools for information-flow properties can be used in order to formally analyze a program's algorithmic fairness properties, including the new notion of fairness spread as well as established notions such as demographic parity.
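
To make the analogy concrete with a much simpler statistic than the paper's fairness spread: demographic parity asks whether the "secret" protected attribute influences the program's output, here measured empirically as the gap in positive-prediction rates between groups. The function below is an illustrative toy, not a tool from the paper.

```python
import numpy as np

def demographic_parity_gap(predictions, protected):
    """predictions: 0/1 array of program outputs; protected: array of group labels."""
    rates = [predictions[protected == g].mean() for g in np.unique(protected)]
    return max(rates) - min(rates)   # 0 means the attribute does not shift the output rate
```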



Paperid:1702
Authors:Jidapa Thadajarassiri, Walter Gerych, Xiangnan Kong, Elke Rundensteiner
Srinakharinwirot University, Worcester Polytechnic Institute, Worcester Polytechnic Institute, Worcester Polytechnic Institute
Abstract:
Multi-task learning (MTL) is essential for real-world applications that handle multiple tasks simultaneously, such as self-driving cars. MTL methods improve the performance of all tasks by utilizing information across tasks to learn a robust shared representation. However, acquiring sufficient labeled data tends to be extremely expensive, especially when having to support many tasks. Recently, Knowledge Amalgamation (KA) has emerged as an effective strategy for addressing the lack of labels by instead learning directly from pretrained models (teachers). KA learns one unified multi-task student that masters all tasks across all teachers. Existing KA for MTL works are limited to teachers with identical architectures, and thus propose layer-to-layer based approaches. Unfortunately, in practice, teachers may have heterogeneous architectures; their layers may not be aligned and their dimensionalities or scales may be incompatible. Amalgamating multi-task teachers with heterogeneous architectures remains an open problem. For this, we design Versatile Common Feature Consolidator (VENUS), the first solution to this problem. VENUS fuses knowledge from the shared representations of each teacher into one unified generalized representation for all tasks. Specifically, we design the Feature Consolidator network that leverages an array of teacher-specific trainable adaptors. These adaptors enable the student to learn from multiple teachers, even if they have incompatible learned representations. We demonstrate that VENUS outperforms five alternative methods on numerous benchmark datasets across a broad spectrum of experiments.



Paperid:1703
Authors:Brandon Theodorou, Shrusti Jain, Cao Xiao, Jimeng Sun
University of Illinois at Urbana-Champaign Medisyn Inc., University of Illinois at Urbana-Champaign, GE Healthcare, University of Illinois at Urbana-Champaign Medisyn Inc.
Abstract:
Generative models can produce synthetic patient records for analytical tasks when real data is unavailable or limited. However, current methods struggle with adhering to domain-specific knowledge and removing invalid data. We present ConSequence, an effective approach to integrating domain knowledge into sequential generative neural network outputs. Our rule-based formulation includes temporal aggregation and antecedent evaluation modules, ensured by an efficient matrix multiplication formulation, to satisfy hard and soft logical constraints across time steps. Existing constraint methods often fail to guarantee constraint satisfaction, lack the ability to handle temporal constraints, and hinder the learning and computational efficiency of the model. In contrast, our approach efficiently handles all types of constraints with guaranteed logical coherence. We demonstrate ConSequence's effectiveness in generating electronic health records, outperforming competitors in achieving complete temporal and spatial constraint satisfaction without compromising runtime performance or generative quality. Specifically, ConSequence successfully prevents all rule violations while improving the model quality in reducing its test perplexity by 5% and incurring less than a 13% slowdown in generation speed compared to an unconstrained model.
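The matrix-multiplication view of temporal rules can be illustrated on binary visit records; the sketch below enforces one made-up hard rule ("if code a appeared in any earlier visit, code b is disallowed now") via a lower-triangular aggregation matrix. The rule, code indices, and record shapes are assumptions, not ConSequence's actual rule set or formulation.

```python
import numpy as np

T, C = 5, 4                        # time steps, medical codes
records = np.random.default_rng(1).integers(0, 2, size=(T, C))

# temporal aggregation: lower-triangular matmul marks codes seen strictly earlier
past = (np.tril(np.ones((T, T)), k=-1) @ records) > 0    # [T, C] booleans

# antecedent evaluation for a hypothetical hard rule:
# code a=0 observed in the past forbids code b=2 at the current step
a, b = 0, 2
violating = past[:, a] & (records[:, b] == 1)
records[violating, b] = 0           # enforce the constraint on generated outputs
```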



Paperid:1704
Authors:Jinhao Tian, Zuchao Li, Jiajia Li, Ping Wang
Wuhan University, Wuhan University, Wuhan University, Wuhan University
Abstract:
The first step to apply deep learning techniques for symbolic music understanding is to transform musical pieces (mainly in MIDI format) into sequences of predefined tokens like note pitch, note velocity, and chords. Subsequently, the sequences are fed into a neural sequence model to accomplish specific tasks. Music sequences exhibit strong correlations between adjacent elements, making them prime candidates for N-gram techniques from Natural Language Processing (NLP). Consider classical piano music: specific melodies might recur throughout a piece, with subtle variations each time. In this paper, we propose a novel method, NG-Midiformer, for understanding symbolic music sequences that leverages the N-gram approach. Our method involves first processing music pieces into word-like sequences with our proposed unsupervised compoundation, followed by using our N-gram Transformer encoder, which can effectively incorporate N-gram information to enhance the primary encoder part for better understanding of music sequences. The pre-training process on large-scale music datasets enables the model to thoroughly learn the N-gram information contained within music sequences, and subsequently apply this information for making inferences during the fine-tuning stage. Experiments on various datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance on a series of music understanding downstream tasks. The code and model weights will be released at https://github.com/CinqueOrigin/NG-Midiformer.
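For readers unfamiliar with the N-gram statistics such an encoder consumes, here is a minimal sketch of counting contiguous N-grams over a token sequence; the token names are made up, and the paper's unsupervised compoundation and encoder integration go well beyond this.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# hypothetical MIDI-style tokens
tokens = ["Bar", "Pitch_60", "Vel_80", "Dur_4", "Pitch_64", "Vel_80", "Dur_4"]
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```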



Paperid:1705
Authors:Xiao Tian, Rachael Hwee Ling Sim, Jue Fan, Bryan Kian Hsiang Low
National University of Singapore, National University of Singapore, National University of Singapore, National University of Singapore
Abstract:
Data valuation is concerned with determining a fair valuation of data from data sources to compensate them or to identify training examples that are the most or least useful for predictions. With the rising interest in personal data ownership and data protection regulations, model owners will likely have to fulfil more data deletion requests. This raises issues that have not been addressed by existing works: Are the data valuation scores still fair with deletions? Must the scores be expensively recomputed? The answer is no. To avoid recomputations, we propose using our data valuation framework DeRDaVa upfront for valuing each data source's contribution to preserving robust model performance after anticipated data deletions. DeRDaVa can be efficiently approximated and will assign higher values to data that are more useful or less likely to be deleted. We further generalize DeRDaVa to Risk-DeRDaVa to cater to risk-averse/seeking model owners who are concerned with the worst/best-case model utility. We also empirically demonstrate the practicality of our solutions.
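A minimal Monte Carlo sketch in the spirit of the deletion-aware valuation described above: each source's value is estimated as its expected marginal contribution when every other source independently survives an anticipated deletion with some probability. The toy utility function and survival probabilities are assumptions, and the actual DeRDaVa semivalue and its efficient approximation differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
survive_prob = np.array([0.9, 0.5, 0.8, 0.3, 0.95])   # hypothetical deletion risks

def utility(subset):
    # toy submodular utility; in practice this would be model performance
    return np.sqrt(len(subset))

def deletion_aware_value(i, samples=2000):
    vals = []
    for _ in range(samples):
        kept = {j for j in range(n)
                if j != i and rng.random() < survive_prob[j]}
        vals.append(utility(kept | {i}) - utility(kept))
    return np.mean(vals)

scores = [deletion_aware_value(i) for i in range(n)]
print(np.round(scores, 3))   # sources likely to survive contribute in more coalitions
```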



Paperid:1706
Authors:Quang Truong, Peter Chin
Dartmouth College, Dartmouth College
Abstract:
Graph Neural Networks (GNNs), despite achieving remarkable performance across different tasks, are theoretically bounded by the 1-Weisfeiler-Lehman test, resulting in limitations in terms of graph expressivity. Even though prior works on topological higher-order GNNs overcome that boundary, these models often depend on assumptions about sub-structures of graphs. Specifically, topological GNNs leverage the prevalence of cliques, cycles, and rings to enhance the message-passing procedure. Our study presents a novel perspective by focusing on simple paths within graphs during the topological message-passing process, thus liberating the model from restrictive inductive biases. We prove that by lifting graphs to path complexes, our model can generalize the existing works on topology while inheriting several theoretical results on simplicial complexes and regular cell complexes. Without making prior assumptions about graph sub-structures, our method outperforms earlier works in other topological domains and achieves state-of-the-art results on various benchmarks.



Paperid:1707
Authors:Wenxuan Tu, Renxiang Guan, Sihang Zhou, Chuan Ma, Xin Peng, Zhiping Cai, Zhe Liu, Jieren Cheng, Xinwang Liu
National University of Defense Technology, National University of Defense Technology, National University of Defense Technology, Zhejiang Lab, National University of Defense Technology, National University of Defense Technology, Zhejiang Lab, Hainan University Hainan Blockchain Technology Engineering Research Center, National University of Defense Technology
Abstract:
Deep clustering with attribute-missing graphs, where only a subset of nodes possesses complete attributes while those of others are missing, is an important yet challenging topic in various practical applications. It has become a prevalent learning paradigm in existing studies to perform data imputation first and subsequently conduct clustering using the imputed information. However, these "two-stage" methods disconnect the clustering and imputation processes, preventing the model from effectively learning clustering-friendly graph embedding. Furthermore, they are not tailored for clustering tasks, leading to inferior clustering results. To solve these issues, we propose a novel Attribute-Missing Graph Clustering (AMGC) method to alternately promote clustering and imputation in a unified framework, where we iteratively produce the clustering-enhanced nearest neighbor information to conduct the data imputation process and utilize the imputed information to implicitly refine the clustering distribution through model optimization. Specifically, in the imputation step, we take the learned clustering information as imputation prompts to help each attribute-missing sample gather highly correlated features within its clusters for data completion, such that the intra-class compactness can be improved. Moreover, to support reliable clustering, we maximize inter-class separability by conducting cost-efficient dual non-contrastive learning over the imputed latent features, which in turn promotes greater graph encoding capability for the clustering sub-network. Extensive experiments on five datasets have verified the superiority of AMGC against competitors.



Paperid:1708
Authors:Théo Vincent, Alberto Maria Metelli, Boris Belousov, Jan Peters, Marcello Restelli, Carlo D'Eramo
German Research Center for AI (DFKI) TU Darmstadt, Politecnico di Milano, German Research Center for AI (DFKI), German Research Center for AI (DFKI) Department of Computer Science, TU Darmstadt Hessian.ai Centre for Cognitive Science, TU Darmstadt, Politecnico di Milano, TU Darmstadt Hessian.ai CAIDAS, University of Würzburg
Abstract:
Approximate value iteration (AVI) is a family of algorithms for reinforcement learning (RL) that aims to obtain an approximation of the optimal value function. Generally, AVI algorithms implement an iterated procedure where each step consists of (i) an application of the Bellman operator and (ii) a projection step into a considered function space. Notoriously, the Bellman operator leverages transition samples, which strongly determine its behavior, as uninformative samples can result in negligible updates or long detours, whose detrimental effects are further exacerbated by the computationally intensive projection step. To address these issues, we propose a novel alternative approach based on learning an approximate version of the Bellman operator rather than estimating it through samples as in AVI approaches. This way, we are able to (i) generalize across transition samples and (ii) avoid the computationally intensive projection step. For this reason, we call our novel operator projected Bellman operator (PBO). We formulate an optimization problem to learn PBO for generic sequential decision-making problems, and we theoretically analyze its properties in two representative classes of RL problems. Furthermore, we theoretically study our approach under the lens of AVI and devise algorithmic implementations to learn PBO in offline and online settings by leveraging neural network parameterizations. Finally, we empirically showcase the benefits of PBO w.r.t. the regular Bellman operator on several RL problems.



Paperid:1709
Authors:Kiet Q. H. Vo, Muneeb Aadil, Siu Lun Chau, Krikamol Muandet
CISPA Helmholtz Center for Information Security, Saarbrücken, Germany Saarland University, Saarbrücken, Germany, CISPA Helmholtz Center for Information Security, Saarbrücken, Germany Saarland University, Saarbrücken, Germany, CISPA Helmholtz Center for Information Security, Saarbrücken, Germany, CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Abstract:
We study the problem of agent selection in causal strategic learning under multiple decision makers and address two key challenges that come with it. Firstly, while much of prior work focuses on studying a fixed pool of agents that remains static regardless of their evaluations, we consider the impact of the selection procedure by which agents are not only evaluated, but also selected. When each decision maker unilaterally selects agents by maximising their own utility, we show that the optimal selection rule is a tradeoff between selecting the best agents and providing incentives to maximise the agents' improvement. Furthermore, this optimal selection rule relies on incorrect predictions of agents' outcomes. Hence, we study the conditions under which a decision maker's optimal selection rule will not lead to deterioration of agents' outcomes nor cause unjust reduction in agents' selection chances. To that end, we provide an analytical form of the optimal selection rule and a mechanism to retrieve the causal parameters from observational data, under certain assumptions on agents' behaviour. Secondly, when there are multiple decision makers, the interference between selection rules introduces another source of biases in estimating the underlying causal parameters. To address this problem, we provide a cooperative protocol which all decision makers must collectively adopt to recover the true causal parameters. Lastly, we complement our theoretical results with simulation studies. Our results highlight not only the importance of causal modeling as a strategy to mitigate the effect of gaming, as suggested by previous work, but also the need for a benevolent regulator to enable it.



Paperid:1710
Authors:Leonie von Wahl, Niklas Heidenreich, Prasenjit Mitra, Michael Nolting, Nicolas Tempelmeier
Volkswagen Group, Volkswagen Group, Penn State University, State College, Volkswagen Group, Volkswagen Group
Abstract:
Predictive maintenance has emerged as a critical application in modern transportation, leveraging sensor data to forecast potential damages proactively using machine learning. However, privacy concerns limit data sharing, making federated learning an appealing approach to preserve data privacy. Nevertheless, challenges arise due to disparities in data distribution and temporal unavailability caused by individual usage patterns in transportation. In this paper, we present a novel asynchronous federated learning approach to address system heterogeneity and facilitate machine learning for predictive maintenance on transportation fleets. The approach introduces a novel data-disparity-aware aggregation scheme and a federated early stopping method for training. To validate the effectiveness of our approach, we evaluate it on two independent real-world datasets from the transportation domain: 1) oil dilution prediction of car combustion engines and 2) remaining lifetime prediction of plane turbofan engines. Our experiments show that we reliably outperform five state-of-the-art baselines, including federated and classical machine learning models. Moreover, we show that our approach generalises to various prediction model architectures.



Paperid:1711
Authors:Guancheng Wan, Wenke Huang, Mang Ye
National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence Hubei Key Laboratory of Multimedia and Network Communication Engineering School of Computer Science, Hubei Luojia Laboratory, Wuhan University, Wuhan, China, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence Hubei Key Laboratory of Multimedia and Network Communication Engineering School of Computer Science, Hubei Luojia Laboratory, Wuhan University, Wuhan, China, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence Hubei Key Laboratory of Multimedia and Network Communication Engineering School of Computer Science, Hubei Luojia Laboratory, Wuhan University, Wuhan, China
Abstract:
Federated Graph Learning is a privacy-preserving collaborative approach for training a shared model on graph-structured data in the distributed environment. However, in real-world scenarios, the client graph data usually originate from diverse domains, which unavoidably hinders the generalization performance of the final global model. To address this challenge, we make the first attempt to investigate this scenario by learning a well-generalizable model. In order to improve the performance of the global model from different perspectives, we propose a novel framework called Federated Graph Learning with Generalizable Prototypes (FGGP). It decouples the global model into two levels and bridges them via prototypes. These prototypes, which are semantic centers derived from the feature extractor, can provide valuable classification information. At the classification model level, we innovatively eschew the traditional classifiers and instead leverage clustered prototypes to capture fruitful domain information and enhance the discriminative capability of the classes, improving the performance of multi-domain predictions. Furthermore, at the feature extractor level, we go beyond traditional approaches by implicitly injecting distinct global knowledge and employing contrastive learning to obtain more powerful prototypes while enhancing the feature extractor generalization ability. Experimental results on various datasets are presented to validate the effectiveness of the proposed method.



Paperid:1712
Authors:Wenhai Wan, Xinrui Wang, Ming-Kun Xie, Shao-Yuan Li, Sheng-Jun Huang, Songcan Chen
Nanjing University of Aeronautics and Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing University of Aeronautics and Astronautics
Abstract:
Learning from noisy data has attracted much attention, where most methods focus on closed-set label noise. However, a more common scenario in the real world is the presence of both open-set and closed-set noise. Existing methods typically identify and handle these two types of label noise separately by designing a specific strategy for each type. However, in many real-world scenarios, it would be challenging to identify open-set examples, especially when the dataset has been severely corrupted. Unlike previous works, we explore how models behave when faced with open-set examples, and find that some open-set examples gradually get integrated into certain known classes, which is beneficial for the separation among known classes. Motivated by this phenomenon, we propose a novel two-step contrastive learning method CECL (Class Expansion Contrastive Learning) which aims to deal with both types of label noise by exploiting the useful information of open-set examples. Specifically, we incorporate some open-set examples into closed-set classes to enhance performance while treating others as delimiters to improve representative ability. Extensive experiments on synthetic and real-world datasets with diverse label noise demonstrate the effectiveness of CECL.



Paperid:1713
Authors:Bingzheng Wang, Guoqiang Wu, Teng Pang, Yan Zhang, Yilong Yin
Shandong University, Shandong University, Shandong University, Shandong University, Shandong University
Abstract:
Imitation learning aims to solve the problem of defining reward functions in real-world decision-making tasks. The current popular approach is the Adversarial Imitation Learning (AIL) framework, which matches expert state-action occupancy measures to obtain a surrogate reward for forward reinforcement learning. However, the traditional discriminator is a simple binary classifier and does not learn an accurate distribution, which may result in failing to identify expert-level state-action pairs induced by the policy interacting with the environment. To address this issue, we propose a method named diffusion adversarial imitation learning (DiffAIL), which introduces the diffusion model into the AIL framework. Specifically, DiffAIL models the state-action pairs as unconditional diffusion models and uses the diffusion loss as part of the discriminator's learning objective, which enables the discriminator to capture better expert demonstrations and improve generalization. Experimentally, the results show that our method achieves state-of-the-art performance and significantly surpasses the expert demonstrations on two benchmark tasks, covering the standard state-action setting and the state-only setting.
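The diffusion loss on state-action pairs mentioned above is, at its core, a standard DDPM noise-prediction objective; below is a minimal sketch of such a loss on concatenated state-action vectors. The network size, timestep count, noise schedule, and the 6-dimensional toy input are assumptions, and how DiffAIL combines this term with the discriminator objective is not shown.

```python
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(),
                                 nn.Linear(128, dim))
    def forward(self, x_t, t):
        t_feat = (t.float() / T).unsqueeze(-1)      # crude timestep embedding
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def diffusion_loss(model, x0):
    """Standard DDPM noise-prediction loss on state-action vectors x0."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return ((model(x_t, t) - noise) ** 2).mean()

model = NoisePredictor(dim=6)          # e.g. 4-dim state concatenated with 2-dim action
sa_pairs = torch.randn(32, 6)
loss = diffusion_loss(model, sa_pairs)
```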



Paperid:1714
Authors:Bowen Wang, Chen Liang, Jiaze Wang, Jiezhong Qiu, Furui Liu, Shaogang Hao, Dong Li, Guangyong Chen, Xiaolong Zou, Pheng Ann Heng
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shenzhen Geim Graphene Center, Institute of Materials Research, Tsinghua Shenzhen International Graduate School, Tsinghua University, Department of Computer Science and Engineering, The Chinese University of Hong Kong, Zhejiang Lab, Zhejiang Lab, Tencent, Huawei Noah's Ark Lab, Zhejiang Lab, Shenzhen Geim Graphene Center, Institute of Materials Research, Tsinghua Shenzhen International Graduate School, Tsinghua University, Department of Computer Science and Engineering, The Chinese University of Hong Kong Zhejiang Lab
Abstract:
Attaining the equilibrium geometry of a catalyst-adsorbate system is key to fundamentally assessing its effective properties, such as adsorption energy. While machine learning methods with advanced representation or supervision strategies have been applied to boost and guide the relaxation processes of catalysis systems, existing methods that produce linearly aggregated geometry predictions are susceptible to edge representation ambiguity, and are therefore vulnerable to graph variations. In this paper, we present a novel graph neural network (GNN) supervision and prediction strategy DR-Label. Our approach mitigates the multiplicity of solutions in edge representation and encourages model predictions that are independent of graph structural variations. DR-Label first Deconstructs finer-grained equilibrium state information to the model by projecting the node-level supervision signal to each edge. Reversely, the model Reconstructs a more robust equilibrium state prediction by converting edge-level predictions to node-level via a sphere-fitting algorithm. When applied to three fundamentally different models, DR-Label consistently enhanced performance. Leveraging the graph structure invariance of the DR-Label strategy, we further propose DRFormer, which applies explicit intermediate positional updates and achieves a new state-of-the-art performance on the Open Catalyst 2020 (OC20) dataset and the Cu-based single-atom alloys CO adsorption (SAA) dataset. We expect our work to highlight vital principles for advancing geometric GNN models for catalysis systems and beyond. Our code is available at https://github.com/bowenwang77/DR-Label



Paperid:1715
Authors:Fangyikang Wang, Huminhao Zhu, Chao Zhang, Hanbin Zhao, Hui Qian
College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University, College of Computer Science and Technology, Zhejiang University Advanced Technology Institute, Zhejiang University, College of Computer Science and Technology, Zhejiang University Advanced Technology Institute, Zhejiang University, College of Computer Science and Technology, Zhejiang University State Key Lab of CAD&CG, Zhejiang University
Abstract:
Particle-based Variational Inference (ParVI) methods approximate the target distribution by iteratively evolving finite weighted particle systems. Recent advances in ParVI methods reveal the benefits of accelerated position update strategies and dynamic weight adjustment approaches. In this paper, we propose the first ParVI framework that possesses both accelerated position updates and dynamic weight adjustment simultaneously, named the General Accelerated Dynamic-Weight Particle-based Variational Inference (GAD-PVI) framework. Generally, GAD-PVI simulates the semi-Hamiltonian gradient flow on a novel Information-Fisher-Rao space, which yields an additional decrease in the local functional dissipation. GAD-PVI is compatible with different dissimilarity functionals and associated smoothing approaches under three information metrics. Experiments on both synthetic and real-world data demonstrate the faster convergence and reduced approximation error of GAD-PVI methods over the state-of-the-art.
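For context, the sketch below shows a standard ParVI position update, Stein variational gradient descent (SVGD) with an RBF kernel, driving equally weighted particles toward a 1D standard normal target. The kernel bandwidth, step size, and target are assumptions, and GAD-PVI's accelerated positions and dynamic particle weights are precisely what this baseline lacks.

```python
import numpy as np

def rbf_kernel(x, h=0.5):
    d2 = (x[:, None] - x[None, :]) ** 2
    k = np.exp(-d2 / (2 * h ** 2))
    grad_k = -(x[:, None] - x[None, :]) / h ** 2 * k   # d k(x_j, x_i) / d x_j
    return k, grad_k

def svgd_step(x, step=0.1):
    score = -x                      # grad log density of a standard normal target
    k, grad_k = rbf_kernel(x)
    phi = (k @ score + grad_k.sum(axis=0)) / len(x)
    return x + step * phi

particles = np.random.default_rng(0).uniform(-6, 6, size=50)
for _ in range(500):
    particles = svgd_step(particles)
print(particles.mean(), particles.std())   # approximately 0 and 1
```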



Paperid:1716
Authors:Guiqin Wang, Peng Zhao, Yanjiang Shi, Cong Zhao, Shusen Yang
Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
Knowledge distillation (KD), a technique widely employed in computer vision, has emerged as a de facto standard for improving the performance of small neural networks. However, prevailing KD-based approaches in video tasks primarily focus on designing loss functions and fusing cross-modal information. This overlooks the spatial-temporal feature semantics, resulting in limited advancements in model compression. Addressing this gap, our paper introduces an innovative knowledge distillation framework that uses a generative model to train a lightweight student model. In particular, the framework is organized into two steps: the initial phase is Feature Representation, wherein a generative model-based attention module is trained to represent feature semantics; subsequently, the Generative-based Feature Distillation phase encompasses both Generative Distillation and Attention Distillation, with the objective of transferring attention-based feature semantics with the generative model. The efficacy of our approach is demonstrated through comprehensive experiments on diverse popular datasets, showing considerable improvements in the video action recognition task. Moreover, the effectiveness of our proposed framework is validated on the more intricate video action detection task. Our code is available at https://github.com/aaai-24/Generative-based-KD.



Paperid:1717
Authors:Hao Wang, Shengda Luo, Guosheng Hu, Jianguo Zhang
Research Institute of Trustworthy Autonomous Systems and Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China, Research Institute of Trustworthy Autonomous Systems and Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China, Oosto, BT1 2BE, Belfast, Research Institute of Trustworthy Autonomous Systems and Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China Peng Cheng Lab, Shenzhen, China
Abstract:
Multimodal learning with incomplete input data (missing modality) is very practical and challenging. In this work, we conduct an in-depth analysis of this challenge and find that modality dominance has a significant negative impact on model training, greatly degrading the missing-modality performance. Motivated by Grad-CAM, we introduce a novel indicator, gradients, to monitor and reduce modality dominance, which widely exists in the missing-modality scenario. With the aid of this indicator, we present a novel Gradient-guided Modality Decoupling (GMD) method to decouple the dependency on dominating modalities. Specifically, GMD removes the conflicted gradient components from different modalities to achieve this decoupling, significantly improving the performance. In addition, to flexibly handle modal-incomplete data, we design a parameter-efficient Dynamic Sharing (DS) framework which can adaptively switch on/off the network parameters based on whether one modality is available. We conduct extensive experiments on three popular multimodal benchmarks, including BraTS 2018 for medical segmentation, CMU-MOSI, and CMU-MOSEI for sentiment analysis. The results show that our method can significantly outperform the competitors, showing the effectiveness of the proposed solutions. Our code is released here: https://github.com/HaoWang420/Gradient-guided-Modality-Decoupling.
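A minimal sketch of removing a conflicting gradient component between two modalities, in the spirit of gradient projection methods; the toy gradients and this specific projection rule are illustrative assumptions and may differ from the exact GMD update in the paper.

```python
import torch

def remove_conflict(g_dominant, g_weak):
    """If the dominant modality's gradient conflicts with the weak one
    (negative inner product), project the conflicting component out."""
    dot = torch.dot(g_dominant, g_weak)
    if dot < 0:
        g_dominant = g_dominant - dot / g_weak.norm() ** 2 * g_weak
    return g_dominant

g_img = torch.tensor([1.0, -2.0, 0.5])
g_txt = torch.tensor([-1.0, 1.0, 0.0])
print(remove_conflict(g_img, g_txt))   # component along g_txt removed; new dot product is 0
```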



Paperid:1718
Authors:Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, Weidong Cai
University of Sydney, Dolby Laboratories, Dolby Laboratories, Dolby Laboratories, University of Sydney
Abstract:
Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when the audio modality is involved. On the other hand, automatically generating semantically relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent space of the visual CLIP and the auditory CLAP models. Then we propose a simple yet effective mapper mechanism (V2A-Mapper) to bridge the domain gap by translating the visual input between CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, the pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD) while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluations on two V2A datasets demonstrate the superiority of our proposed method compared to current state-of-the-art approaches - trained with 86% fewer parameters but achieving 53% and 19% improvement in FD and CS, respectively. Supplementary materials such as audio samples are provided at our demo website: https://v2a-mapper.github.io/.



Paperid:1719
Authors:Jia Wang, Wuqiang Su, Zushu Huang, Jie Chen, Chengwen Luo, Jianqiang Li
Shenzhen University, Shenzhen University, Shenzhen University, Shenzhen University, Shenzhen University, Shenzhen University
Abstract:
The Machine-Learning-as-a-Service (MLaaS) framework allows one to grab the low-hanging fruit of machine learning techniques and data science, without either much expertise in this sophisticated sphere or provision of specific infrastructures. However, the requirement of revealing all training data to the service provider raises new concerns in terms of privacy leakage, storage consumption, efficiency, bandwidth, etc. In this paper, we propose a lightweight privacy-preserving MLaaS framework by combining Compressive Sensing (CS) and Generative Networks. It is built on the favorable facts observed in recent works that general inference tasks can be fulfilled with generative networks and classifiers trained on compressed measurements, since the generator can model the data distribution and capture discriminative information which is useful for classification. To improve the performance of the MLaaS framework, the supervised generative models of the server are trained and optimized with prior knowledge provided by the client. In order to prevent the service provider from recovering the original data as well as identifying the queried results, a noise-addition mechanism is designed and adopted in the compressed data domain. Empirical results confirm its performance superiority in accuracy and resource consumption against the state-of-the-art privacy-preserving MLaaS frameworks.



Paperid:1720
Authors:Jiahuan Wang, Hong Chen
Huazhong Agricultural University, Huazhong Agricultural University Engineering Research Center of Intelligent Technology for Agriculture Hubei Engineering Technology Research Center of Agricultural Big Data
Abstract:
Decentralized Stochastic Gradient Descent (DSGD) represents an efficient communication approach tailored for mastering insights from vast, distributed datasets. Inspired by parallel optimization paradigms, the incorporation of minibatches serves to diminish variance, consequently expediting the optimization process. Nevertheless, as per our current understanding, the existing literature has not thoroughly explored the learning theory foundation of Decentralized Minibatch Stochastic Gradient Descent (DM-SGD). In this paper, we try to address this theoretical gap by investigating the generalization properties of DM-SGD. We establish sharper generalization bounds for the DM-SGD algorithm with replacement (and without replacement) in (non)convex and (non)smooth cases. Moreover, our results consistently recover the results of Centralized Stochastic Gradient Descent (C-SGD). In addition, we derive a generalization analysis for the Zero-Order (ZO) version of DM-SGD.
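A minimal NumPy sketch of what one decentralized minibatch SGD round can look like on a toy least-squares problem: every node first averages parameters with its neighbours through a doubly stochastic gossip matrix and then takes a local minibatch gradient step. The ring topology, step size, and data generation are illustrative assumptions, not the setting analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim, n_local = 4, 3, 200
w_true = rng.normal(size=dim)
X = [rng.normal(size=(n_local, dim)) for _ in range(n_nodes)]
y = [Xi @ w_true + 0.1 * rng.normal(size=n_local) for Xi in X]

# ring topology, doubly stochastic mixing matrix
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])

w = np.zeros((n_nodes, dim))
for _ in range(300):
    w = W @ w                                   # gossip averaging step
    for i in range(n_nodes):                    # local minibatch gradient step
        idx = rng.choice(n_local, size=32, replace=False)
        grad = X[i][idx].T @ (X[i][idx] @ w[i] - y[i][idx]) / 32
        w[i] -= 0.05 * grad
print(np.abs(w.mean(axis=0) - w_true).max())    # should be close to zero
```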



Paperid:1721
Authors:Jing Wang, Songhe Feng, Gengyu Lyu, Jiazheng Yuan
Key Laboratory of Big Data and Artificial Intelligence in Transportation (Ministry of Education), School of Computer and Information Technology, Beijing Jiaotong University, Key Laboratory of Big Data and Artificial Intelligence in Transportation (Ministry of Education), School of Computer and Information Technology, Beijing Jiaotong University, Engineering Research Center of Intelligence Perception and Autonomous Control (Ministry of Education), Beijing University of Technology, College of Science and Technology, Beijing Open University
Abstract:
Deep Multi-view Graph Clustering (DMGC) aims to partition instances into different groups using the graph information extracted from multi-view data. The mainstream framework of DMGC methods applies graph neural networks to embed structure information into the view-specific representations and fuse them for the consensus representation. However, on one hand, we find that the graph learned in advance is not ideal for clustering, as it is constructed from the original multi-view data with localized connections. On the other hand, most existing methods learn the consensus representation in a late fusion manner, which fails to propagate the structure relations across multiple views. Inspired by these observations, we propose a Structure-adaptive Unified gRaph nEural network for multi-view clusteRing (SURER), which can jointly learn a heterogeneous multi-view unified graph and robust graph neural networks for multi-view clustering. Specifically, we first design a graph structure learning module to refine the original view-specific attribute graphs, which removes false edges and discovers potential connections. According to the view-specific refined attribute graphs, we integrate them into a unified heterogeneous graph by linking the representations of the same sample from different views. Furthermore, we use the unified heterogeneous graph as the input of the graph neural network to learn the consensus representation for each instance, effectively integrating complementary information from various views. Extensive experiments on diverse datasets demonstrate the superior effectiveness of our method compared to other state-of-the-art approaches.



Paperid:1722
Authors:Liang Wang, Xiang Tao, Qiang Liu, Shu Wu, Liang Wang
Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences
Abstract:
Self-supervised learning on graphs can be bifurcated into contrastive and generative methods. Contrastive methods, also known as graph contrastive learning (GCL), have dominated graph self-supervised learning in the past few years, but the recent advent of graph masked autoencoder (GraphMAE) rekindles the momentum behind generative methods. Despite the empirical success of GraphMAE, there is still a dearth of theoretical understanding regarding its efficacy. Moreover, while both generative and contrastive methods have been shown to be effective, their connections and differences have yet to be thoroughly investigated. Therefore, we theoretically build a bridge between GraphMAE and GCL, and prove that the node-level reconstruction objective in GraphMAE implicitly performs context-level GCL. Based on our theoretical analysis, we further identify the limitations of the GraphMAE from the perspectives of alignment and uniformity, which have been considered as two key properties of high-quality representations in GCL. We point out that GraphMAE's alignment performance is restricted by the masking strategy, and the uniformity is not strictly guaranteed. To remedy the aforementioned limitations, we propose an Alignment-Uniformity enhanced Graph Masked AutoEncoder, named AUG-MAE. Specifically, we propose an easy-to-hard adversarial masking strategy to provide hard-to-align samples, which improves the alignment performance. Meanwhile, we introduce an explicit uniformity regularizer to ensure the uniformity of the learned representations. Experimental results on benchmark datasets demonstrate the superiority of our model over existing state-of-the-art methods. The code is available at: https://github.com/AzureLeon1/AUG-MAE.
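The alignment and uniformity properties mentioned above have simple empirical estimators (in the style of Wang and Isola's losses); below is a minimal sketch of both on L2-normalized embeddings. The batch size, embedding dimension, and loss weight are assumptions, and whether AUG-MAE uses exactly these estimators is not implied.

```python
import torch
import torch.nn.functional as F

def alignment_loss(z1, z2, alpha=2):
    """Mean distance between positive pairs (lower = better aligned)."""
    return (z1 - z2).norm(dim=1).pow(alpha).mean()

def uniformity_loss(z, t=2):
    """Log of the average Gaussian potential over all pairs (lower = more uniform)."""
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()

z1 = F.normalize(torch.randn(128, 64), dim=1)               # anchor embeddings
z2 = F.normalize(z1 + 0.1 * torch.randn(128, 64), dim=1)    # reconstructed views
loss = alignment_loss(z1, z2) + 0.5 * uniformity_loss(z1)
```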



Paperid:1723
Authors:Luzhi Wang, Dongxiao He, He Zhang, Yixin Liu, Wenjie Wang, Shirui Pan, Di Jin, Tat-Seng Chua
College of Intelligence and Computing, Tianjin University, College of Intelligence and Computing, Tianjin University, Faculty of Information Technology, Monash University, Faculty of Information Technology, Monash University, School of Computing, National University of Singapore, School of Information and Communication Technology, Griffith University, College of Intelligence and Computing, Tianjin University, School of Computing, National University of Singapore
Abstract:
Graph neural networks (GNNs) have found widespread application in modeling graph data across diverse domains. While GNNs excel in scenarios where the testing data shares the distribution of their training counterparts (in-distribution, ID), they often exhibit incorrect predictions when confronted with samples from an unfamiliar distribution (out-of-distribution, OOD). To identify and reject OOD samples with GNNs, recent studies have explored graph OOD detection, often focusing on training a specific model or modifying the data on top of a well-trained GNN. Despite their effectiveness, these methods come with heavy training resources and costs, as they need to optimize the GNN-based models on training data. Moreover, their reliance on modifying the original GNNs and accessing training data further restricts their universality. To this end, this paper introduces a method to detect Graph Out-of-Distribution At Test-time (namely GOODAT), a data-centric, unsupervised, and plug-and-play solution that operates independently of training data and modifications of GNN architecture. With a lightweight graph masker, GOODAT can learn informative subgraphs from test samples, enabling the capture of distinct graph patterns between OOD and ID samples. To optimize the graph masker, we meticulously design three unsupervised objective functions based on the graph information bottleneck principle, motivating the masker to capture compact yet informative subgraphs for OOD detection. Comprehensive evaluations confirm that our GOODAT method outperforms state-of-the-art benchmarks across a variety of real-world datasets.



Paperid:1724
Authors:Min Wang, Xin Li, Leiji Zhang, Mingzhong Wang
Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology, The University of the Sunshine Coast
Abstract:
Meta-Reinforcement Learning (Meta-RL) aims to reveal shared characteristics in dynamics and reward functions across diverse training tasks. This objective is achieved by meta-learning a policy that is conditioned on task representations with encoded trajectory data or context, thus allowing rapid adaptation to new tasks from a known task distribution. However, since the trajectory data generated by the policy may be biased, the task inference module tends to form spurious correlations between trajectory data and specific tasks, thereby leading to poor adaptation to new tasks. To address this issue, we propose Meta-RL with task unCertAinty feedback through decoupled context-aware Reward and Dynamics components (MetaCARD). MetaCARD distinctly decouples the dynamics and rewards when inferring tasks and integrates task uncertainty feedback from policy evaluation into the task inference module. This design effectively reduces uncertainty in tasks with changes in dynamics and/or reward functions, thereby enabling accurate task identification and adaptation. The experimental results on both Meta-World and classical MuJoCo benchmarks show that MetaCARD significantly outperforms prevailing Meta-RL baselines, demonstrating its remarkable adaptation ability in sophisticated environments that involve changes in both reward functions and dynamics.



Paperid:1725
Authors:Qian-Wei Wang, Bowen Zhao, Mingyan Zhu, Tianxiang Li, Zimo Liu, Shu-Tao Xia
Tsinghua Shenzhen International Graduate School, Tsinghua University Research Center of Artificial Intelligence, Peng Cheng Laboratory, Tsinghua Shenzhen International Graduate School, Tsinghua University, Tsinghua Shenzhen International Graduate School, Tsinghua University, Tsinghua Shenzhen International Graduate School, Tsinghua University, Research Center of Artificial Intelligence, Peng Cheng Laboratory, Tsinghua Shenzhen International Graduate School, Tsinghua University Research Center of Artificial Intelligence, Peng Cheng Laboratory
Abstract:
Partial label learning (PLL) learns from training examples each associated with multiple candidate labels, among which only one is valid. In recent years, benefiting from the strong capability of dealing with ambiguous supervision and the impetus of modern data augmentation methods, consistency regularization-based PLL methods have achieved a series of successes and become mainstream. However, as the partial annotation becomes insufficient, their performance drops significantly. In this paper, we leverage easily accessible unlabeled examples to facilitate the partial label consistency regularization. In addition to a partial supervised loss, our method performs a controller-guided consistency regularization at both the label level and the representation level with the help of unlabeled data. To minimize the disadvantages of insufficient capabilities of the initial supervised model, we use the controller to estimate the confidence of each current prediction to guide the subsequent consistency regularization. Furthermore, we dynamically adjust the confidence thresholds so that the number of samples of each class participating in consistency regularization remains roughly equal, to alleviate the problem of class imbalance. Experiments show that our method achieves satisfactory performance in more practical situations, and its modules can be applied to existing PLL methods to enhance their capabilities.
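A minimal sketch of the class-balancing idea in the last design choice above: choose per-class confidence thresholds so that roughly the same number of samples per predicted class pass into consistency regularization. The quota, the synthetic probabilities, and the quantile-style selection are assumptions, not the paper's exact adjustment rule.

```python
import numpy as np

def classwise_thresholds(probs, quota):
    """probs: [N, C] predicted probabilities; keep roughly `quota` samples per class."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    thresholds = np.ones(probs.shape[1])          # empty classes keep threshold 1.0
    for c in range(probs.shape[1]):
        conf_c = np.sort(conf[preds == c])[::-1]
        if len(conf_c):
            thresholds[c] = conf_c[min(quota, len(conf_c)) - 1]
    return thresholds

probs = np.random.default_rng(0).dirichlet(np.ones(5), size=1000)
th = classwise_thresholds(probs, quota=50)
mask = probs.max(axis=1) >= th[probs.argmax(axis=1)]   # samples used for regularization
```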



Paperid:1726
Authors:Qingsong Wang, Zehui Liu, Chunfeng Cui, Deren Han
Xiangtan University, Beihang University, Beihang University, Beihang University
Abstract:
In this paper, we explore a specific optimization problem that involves the combination of a differentiable nonconvex function and a nondifferentiable function. The differentiable component lacks a global Lipschitz continuous gradient, posing challenges for optimization. To address this issue and accelerate the convergence, we propose a Bregman proximal stochastic gradient method with extrapolation (BPSGE), which only requires smooth adaptivity of the differentiable part. Under the variance reduction framework, we not only analyze the subsequential and global convergence of the proposed algorithm under certain conditions, but also analyze the sublinear convergence rate of the subsequence and the complexity of the algorithm, revealing that the BPSGE algorithm requires at most O(ε^(-2)) iterations in expectation to attain an ε-stationary point. To validate the effectiveness of our proposed algorithm, we conduct numerical experiments on three real-world applications: graph regularized nonnegative matrix factorization (NMF), matrix factorization with weakly-convex regularization, and NMF with nonconvex sparsity constraints. These experiments demonstrate that BPSGE is faster than the baselines without extrapolation. The code is available at: https://github.com/nothing2wang/BPSGE-Algorithm.
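With the Euclidean kernel, the Bregman proximal step reduces to the ordinary proximal step, so the flavor of a stochastic proximal gradient update with extrapolation can be shown on an L1-regularized least-squares toy problem, as below. The problem, step size, and extrapolation weight are simplifying assumptions, and the smooth-adaptive Bregman kernel and variance reduction that BPSGE actually relies on are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 50))
x_true = np.zeros(50); x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.normal(size=500)
lam, step, beta = 0.01, 5e-3, 0.5

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x, x_prev = np.zeros(50), np.zeros(50)
for k in range(3000):
    z = x + beta * (x - x_prev)                      # extrapolation step
    idx = rng.choice(500, size=64, replace=False)    # stochastic minibatch gradient
    grad = A[idx].T @ (A[idx] @ z - b[idx]) / 64
    x_prev, x = x, soft_threshold(z - step * grad, step * lam)
print(np.round(x[:6], 2))   # first five entries close to 1, sixth close to 0
```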



Paperid:1727
Authors:Qixin Wang, Chaoqiong Fan, Tianyuan Jia, Yuyang Han, Xia Wu
Beijing Normal University, Beijing Normal University, Beijing Normal University, Beijing Normal University, Beijing Normal University
Abstract:
Cross-sensory interaction is a key aspect of multisensory recognition. Without cross-sensory interaction, artificial neural networks show inferior performance in multisensory recognition. On the contrary, the human brain has an inherently remarkable ability in multisensory recognition, which stems from the diverse neurons that exhibit distinct responses to sensory inputs, especially the multisensory neurons with multisensory responses, hence enabling cross-sensory interaction. Based on this neuronal diversity, we propose a Neuronal Diversity inspired Multisensory Recognition Model (ND-MRM), which, similar to the brain, comprises unisensory neurons and multisensory neurons. To reflect the different response characteristics of diverse neurons in the brain, special connection constraints are innovatively designed to regulate feature transmission in the ND-MRM. Leveraging this novel concept of neuronal diversity, our model is biologically plausible, enabling more effective recognition of multisensory information. To validate the performance of the proposed ND-MRM, we employ a multisensory emotion recognition task as a case study. The results demonstrate that our model surpasses state-of-the-art brain-inspired baselines on two datasets, proving the potential of brain-inspired methods for advancing multisensory interaction and recognition.



Paperid:1728
Authors:Runqi Wang, Huixin Sun, Linlin Yang, Shaohui Lin, Chuanjian Liu, Yan Gao, Yao Hu, Baochang Zhang
ASEE, EIE and Hangzhou Research Institute, Beihang University, ASEE, EIE and Hangzhou Research Institute, Beihang University, State Key Laboratory of Media Convergence and Communication, Communication University of China, School of Computer Science and Technology, East China Normal University, Huawei Noah's Ark Lab, Xiaohongshu Inc, Xiaohongshu Inc, ASEE, EIE and Hangzhou Research Institute, Beihang University Zhongguancun Laboratory Nanchang Institute of Technology
Abstract:
DEtection TRansformer (DETR)-based models have achieved remarkable performance. However, they are accompanied by a large computation overhead, which significantly prevents their application on resource-limited devices. Prior arts attempt to reduce the computational burden of DETR using low-bit quantization, but these methods suffer a severe performance drop under weight-activation-attention low-bit quantization. We observe that the number of matching queries and positive samples strongly affects the representation capacity of queries in DETR, and quantizing the queries of DETR further reduces its representational capacity, thus leading to a severe performance drop. We introduce a new quantization strategy based on Auxiliary Queries for DETR (AQ-DETR), aiming to enhance the capacity of quantized queries. In addition, a layer-by-layer distillation is proposed to reduce the quantization error between quantized attention and its full-precision counterpart. Through our extensive experiments on large-scale open datasets, the performance of the 4-bit quantization of DETR and Deformable DETR models is comparable to that of their full-precision counterparts.



Paperid:1729
Authors:Shengbo Wang, Ke Li
University of Electronic Science and Technology of China, University of Exeter
Abstract:
The partially observable constrained optimization problems (POCOPs) impede data-driven optimization techniques since an infeasible solution of POCOPs can provide little information about the objective as well as the constraints. We endeavor to design an efficient and provable method for expensive POCOPs under the framework of constrained Bayesian optimization. Our method consists of two key components. Firstly, we present an improved design of the acquisition functions that introduce balanced exploration during optimization. We rigorously study the convergence properties of this design to demonstrate its effectiveness. Secondly, we propose Gaussian processes embedding different likelihoods as the surrogate model for partially observable constraints. This model leads to a more accurate representation of the feasible regions compared to traditional classification-based models. Our proposed method is empirically studied on both synthetic and real-world problems. The results demonstrate the competitiveness of our method for solving POCOPs.



Paperid:1730
Authors:Shufan Wang, Guojun Xiong, Jian Li
Stony Brook University, Stony Brook University, Stony Brook University
Abstract:
Restless multi-armed bandits (RMAB) have been widely used to model sequential decision making problems with constraints. The decision maker (DM) aims to maximize the expected total reward over an infinite horizon under an “instantaneous activation constraint” that at most B arms can be activated at any decision epoch, where the state of each arm evolves stochastically according to a Markov decision process (MDP). However, this basic model fails to provide any fairness guarantee among arms. In this paper, we introduce RMAB-F, a new RMAB model with “long-term fairness constraints”, where the objective now is to maximize the long-term reward while a minimum long-term activation fraction for each arm must be satisfied. For the online RMAB-F setting (i.e., the underlying MDPs associated with each arm are unknown to the DM), we develop a novel reinforcement learning (RL) algorithm named Fair-UCRL. We prove that Fair-UCRL ensures probabilistic sublinear bounds on both the reward regret and the fairness violation regret. Compared with off-the-shelf RL methods, our Fair-UCRL is much more computationally efficient since it leverages a novel low-complexity index policy for making decisions. Experimental results further demonstrate the effectiveness of our Fair-UCRL.



Paperid:1731
Authors:Wanying Wang, Yichen Zhu, Yirui Zhou, Chaomin Shen, Jian Tang, Zhiyuan Xu, Yaxin Peng, Yangchun Zhang
Department of Mathematics, College of Science, Shanghai University, Midea Group, Department of Mathematics, College of Science, Shanghai University, East China Normal University, Midea Group, Midea Group, Department of Mathematics, College of Science, Shanghai University, Department of Mathematics, College of Science, Shanghai University
Abstract:
Generative Adversarial Imitation Learning (GAIL) stands as a cornerstone approach in imitation learning. This paper investigates the gradient explosion in two types of GAIL: GAIL with deterministic policy (DE-GAIL) and GAIL with stochastic policy (ST-GAIL). We begin with the observation that the training can be highly unstable for DE-GAIL at the beginning of the training phase and can end in divergence. Conversely, the ST-GAIL training trajectory remains consistent, reliably converging. To shed light on these disparities, we provide an explanation from a theoretical perspective. By establishing a probabilistic lower bound for GAIL, we demonstrate that gradient explosion is an inevitable outcome for DE-GAIL due to occasionally large expert-imitator policy disparity, whereas ST-GAIL does not suffer from this issue. To substantiate our assertion, we illustrate how modifications in the reward function can mitigate the gradient explosion challenge. Finally, we propose CREDO, a simple yet effective strategy that clips the reward function during the training phase, allowing GAIL to enjoy high data efficiency and stable trainability.
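A minimal sketch of clipping an AIL surrogate reward derived from the discriminator output, which is the general mechanism the abstract describes; the particular surrogate form -log(1 - D) and the clipping range are illustrative assumptions, not CREDO's exact choices.

```python
import numpy as np

def clipped_ail_reward(d_out, r_min=0.0, r_max=10.0, eps=1e-8):
    """Surrogate reward from discriminator probability D(s, a), clipped so its
    magnitude (and hence its gradient) stays bounded during training."""
    reward = -np.log(1.0 - d_out + eps)      # a common AIL surrogate reward
    return np.clip(reward, r_min, r_max)

d_out = np.array([0.1, 0.5, 0.999999, 1.0])  # near-1 outputs would explode unclipped
print(clipped_ail_reward(d_out))             # [0.105 0.693 10. 10.] (approximately)
```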



Paperid:1732
Authors:Wenjie Wang, Pengfei Tang, Jian Lou, Yuanming Shao, Lance Waller, Yi-an Ko, Li Xiong
ShanghaiTech University, Emory University, ZJU-Hangzhou Global Scientific and Technological Innovation Center, ShanghaiTech University, Emory University, Emory University, Emory University
Abstract:
Integrating electronic health records (EHR) into machine learning-driven clinical research and hospital applications is important, as it harnesses extensive and high-quality patient data to enhance outcome predictions and treatment personalization. Nonetheless, due to privacy and security concerns, the secondary use of EHR data is consistently governed and regulated, primarily for research purposes, thereby constraining researchers' access to EHR data. Generating synthetic EHR data with deep learning methods is a viable and promising approach to mitigate privacy concerns, offering not only a supplementary resource for downstream applications but also sidestepping the confidentiality risks associated with real patient data. While prior efforts have concentrated on EHR data synthesis, significant challenges persist in the domain of generating synthetic EHR data: balancing the heterogeneity of real EHRs, including temporal and non-temporal features, addressing missing values and irregular measures, and ensuring the privacy of the real data used for model training. Existing works in this domain only focused on solving one or two of the aforementioned challenges. In this work, we propose IGAMT, an innovative framework to generate privacy-preserved synthetic EHR data that not only maintains high quality with heterogeneous features, missing values, and irregular measures but also balances the privacy-utility trade-off. Extensive experiments prove that IGAMT significantly outperforms baseline architectures in terms of visual resemblance and achieves comparable performance in downstream applications. Ablation case studies also prove the effectiveness of the techniques applied in IGAMT.



Paperid:1733
Authors:Ximei Wang, Junwei Pan, Xingzhuo Guo, Dapeng Liu, Jie Jiang
Tencent Inc., Tencent Inc., Tsinghua University, Tencent Inc., Tencent Inc.
Abstract:
Multi-domain learning (MDL) aims to train a model with minimal average risk across multiple overlapping but non-identical domains. To tackle the challenges of dataset bias and domain domination, numerous MDL approaches have been proposed from the perspectives of seeking commonalities by aligning distributions to reduce the domain gap or reserving differences by implementing domain-specific towers, gates, and even experts. MDL models are becoming more and more complex, with sophisticated network architectures or loss functions, introducing extra parameters and enlarging computation costs. In this paper, we propose a frustratingly easy and hyperparameter-free multi-domain learning method named Decoupled Training (D-Train). D-Train is a tri-phase general-to-specific training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multi-heads, and finally fine-tunes the heads by fixing the backbone, enabling decoupled training to achieve domain independence. Despite its extraordinary simplicity and efficiency, D-Train performs remarkably well in extensive evaluations of various datasets from standard benchmarks to applications of satellite imagery and recommender systems.
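One possible reading of the tri-phase schedule above, as a minimal PyTorch skeleton: warm up a shared backbone with per-domain heads on all domains, post-train each domain's branch, then freeze the backbone and fine-tune only the heads. Layer sizes, the placeholder loader, and the exact parameter groups trained in each phase are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
heads = nn.ModuleList([nn.Linear(64, 2) for _ in range(3)])   # one head per domain

def make_batch(domain):           # placeholder loader: returns (features, labels)
    return torch.randn(16, 32), torch.randint(0, 2, (16,))

def train(params, domains, steps):
    opt = torch.optim.SGD(params, lr=0.1)
    for _ in range(steps):
        for d in domains:
            x, y = make_batch(d)
            loss = nn.functional.cross_entropy(heads[d](backbone(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()

# Phase 1: pre-train everything on all domains to warm up a root model.
train(list(backbone.parameters()) + list(heads.parameters()), domains=[0, 1, 2], steps=100)
# Phase 2: post-train on each domain separately so the split heads specialize.
for d in range(3):
    train(list(backbone.parameters()) + list(heads[d].parameters()), domains=[d], steps=50)
# Phase 3: freeze the backbone and fine-tune only the heads.
for p in backbone.parameters():
    p.requires_grad_(False)
train([p for h in heads for p in h.parameters()], domains=[0, 1, 2], steps=50)
```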



Paperid:1734
Authors:Yan Wang, Chuan-Xian Ren, Yi-Ming Zhai, You-Wei Luo, Hong Yan
School of Mathematics, Sun Yat-Sen University, China, School of Mathematics, Sun Yat-Sen University, China, School of Mathematics, Sun Yat-Sen University, China, School of Mathematics, Sun Yat-Sen University, China, Department of Electrical Engineering, City University of Hong Kong, Hong Kong
Abstract:
Optimal transport (OT) is an important methodology to measure distribution discrepancy, which has achieved promising performance in artificial intelligence applications, e.g., unsupervised domain adaptation. However, from the view of transportation, there are still limitations: 1) the local discriminative structures for downstream tasks, e.g., cluster structure for classification, cannot be explicitly admitted by the learned OT plan; 2) the entropy regularization induces a dense OT plan with increasing uncertainty. To tackle these issues, we propose a novel Probability-Polarized OT (PPOT) framework, which can characterize the structure of OT plan explicitly. Specifically, the probability polarization mechanism is proposed to guide the optimization direction of OT plan, which generates a clear margin between similar and dissimilar transport pairs and reduces the uncertainty. Further, a dynamic mechanism for margin is developed by incorporating task-related information into the polarization, which directly captures the intra/inter class correspondence for knowledge transportation. A mathematical understanding for PPOT is provided from the view of gradient, which ensures interpretability. Extensive experiments on several datasets validate the effectiveness and empirical efficiency of PPOT.



Paperid:1735
Authors:Yejiang Wang, Yuhai Zhao, Zhengkui Wang, Wen Shan, Xingwei Wang
Northeastern University, China, Northeastern University, China, Singapore Institute of Technology, Singapore University of Social Sciences, Northeastern University, China
Abstract:
Limited-supervised multi-label learning (LML) leverages weak or noisy supervision for multi-label classification model training over data with label noise, which contain missing labels and/or redundant labels. Existing studies usually solve LML problems by assuming that label noise is independent of the input features and class labels, while ignoring the fact that noisy labels may depend on the input features (instance-dependent) and the classes (label-dependent) in many real-world applications. In this paper, we propose limited-supervised Multi-label Learning with Dependency Noise (MLDN) to simultaneously identify the instance-dependent and label-dependent label noise by factorizing the noise matrix as the outputs of a mapping from the feature and label representations. Meanwhile, we regularize the problem with a manifold constraint on the noise matrix to preserve local relationships and uncover the manifold structure. Theoretically, we bound the noise recovery error for the resulting problem. We solve the problem using a first-order scheme based on the proximal operator, whose convergence rate is at least sub-linear. Extensive experiments conducted on various datasets demonstrate the superiority of our proposed method.



Paperid:1736
Authors:Yibo Wang, Wenhao Yang, Wei Jiang, Shiyin Lu, Bing Wang, Haihong Tang, Yuanyu Wan, Lijun Zhang
National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University, Alibaba Group, Alibaba Group, Alibaba Group, School of Software Technology, Zhejiang University, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University
Abstract:
Projection-free online learning has drawn increasing interest due to its efficiency in solving high-dimensional problems with complicated constraints. However, most existing projection-free online methods focus on minimizing the static regret, which unfortunately fails to capture the challenge of changing environments. In this paper, we investigate non-stationary projection-free online learning and choose dynamic regret and adaptive regret to measure the performance. Specifically, we first provide a novel dynamic regret analysis for an existing projection-free method named BOGD_IP, and establish an O(T^{3/4}(1+P_T)) dynamic regret bound, where P_T denotes the path length of the comparator sequence. Then, we improve the upper bound to O(T^{3/4}(1+P_T)^{1/4}) by running multiple BOGD_IP algorithms with different step sizes in parallel and tracking the best one on the fly. Our results are the first general-case dynamic regret bounds for projection-free online learning and can recover the existing O(T^{3/4}) static regret by setting P_T = 0. Furthermore, we propose a projection-free method that attains an O(τ^{3/4}) adaptive regret bound for any interval of length τ, which nearly matches the static regret over that interval. The essential idea is to maintain a set of BOGD_IP algorithms dynamically and combine them by a meta-algorithm. Moreover, we demonstrate that it is also equipped with an O(T^{3/4}(1+P_T)^{1/4}) dynamic regret bound. Finally, empirical studies verify our theoretical findings.
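
The parallel-copies idea can be illustrated with a generic sketch: several instances of a base online learner (plain projected online gradient descent stands in for BOGD_IP here) run with different step sizes, and a multiplicative-weights meta-algorithm tracks the best one on the fly. The loss function, step-size grid, and meta learning rate are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Generic sketch of "multiple step sizes + meta-algorithm": each expert is a
# copy of a base online learner with its own step size; a Hedge-style
# meta-algorithm weights the experts by their past losses.
dim, T = 5, 1000
step_sizes = [T ** (-0.25) * (2 ** k) for k in range(4)]   # geometric grid
experts = [np.zeros(dim) for _ in step_sizes]
log_weights = np.zeros(len(step_sizes))                    # meta-weights in log-space
eta_meta = np.sqrt(8 * np.log(len(step_sizes)) / T)

def loss_and_grad(x, t):
    target = np.sin(np.arange(dim) + 0.01 * t)              # slowly drifting target
    return 0.5 * np.sum((x - target) ** 2), x - target

for t in range(T):
    w = np.exp(log_weights - log_weights.max()); w /= w.sum()
    decision = sum(wi * xi for wi, xi in zip(w, experts))    # combined prediction
    for i, (eta, x) in enumerate(zip(step_sizes, experts)):
        ell, g = loss_and_grad(x, t)
        log_weights[i] -= eta_meta * ell                      # meta-update (Hedge)
        x = x - eta * g                                       # base online update
        experts[i] = x / max(1.0, np.linalg.norm(x))          # project to the unit ball
```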



Paperid:1737
Authors:Yifeng Wang, Yi Zhao
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen
Abstract:
As attitude and motion sensing components, inertial sensors are widely used in various portable devices covering consumer electronics, sports and health, aerospace, etc. However, the severe intrinsic errors of inertial sensors heavily restrict their functionality, especially advanced functions such as motion trajectory recovery and motion semantic recognition, which have attracted considerable attention. As a mainstream signal processing method, the wavelet transform is hailed as a mathematical microscope for signals due to its plentiful and diverse basis functions. However, the complicated noise types and application scenarios of inertial sensors make selecting a wavelet basis perplexing. To this end, we propose a wavelet dynamic selection network (WDSNet), which intelligently selects the appropriate wavelet basis for variable inertial signals. In addition, existing deep learning architectures excel at extracting features from input data but neglect to learn the characteristics of target categories, which are essential for enhancing category awareness and thereby improving wavelet basis selection. Therefore, we propose a category representation mechanism (CRM), which enables the network to extract and represent category features without increasing trainable parameters. Furthermore, CRM transforms the common fully connected network into category representations, which provide closer supervision to the feature extractor than distant and trivial one-hot classification labels. We call this process of imposing interpretability on a network and using it to supervise the feature extractor the feature supervision mechanism; its effectiveness is demonstrated experimentally and theoretically in this paper. The enhanced inertial signals can support tasks that are impracticable with the original signals, such as trajectory reconstruction. Both quantitative and visual results show that WDSNet outperforms existing methods. Remarkably, WDSNet, as a weakly-supervised method, achieves state-of-the-art performance compared with all evaluated fully-supervised methods.



Paperid:1738
Authors:Yimu Wang, Yihan Wu, Hongyang Zhang
University of Waterloo, University of Maryland, College Park, University of Waterloo
Abstract:
We show a hardness result for the number of training domains required to achieve a small population error in the test domain. Although many domain generalization algorithms have been developed under various domain-invariance assumptions, there is significant evidence that the out-of-distribution (o.o.d.) test accuracy of state-of-the-art o.o.d. algorithms is on par with empirical risk minimization and random guessing on domain generalization benchmarks such as DomainBed. In this work, we analyze the cause and attribute the failure of domain generalization to the lack of training domains. We show, in a minimax lower bound fashion, that any learning algorithm that outputs a classifier with an ε excess error over the Bayes optimal classifier requires at least poly(1/ε) training domains, even when the number of training data sampled from each training domain is large. Experiments on the DomainBed benchmark demonstrate that o.o.d. test accuracy increases monotonically as the number of training domains increases. Our result sheds light on the intrinsic hardness of domain generalization and suggests benchmarking o.o.d. algorithms on datasets with a sufficient number of training domains.



Paperid:1739
Authors:Yu Wang, Yuxuan Yin, Karthik Somayaji NS, Ján Drgoňa, Malachi Schram, Mahantesh Halappanavar, Frank Liu, Peng Li
University of California, Santa Barbara, University of California, Santa Barbara, University of California, Santa Barbara, Pacific Northwest National Laboratory, Thomas Jefferson National Accelerator Facility, Pacific Northwest National Laboratory, Oak Ridge National Laboratory, University of California, Santa Barbara
Abstract:
Modeling dynamical systems is crucial for a wide range of tasks, but it remains challenging due to complex nonlinear dynamics, limited observations, or lack of prior knowledge. Recently, data-driven approaches such as Neural Ordinary Differential Equations (NODE) have shown promising results by leveraging the expressive power of neural networks to model unknown dynamics. However, these approaches often suffer from limited labeled training data, leading to poor generalization and suboptimal predictions. On the other hand, semi-supervised algorithms can utilize abundant unlabeled data and have demonstrated good performance in classification and regression tasks. We propose TS-NODE, the first semi-supervised approach to modeling dynamical systems with NODE. TS-NODE explores cheaply generated synthetic pseudo rollouts to broaden exploration of the state space and to tackle the challenges brought by the lack of ground-truth system data under a teacher-student model. TS-NODE employs a unified optimization framework that corrects the teacher model based on the student's feedback while mitigating the potential false system dynamics present in pseudo rollouts. TS-NODE demonstrates significant performance improvements over a baseline Neural ODE model on multiple dynamical system modeling tasks.



Paperid:1740
Authors:Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, Yu Qiao
Shanghai Jiao Tong University Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory SenseTime Group LTD, Shanghai Artificial Intelligence Laboratory
Abstract:
Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of future trajectory distributions. A fundamental challenge arises from the inconsistency between the sampled returns within individual trajectories and the expected returns across multiple trajectories. Fortunately, value-based methods offer a solution by leveraging a value function to approximate the expected returns, thereby addressing the inconsistency effectively. Building upon these insights, we propose a novel approach, termed the Critic-Guided Decision Transformer (CGDT), which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, known as the critic, CGDT ensures a direct alignment between the specified target returns and the expected returns of actions. This integration bridges the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations on stochastic environments and D4RL benchmark datasets demonstrate the superiority of CGDT over traditional RCSL methods. These results highlight the potential of CGDT to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks.



Paperid:1741
Authors:Yucheng Wang, Yuecong Xu, Jianfei Yang, Min Wu, Xiaoli Li, Lihua Xie, Zhenghua Chen
Institute for Infocomm Research, A*STAR Nanyang Technological University, Institute for Infocomm Research, A*STAR, Nanyang Technological University, Institute for Infocomm Research, A*STAR, Institute for Infocomm Research, A*STAR Nanyang Technological University Centre for Frontier AI Research, A*STAR, Nanyang Technological University, Institute for Infocomm Research, A*STAR Centre for Frontier AI Research, A*STAR
Abstract:
Multivariate Time-Series (MTS) data is crucial in various application fields. With its sequential and multi-source (multiple sensors) properties, MTS data inherently exhibits Spatial-Temporal (ST) dependencies, involving temporal correlations between timestamps and spatial correlations between sensors at each timestamp. To effectively leverage this information, Graph Neural Network-based methods (GNNs) have been widely adopted. However, existing approaches capture spatial dependency and temporal dependency separately and fail to capture the correlations between Different sEnsors at Different Timestamps (DEDT). Overlooking such correlations hinders the comprehensive modelling of ST dependencies within MTS data, thus restricting existing GNNs from learning effective representations. To address this limitation, we propose a novel method called Fully-Connected Spatial-Temporal Graph Neural Network (FC-STGNN), including two key components, namely FC graph construction and FC graph convolution. For graph construction, we design a decay graph to connect sensors across all timestamps based on their temporal distances, enabling us to fully model the ST dependencies by considering the correlations between DEDT. Further, we devise FC graph convolution with a moving-pooling GNN layer to effectively capture the ST dependencies for learning effective representations. Extensive experiments show the effectiveness of FC-STGNN on multiple MTS datasets compared to SOTA methods. The code is available at https://github.com/Frank-Wang-oss/FCSTGNN.
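
A rough sketch of the decay-graph idea, assuming an exponential decay over temporal distance; the decay form and rate are illustrative choices, not the paper's.

```python
import numpy as np

# Illustrative sketch of a fully-connected spatial-temporal "decay graph":
# every (sensor, timestamp) node is connected to every other one, and the edge
# weight decays with the temporal distance between the two timestamps.
num_sensors, num_steps, decay_rate = 4, 6, 0.5
N = num_sensors * num_steps                      # one node per sensor per timestamp

adj = np.ones((N, N))
for p in range(N):
    for q in range(N):
        t_p, t_q = p // num_sensors, q // num_sensors
        adj[p, q] = np.exp(-decay_rate * abs(t_p - t_q))   # temporal decay

# Row-normalize so the matrix can be used for graph convolution / aggregation.
adj /= adj.sum(axis=1, keepdims=True)
```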



Paperid:1742
Authors:Yucheng Wang, Yuecong Xu, Jianfei Yang, Min Wu, Xiaoli Li, Lihua Xie, Zhenghua Chen
Institute for Infocomm Research, A*STAR Nanyang Technological University, Institute for Infocomm Research, A*STAR, Nanyang Technological University, Institute for Infocomm Research, A*STAR, Institute for Infocomm Research, A*STAR Nanyang Technological University Centre for Frontier AI Research, A*STAR, Nanyang Technological University, Institute for Infocomm Research, A*STAR Centre for Frontier AI Research, A*STAR
Abstract:
Contrastive learning, as a self-supervised learning paradigm, has become popular for Multivariate Time-Series (MTS) classification. It ensures consistency across different views of unlabeled samples and then learns effective representations for these samples. Existing contrastive learning methods mainly focus on achieving temporal consistency with temporal augmentation and contrasting techniques, aiming to preserve temporal patterns against perturbations for MTS data. However, they overlook spatial consistency, which requires the stability of individual sensors and their correlations. As MTS data typically originate from multiple sensors, ensuring spatial consistency becomes essential for the overall performance of contrastive learning on MTS data. Thus, we propose Graph-Aware Contrasting for spatial consistency across MTS data. Specifically, we propose graph augmentations, including node and edge augmentations, to preserve the stability of sensors and their correlations, followed by graph contrasting with both node- and graph-level contrasting to extract robust sensor- and global-level features. We further introduce multi-window temporal contrasting to ensure temporal consistency in the data for each sensor. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on various MTS classification tasks. The code is available at https://github.com/Frank-Wang-oss/TS-GAC.
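
The node and edge augmentations can be illustrated with a small, hypothetical sketch; the masking and dropping rates below are placeholder choices rather than the paper's settings.

```python
import torch

# Illustrative sketch of graph augmentations for contrastive views: a node
# augmentation masks sensor features, an edge augmentation perturbs sensor
# correlations by dropping edges.
def node_augment(features, mask_rate=0.1):
    # features: (num_sensors, feature_dim); randomly zero out some sensors' features
    mask = (torch.rand(features.shape[0], 1) > mask_rate).float()
    return features * mask

def edge_augment(adj, drop_rate=0.1):
    # adj: (num_sensors, num_sensors); randomly drop a fraction of the edges
    keep = (torch.rand_like(adj) > drop_rate).float()
    return adj * keep

features = torch.randn(8, 16)       # 8 sensors, 16-dim features per sensor
adj = torch.rand(8, 8)              # soft sensor-correlation graph
view1 = (node_augment(features), edge_augment(adj))
view2 = (node_augment(features), edge_augment(adj))  # second view for contrasting
```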



Paperid:1743
Authors:Yulong Wang
College of Informatics, Huazhong Agricultural University, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, China Key Laboratory of Smart Farming Technology for Agricultural Animals, Ministry of Agriculture and Rural Affairs, China
Abstract:
This paper proposes a unified Superposed Atomic Representation (SAR) framework for high-dimensional data recovery with multiple low-dimensional structures. The data can be in various forms, ranging from vectors to tensors. The goal of SAR is to recover different components from their sum, where each component has a low-dimensional structure, such as sparsity, low-rankness, or lying in a low-dimensional subspace. Examples of SAR include, but are not limited to, Robust Sparse Representation (RSR), Robust Principal Component Analysis (RPCA), Tensor RPCA (TRPCA), and Outlier Pursuit (OP). We establish a theoretical guarantee for SAR. To further improve SAR, we also develop a Weighted SAR (WSAR) framework that pays more attention to, and penalizes less, the significant atoms of each component. An effective optimization algorithm is devised for WSAR, and the convergence of the algorithm is rigorously proved. By leveraging WSAR as a general platform, several new methods are proposed for high-dimensional data recovery. Experiments on real data demonstrate the superiority of WSAR for various data recovery problems.



Paperid:1744
Authors:Yunpeng Wang, Meng Pang, Shengbo Chen, Hong Rao
Nanchang University, Nanchang University, Henan University, Nanchang University
Abstract:
For generative learning tasks, there are three crucial criteria for generating samples from a model: quality, coverage/diversity, and sampling speed. Among existing generative models, generative adversarial networks (GANs) and diffusion models demonstrate outstanding quality while suffering from notable limitations. GANs can generate high-quality results and enable fast sampling; their drawback, however, lies in the limited diversity of the generated samples. Diffusion models, on the other hand, excel at generating high-quality results with commendable diversity. Yet, their iterative generation process requires hundreds to thousands of sampling steps, leading to speeds that are impractical for real-time scenarios. To address this problem, this paper proposes a novel Consistency-GAN model. In particular, to aid the training of the GAN, we introduce instance noise produced by consistency models, which require only a few steps compared to the conventional diffusion process. Our evaluations on various datasets indicate that our approach significantly accelerates sampling compared to traditional diffusion models while preserving sample quality and diversity. Furthermore, our approach also achieves better coverage than traditional adversarial training methods.



Paperid:1745
Authors:Zekai Wang, Zhengyu Zhou, Weiwei Liu
School of Computer Science, Wuhan University, School of Computer Science, Wuhan University, School of Computer Science, Wuhan University
Abstract:
Randomized smoothing (RS) has provided state-of-the-art (SOTA) certified robustness against adversarial perturbations for large neural networks. Among studies in this field, methods based on adversarial training (AT) achieve remarkably robust performance by applying adversarial examples to construct the smoothed classifier. These AT-based RS methods typically seek a pointwise adversary that generates worst-case adversarial examples by perturbing each input independently. However, there are unexplored benefits in considering such adversarial robustness across the entire data distribution. To this end, we provide a novel framework called DRF, which connects AT-based RS methods with distributional robustness (DR) and shows that these methods are special cases of their counterparts in our framework. Owing to the advantages conferred by DR, our framework can control the trade-off between the clean accuracy and certified robustness of smoothed classifiers to a significant extent. Our experiments demonstrate that DRF can substantially improve the certified robustness of AT-based RS.
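
For background, the following sketch shows the standard randomized smoothing prediction rule that AT-based RS methods build on (a Monte Carlo majority vote under Gaussian noise); it is not an implementation of DRF itself, and the base model and noise level are placeholders.

```python
import torch

# Standard randomized smoothing prediction: the smoothed classifier returns the
# class most frequently predicted by the base model under Gaussian input noise.
def smoothed_predict(base_model, x, sigma=0.25, num_samples=100, num_classes=10):
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for _ in range(num_samples):
            noisy = x + sigma * torch.randn_like(x)
            counts[base_model(noisy).argmax(dim=-1)] += 1
    return counts.argmax().item()          # majority vote over noisy copies

base_model = torch.nn.Sequential(torch.nn.Flatten(0), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.randn(3, 32, 32)
prediction = smoothed_predict(base_model, x)
```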



Paperid:1746
Authors:Zhenyu Wang, Hao Luo, Xuemei Xie, Fan Wang, Guangming Shi
Hangzhou Institute of Technology, Xidian University, DAMO Academy, Alibaba group Hupan Lab, Guangzhou Institute of Technology, Xidian University Pazhou Lab, DAMO Academy, Alibaba group, Guangzhou Institute of Technology, Xidian University
Abstract:
As a compression method that can significantly reduce computation and memory costs, model binarization has been extensively studied for convolutional neural networks. However, the recently popular vision transformer models pose new challenges to this technique, as the binarized models suffer from serious performance drops. In this paper, an attention shifting is observed in the binary multi-head self-attention module, which can disturb the information fusion between tokens and thus hurts model performance. From the perspective of information theory, we find a correlation between attention scores and information quantity, further indicating that a reason for this phenomenon may be the loss of information quantity induced by the constant moduli of binarized tokens. Finally, we reveal the information quantity hidden in the attention maps of binary vision transformers and propose a simple approach that modifies the attention values with look-up information tables to improve model performance. Extensive experiments on CIFAR-100/TinyImageNet/ImageNet-1k demonstrate the effectiveness of the proposed information-modified attention on binary vision transformers.



Paperid:1747
Authors:Zhiwei Wang, Huazheng Wang, Hongning Wang
Tsinghua University, Oregon State University, Tsinghua University
Abstract:
Adversarial attacks against stochastic multi-armed bandit (MAB) algorithms have been extensively studied in the literature. In this work, we focus on reward poisoning attacks and find that most existing attacks can be easily detected by our proposed detection method based on a test of homogeneity, due to their aggressive nature in reward manipulation. This motivates us to study the notion of stealthy attacks against stochastic MABs and investigate the resulting attackability. Our analysis shows that, against two popularly employed MAB algorithms, UCB1 and $\epsilon$-greedy, the success of a stealthy attack depends on the environmental conditions and the realized reward of the arm pulled in the first round. We also analyze the situation for general MAB algorithms equipped with our attack detection method and find that it is possible to have a stealthy attack that almost always succeeds. This brings new insights into the security risks of MAB algorithms.



Paperid:1748
Authors:Zizhao Wang, Caroline Wang, Xuesu Xiao, Yuke Zhu, Peter Stone
the University of Texas at Austin, the University of Texas at Austin, George Mason University, the University of Texas at Austin, the University of Texas at Austin Sony AI
Abstract:
Two desiderata of reinforcement learning (RL) algorithms are the ability to learn from relatively little experience and the ability to learn policies that generalize to a range of problem specifications. In factored state spaces, one approach toward achieving both goals is to learn state abstractions that keep only the variables necessary for the tasks at hand. This paper introduces Causal Bisimulation Modeling (CBM), a method that learns the causal relationships in the dynamics and reward functions of each task to derive a minimal, task-specific abstraction. CBM leverages and improves implicit modeling to train a high-fidelity causal dynamics model that can be reused for all tasks in the same environment. Empirical validation on two manipulation environments and four tasks reveals that CBM's learned implicit dynamics models identify the underlying causal relationships and state abstractions more accurately than explicit ones. Furthermore, the derived state abstractions allow a task learner to achieve near-oracle levels of sample efficiency and outperform baselines on all tasks.



Paperid:1749
Authors:Tong Wei, Bo-Lin Wang, Min-Ling Zhang
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China
Abstract:
Despite recent advancements in out-of-distribution (OOD) detection, most current studies assume a class-balanced in-distribution training dataset, which is rarely the case in real-world scenarios. This paper addresses the challenging task of long-tailed OOD detection, where the in-distribution data follow a long-tailed class distribution. The main difficulty lies in distinguishing OOD data from samples belonging to the tail classes, as the ability of a classifier to detect OOD instances is not strongly correlated with its accuracy on the in-distribution classes. To overcome this issue, we propose two simple ideas: (1) Expanding the in-distribution class space by introducing multiple abstention classes. This approach allows us to build a detector with clear decision boundaries by training on OOD data with virtual labels. (2) Augmenting the context-limited tail classes by overlaying images onto the context-rich OOD data. This technique encourages the model to pay more attention to the discriminative features of the tail classes. We also provide a clue for separating in-distribution and OOD data by analyzing gradient noise. Through extensive experiments, we demonstrate that our method outperforms the current state-of-the-art on various benchmark datasets. Moreover, our method can be used as an add-on to existing long-tail learning approaches, significantly enhancing their OOD detection performance. Code is available at: https://github.com/Stomach-ache/Long-Tailed-OOD-Detection.



Paperid:1750
Authors:Xiang Wei, Alan J.X. Guo, Sihan Sun, Mengyi Wei, Wei Yu
Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Tianjin, 300072, China, Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Tianjin, 300072, China, Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Tianjin, 300072, China, Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Tianjin, 300072, China, China Mobile Research Institute, No. 32, Xuanwumen West Street, Beijing, 100053, China
Abstract:
Efficient computation or approximation of Levenshtein distance, a widely used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, Poisson regression is introduced by assuming that the Levenshtein distance between sequences of fixed length follows a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log-likelihood of the chi-squared distribution and helps remove the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
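
A minimal sketch of the Poisson-regression training objective, assuming a simple Siamese encoder and using the embedding distance as the Poisson rate; the encoder architecture and distance choice are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Sketch: a Siamese encoder maps fixed-length sequences to embeddings, and the
# distance between two embeddings is treated as the rate of a Poisson
# distribution whose observed count is the Levenshtein distance.
class SeqEncoder(nn.Module):
    def __init__(self, alphabet=4, length=32, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(alphabet * length, 64),
                                 nn.ReLU(), nn.Linear(64, dim))
    def forward(self, x):                    # x: one-hot (batch, length, alphabet)
        return self.net(x)

encoder = SeqEncoder()
poisson_nll = nn.PoissonNLLLoss(log_input=False)   # rate given directly, not log-rate

def training_step(x1, x2, lev_dist):
    # Embedding distance plays the role of the Poisson rate lambda.
    rate = torch.norm(encoder(x1) - encoder(x2), dim=1) + 1e-6
    return poisson_nll(rate, lev_dist.float())

# Example batch: random one-hot sequences with (precomputed) Levenshtein distances.
x1 = torch.eye(4)[torch.randint(0, 4, (8, 32))]
x2 = torch.eye(4)[torch.randint(0, 4, (8, 32))]
lev = torch.randint(0, 10, (8,))
loss = training_step(x1, x2, lev)
loss.backward()
```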



Paperid:1751
Authors:Yikang Wei, Yahong Han
Tianjin University, Tianjin University
Abstract:
Federated Domain Generalization aims to learn a domain-invariant model from multiple decentralized source domains for deployment on unseen target domains. Due to privacy concerns, the data from different source domains are kept isolated, which poses challenges in bridging the domain gap. To address this issue, we propose a Multi-source Collaborative Gradient Discrepancy Minimization (MCGDM) method for federated domain generalization. Specifically, we propose intra-domain gradient matching between the original images and augmented images to avoid overfitting the domain-specific information within isolated domains. Additionally, we propose inter-domain gradient matching with the collaboration of other domains, which can further reduce the domain shift across decentralized domains. Combining intra-domain and inter-domain gradient matching, our method enables the learned model to generalize well to unseen domains. Furthermore, our method can be extended to the federated domain adaptation task by fine-tuning the target model on the pseudo-labeled target domain. Extensive experiments on federated domain generalization and adaptation indicate that our method outperforms the state-of-the-art methods significantly.
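
A hedged sketch of gradient matching between two views of the data (e.g., original vs. augmented images within a domain); the cosine-based discrepancy and the tiny model are illustrative stand-ins, not necessarily the exact discrepancy measure used by MCGDM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of gradient-discrepancy minimization: encourage the gradients induced
# by two views of the same batch to point in the same direction.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()

def flat_grad(loss):
    grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def gradient_matching_loss(x_orig, x_aug, y):
    g1 = flat_grad(criterion(model(x_orig), y))
    g2 = flat_grad(criterion(model(x_aug), y))
    return 1.0 - F.cosine_similarity(g1, g2, dim=0)   # small when gradients agree

x = torch.randn(8, 3, 32, 32)
x_aug = x + 0.1 * torch.randn_like(x)                  # stand-in for augmentation
y = torch.randint(0, 10, (8,))
total = criterion(model(x), y) + gradient_matching_loss(x, x_aug, y)
total.backward()
```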



Paperid:1752
Authors:Zimian Wei, Peijie Dong, Zheng Hui, Anggeng Li, Lujun Li, Menglong Lu, Hengyue Pan, Dongsheng Li
National University of Defense Technology, The Hong Kong University of Science and Technology (Guangzhou), Columbia University, Huawei Technologies Ltd., The Hong Kong University of Science and Technology, National University of Defense Technology, National University of Defense Technology, National University of Defense Technology
Abstract:
The substantial success of the Vision Transformer (ViT) in computer vision tasks is largely attributed to its architecture design. This underscores the necessity of efficient architecture search for designing better ViTs automatically. As training-based architecture search methods are computationally intensive, there is growing interest in training-free methods that use zero-cost proxies to score ViTs. However, existing training-free approaches require expert knowledge to manually design specific zero-cost proxies. Moreover, these zero-cost proxies have limited ability to generalize across diverse domains. In this paper, we introduce Auto-Prox, an automatic proxy discovery framework, to address this problem. First, we build ViT-Bench-101, which contains different ViT candidates and their actual performance on multiple datasets. Using ViT-Bench-101, we can evaluate zero-cost proxies based on their score-accuracy correlation. We then represent zero-cost proxies with computation graphs and organize the zero-cost proxy search space with ViT statistics and primitive operations. To discover generic zero-cost proxies, we propose a joint correlation metric to evolve and mutate different zero-cost proxy candidates. For search efficiency, we introduce an elitism-preserving strategy to achieve a better trade-off between exploitation and exploration. Based on the discovered zero-cost proxy, we conduct ViT architecture search in a training-free manner. Extensive experiments demonstrate that our method generalizes well to different datasets and achieves state-of-the-art results in both ranking correlation and final accuracy. Code can be found at https://github.com/lilujunai/Auto-Prox-AAAI24.



Paperid:1753
Authors:Lingfeng Wen, Xuan Tang, Mingjie Ouyang, Xiangxiang Shen, Jian Yang, Daxin Zhu, Mingsong Chen, Xian Wei
MoE Engineering Research Center of Hardware/Software Co-design Technology and Application, East China Normal University, School of Communication and Electronic Engineering, East China Normal University, MoE Engineering Research Center of Hardware/Software Co-design Technology and Application, East China Normal University, MoE Engineering Research Center of Hardware/Software Co-design Technology and Application, East China Normal University, School of Geospatial Information, Information Engineering University, China, Quanzhou Normal University, MoE Engineering Research Center of Hardware/Software Co-design Technology and Application, East China Normal University, MoE Engineering Research Center of Hardware/Software Co-design Technology and Application, East China Normal University
Abstract:
Diffusion generative models (DMs) have achieved promising results in image and graph generation. However, real-world graphs, such as social networks, molecular graphs, and traffic graphs, generally have non-Euclidean topologies and hidden hierarchies. For example, the degree distributions of such graphs mostly follow power laws. Current latent diffusion models embed hierarchical data in Euclidean space, which leads to distortions and interferes with modeling the distribution. Instead, hyperbolic space has been found to be more suitable for capturing complex hierarchical structures due to its exponential growth property. To simultaneously exploit the data generation capability of diffusion models and the ability of hyperbolic embeddings to capture latent hierarchical distributions, we propose a novel graph generation method called the Hyperbolic Graph Diffusion Model (HGDM), which consists of an auto-encoder that encodes nodes into successive hyperbolic embeddings and a DM that operates in the hyperbolic latent space. HGDM captures the crucial graph structure distributions by constructing a hyperbolic potential node space that incorporates edge information. Extensive experiments show that HGDM achieves better performance on generic graph and molecule generation benchmarks, with a 48% improvement in the quality of graph generation for graphs with highly hierarchical structures.



Paperid:1754
Authors:Ming Wen, Chengchang Liu, Yuedong Xu
Fudan University, The Chinese University of Hong Kong, Fudan University
Abstract:
Distributed optimization on resource-constrained devices demands both communication efficiency and fast convergence. Newton-type methods are becoming preferable due to their superior convergence rates compared to first-order methods. In this paper, we study a new problem of second-order distributed optimization over unreliable networks. The working devices are power-limited or operate in unfavorable wireless channels, experiencing packet losses during their uplink transmission to the server. This scenario is very common in the real world and destabilizes classical distributed optimization methods, especially second-order methods, because of their sensitivity to imprecision in the local Hessian matrices. To achieve robustness to high packet loss, communication efficiency, and fast convergence, we propose a novel distributed second-order method called RED-New (Packet loss Resilient Distributed Approximate Newton). Each iteration of RED-New comprises two rounds of lightweight and lossy transmissions, in which the server aggregates the local information with a newly developed scaling strategy. We prove the linear-quadratic convergence rate of RED-New. Experimental results demonstrate its advantage over first-order and second-order baselines, and its tolerance to packet loss rates ranging from 5% to 40%.



Paperid:1755
Authors:Zichen Wen, Yawen Ling, Yazhou Ren, Tianyi Wu, Jianpeng Chen, Xiaorong Pu, Zhifeng Hao, Lifang He
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China, Department of Computer Science, Virginia Tech, Blacksburg, VA, USA, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China, College of Science, Shantou University, Shantou, China, Department Computer Science and Engineering, Lehigh University, Bethlehem, PA, USA
Abstract:
Recently, there has been a growing focus on graph data, and multi-view graph clustering has become a popular area of research. Most existing methods are only applicable to homophilous graphs, yet extensive real-world graph data can hardly fulfill the homophily assumption, under which connected nodes tend to belong to the same class. Several studies have pointed out that the poor performance on heterophilous graphs is actually due to the fact that conventional graph neural networks (GNNs), which are essentially low-pass filters, discard information other than the low-frequency information on the graph. Nevertheless, on certain graphs, particularly heterophilous ones, neglecting high-frequency information and focusing solely on low-frequency information impedes the learning of node representations. To break this limitation, our motivation is to perform graph filtering that is closely related to the homophily degree of the given graph, with the aim of fully leveraging both low-frequency and high-frequency signals to learn distinguishable node embeddings. In this work, we propose an Adaptive Hybrid Graph Filter for Multi-View Graph Clustering (AHGFC). Specifically, a graph joint process and a graph joint aggregation matrix are first designed using the intrinsic node features and adjacency relationships, which makes the low- and high-frequency signals on the graph more distinguishable. We then design an adaptive hybrid graph filter related to the homophily degree, which learns node embeddings based on the graph joint aggregation matrix. After that, the node embedding of each view is weighted and fused into a consensus embedding for the downstream task. Experimental results show that our proposed model performs well on six datasets containing homophilous and heterophilous graphs.
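
A toy sketch of mixing low- and high-frequency graph filters according to an (estimated) homophily degree; the filter forms and the way the mixing weight is set are assumptions for illustration only.

```python
import numpy as np

# Mix a smoothing (low-pass) filter with a sharpening (high-pass) filter,
# leaning toward high-pass when the graph is more heterophilous.
def normalized_adjacency(adj):
    adj_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    d = adj_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return d_inv_sqrt @ adj_hat @ d_inv_sqrt

def hybrid_filter(adj, features, homophily):
    a_norm = normalized_adjacency(adj)
    low_pass = a_norm @ features                         # smooths over neighbors
    high_pass = features - a_norm @ features             # emphasizes differences
    alpha = 1.0 - homophily                               # more high-pass when heterophilous
    return (1.0 - alpha) * low_pass + alpha * high_pass

adj = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
features = np.random.randn(4, 8)
embedding = hybrid_filter(adj, features, homophily=0.3)
```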



Paperid:1756
Authors:Luisa Werner, Nabil Layaïda, Pierre Genevès, Jérôme Euzenat, Damien Graux
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG F-38000 Grenoble, France, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG F-38000 Grenoble, France, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG F-38000 Grenoble, France, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG F-38000 Grenoble, France, ADAPT SFI Research Centre, Trinity College Dublin, Ireland.
Abstract:
Reproducibility is a desirable property of scientific research. On the one hand, it increases confidence in results. On the other hand, reproducible results can be extended on a solid basis. In rapidly developing fields such as machine learning, the latter is particularly important to ensure the reliability of research. In this paper, we present a systematic approach to reproducing (using the available implementation), replicating (using an alternative implementation), and reevaluating (using different datasets) state-of-the-art experiments. This approach enables the early detection and correction of deficiencies and thus the development of more robust and transparent machine learning methods. We detail the independent reproduction, replication, and reevaluation of the initially published experiments with a method that we want to extend. For each step, we identify issues and draw lessons learned. We further discuss solutions that have proven effective in overcoming the encountered problems. This work can serve as a guide for further reproducibility studies and generally improve reproducibility in machine learning.



Paperid:1757
Authors:Jonathan Wilton, Nan Ye
The University of Queensland, The University of Queensland
Abstract:
We consider training decision trees with noisily labeled data, focusing on loss functions that can lead to robust learning algorithms. Our contributions are threefold. First, we offer novel theoretical insights on the robustness of many existing loss functions in the context of decision tree learning. We show that some of the losses belong to a class we call conservative losses, and that conservative losses lead to early-stopping behavior during training and noise-tolerant predictions during testing. Second, we introduce a framework for constructing robust loss functions, called distribution losses. These losses apply percentile-based penalties based on an assumed margin distribution, and they naturally allow adapting to different noise rates via a robustness parameter. In particular, we introduce a new loss called the negative exponential loss, which leads to an efficient greedy impurity-reduction learning algorithm. Lastly, our experiments on multiple datasets and noise settings validate our theoretical insights and the effectiveness of the adaptive negative exponential loss.
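
A hedged sketch of the distribution-loss recipe, using an exponential distribution as the assumed margin law so that the penalty is a tail probability controlled by a rate (robustness) parameter; this is one plausible instantiation of the general idea, not necessarily the paper's exact negative exponential loss.

```python
import numpy as np

# Percentile-based penalty under an assumed Exponential(rate) margin distribution:
# the penalty for margin m is the tail probability P(M > m), which is close to 1
# for small/negative margins and decays smoothly for confident examples.
def distribution_loss(margins, rate=1.0):
    margins = np.asarray(margins, dtype=float)
    return np.where(margins > 0, np.exp(-rate * margins), 1.0)

# Larger rate -> penalties fall off faster, so hard (possibly mislabeled)
# examples exert less influence; smaller rate behaves less conservatively.
print(distribution_loss([-1.0, 0.0, 0.5, 2.0], rate=2.0))
```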



Paperid:1758
Authors:Di Wu, Yuling Jiao, Li Shen, Haizhao Yang, Xiliang Lu
Wuhan University, Wuhan University, JD Explore Academy, University of Maryland, College Park, Wuhan University
Abstract:
Deep reinforcement learning (RL) has shown remarkable success in specific offline decision-making scenarios, yet its theoretical guarantees are still under development. Existing works on offline RL theory primarily emphasize a few trivial settings, such as linear MDPs or general function approximation with strong assumptions and independent data, which offer little guidance for practical use. The coupling of deep learning and Bellman residuals makes this problem challenging, in addition to the difficulty of data dependence. In this paper, we establish a non-asymptotic estimation error bound for pessimistic offline RL using general neural network approximation with C-mixing data, in terms of the structure of the networks, the dimension of the datasets, and the concentrability of the data coverage, under mild assumptions. Our result shows that the estimation error consists of two parts: the first converges to zero at a desired rate in the sample size with partially controllable concentrability, and the second becomes negligible if the residual constraint is tight. This result demonstrates the explicit efficiency of deep adversarial offline RL frameworks. We utilize empirical process tools for C-mixing sequences and neural network approximation theory for the Hölder class to achieve this. We also develop methods to bound the Bellman estimation error caused by function approximation with empirical Bellman constraint perturbations. Additionally, we present a result that lessens the curse of dimensionality by using data with low intrinsic dimensionality and function classes with low complexity. Our estimation provides valuable insights into the development of deep offline RL and guidance for algorithm and model design.



Paperid:1759
Authors:Dong Wu, Mingmin Chi, Xuan Zang, Bo Peng
Fudan University, Fudan University Zhongshan PoolNet Technology Co., Ltd, Fudan University, Shanghai Ocean University
Abstract:
With the increasing customization of spectrometers, spectral unmixing has become a widely used technique in fields such as remote sensing, textiles, and environmental protection. However, endmember variability is a common issue for unmixing: changes in lighting, atmospheric or temporal conditions, or the intrinsic spectral characteristics of materials can all result in variations in the measured spectrum. Recent studies have employed deep neural networks to tackle endmember variability. However, these approaches rely on generic networks to implicitly resolve the issue, struggling with the ill-posed nature of the problem and the lack of effective convergence constraints for endmember variability. This paper proposes a streamlined multi-task learning model to rectify this problem, incorporating abundance regression and multi-label classification with Unmixing as a Bayesian Inverse Problem, denoted as BIPU. To address the ill-posed nature, the uncertainty of unmixing is quantified and minimized through the Laplace approximation in a Bayesian inverse solver. In addition, to improve convergence under the influence of endmember variability, the paper introduces two types of constraints. The first separates background factors of variants from the initial factors for each endmember, while the second identifies and eliminates the influence of non-existent endmembers via multi-label classification during convergence. The effectiveness of this model is demonstrated not only on a self-collected near-infrared spectral textile dataset (FENIR), but also on three commonly used remote sensing hyperspectral image datasets, where it achieves state-of-the-art unmixing performance and exhibits strong generalization capabilities.



Paperid:1760
Authors:Dong-Dong Wu, Deng-Bao Wang, Min-Ling Zhang
Southeast University, Southeast University, Southeast University
Abstract:
Partial label learning (PLL) refers to the classification task where each training instance is ambiguously annotated with a set of candidate labels. Despite substantial advancements in tackling this challenge, limited attention has been devoted to a more specific and realistic setting, denoted as instance-dependent partial label learning (IDPLL). Within this context, the assignment of partial labels depends on the distinct features of individual instances rather than being random. In this paper, we initiate an exploration into a self-distillation framework for this problem, driven by the proven effectiveness and stability of this framework. Nonetheless, a crucial shortfall is identified: the foundational assumption central to IDPLL, involving what we term partial label knowledge, which stipulates that candidate labels should exhibit higher confidence than non-candidates, is not fully upheld within the distillation process. To address this challenge, we introduce DIRK, a novel distillation approach that leverages a rectification process to DIstill Reliable Knowledge while concurrently preserving informative fine-grained label confidence. In addition, to harness the rectified confidence to its fullest potential, we propose a knowledge-based representation refinement module, seamlessly integrated into the DIRK framework. This module effectively transmits the essence of similarity knowledge from the label space to the feature space, thereby amplifying representation learning and subsequently engendering marked improvements in model performance. Experiments and analysis on multiple datasets validate the rationality and superiority of our proposed approach.



Paperid:1761
Authors:Fan Wu, Rui Zhang, Qi Yi, Yunkai Gao, Jiaming Guo, Shaohui Peng, Siming Lan, Husheng Han, Yansong Pan, Kaizhao Yuan, Pengwei Jin, Ruizhi Chen, Yunji Chen, Ling Li
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China University of Chinese Academy of Sciences, UCAS, Beijing, China, SKL of Processors, Institute of Computing Technology, CAS, Beijing, China, University of Science and Technology of China, USTC, Hefei, China, University of Science and Technology of China, USTC, Hefei, China, SKL of Processors, Institute of Computing Technology, CAS, Beijing, China, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China, University of Science and Technology of China, USTC, Hefei, China, SKL of Processors, Institute of Computing Technology, CAS, Beijing, China University of Chinese Academy of Sciences, UCAS, Beijing, China, University of Chinese Academy of Sciences, UCAS, Beijing, China, University of Chinese Academy of Sciences, UCAS, Beijing, China, University of Chinese Academy of Sciences, UCAS, Beijing, China SKL of Processors, Institute of Computing Technology, CAS, Beijing, China, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China, SKL of Processors, Institute of Computing Technology, CAS, Beijing, China University of Chinese Academy of Sciences, UCAS, Beijing, China, Intelligent Software Research Center, Institute of Software, CAS, Beijing, China University of Chinese Academy of Sciences, UCAS, Beijing, China
Abstract:
Model-based offline reinforcement learning (RL) algorithms have emerged as a promising paradigm for offline RL. These algorithms usually learn a dynamics model from a static dataset of transitions, use the model to generate synthetic trajectories, and perform conservative policy optimization within these trajectories. However, our observations indicate that the policy optimization methods used in these model-based offline RL algorithms are not effective at exploring the learned model and induce biased exploration, which ultimately impairs the performance of the algorithm. To address this issue, we propose Offline Conservative ExplorAtioN (OCEAN), a novel rollout approach for model-based offline RL. In our method, we incorporate additional exploration techniques and introduce three conservative constraints based on uncertainty estimation to mitigate the potential impact of significant dynamics errors resulting from exploratory transitions. Our method is a plug-in and can be combined with classical model-based RL algorithms, such as MOPO, COMBO, and RAMBO. Experimental results on the D4RL MuJoCo benchmark show that OCEAN significantly improves the performance of existing algorithms.



Paperid:1762
Authors:Hao Wu, Yuxuan Liang, Wei Xiong, Zhengyang Zhou, Wei Huang, Shilong Wang, Kun Wang
University of Science and Technology of China, The Hong Kong University of Science and Technology (Guangzhou), Tsinghua University, University of Science and Technology of China, University of Tokyo, University of Science and Technology of China, University of Science and Technology of China
Abstract:
Efficiently modeling spatiotemporal (ST) physical processes and observations presents a challenging problem for the deep learning community. Many recent studies have concentrated on meticulously reconciling various advantages, leading to models that are neither simple nor practical. To address this issue, this paper presents a systematic study of the shortcomings of off-the-shelf models, including lack of local fidelity, poor prediction performance over long time steps, low scalability, and inefficiency. To systematically address these problems, we propose EarthFarseer, a concise framework that combines parallel local convolutions and global Fourier-based transformer architectures, enabling it to dynamically capture local-global spatial interactions and dependencies. EarthFarseer also incorporates multi-scale fully convolutional and Fourier architectures to efficiently and effectively capture the temporal evolution. Our proposal demonstrates strong adaptability across various tasks and datasets, with fast convergence and better local fidelity in long time-step predictions. Extensive experiments and visualizations over eight human-society and natural physical datasets demonstrate the state-of-the-art performance of EarthFarseer. We release our code at https://github.com/easylearningscores/EarthFarseer.



Paperid:1763
Authors:Haochen Wu, Shubham Sharma, Sunandita Patra, Sriram Gopalakrishnan
University of Michigan, J.P. Morgan AI Research, J.P. Morgan AI Research, J.P. Morgan AI Research
Abstract:
With the growing use of machine learning (ML) models in critical domains such as finance and healthcare, the need to offer recourse to those adversely affected by the decisions of ML models has become more important; individuals ought to be provided with recommendations on actions to take for improving their situation and thus receiving a favorable decision. Prior work on sequential algorithmic recourse, which recommends a series of changes, focuses on action feasibility and uses the proximity of feature changes to determine action costs. However, the uncertainties of feature changes and the risk of higher-than-average costs in recourse have not been considered. It is undesirable if a recourse could (with some probability) result in a worse situation from which recovery requires an extremely high cost. It is essential to incorporate risks when computing and evaluating recourse. We call recourse computed with such risk considerations Safe Algorithmic Recourse (SafeAR). The objective is to empower people to choose a recourse based on their risk tolerance. In this work, we discuss and show how existing recourse desiderata can fail to capture the risk of higher costs. We present a method to compute recourse policies that consider variability in cost, connecting the algorithmic recourse literature with risk-sensitive reinforcement learning. We also adopt the measures "Value at Risk" and "Conditional Value at Risk" from the financial literature to summarize risk concisely. We apply our method to two real-world datasets and compare policies with different risk-aversion levels using risk measures and recourse desiderata (sparsity and proximity).
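
The two risk measures are standard and easy to compute from sampled recourse costs; the sketch below uses synthetic cost samples purely for illustration.

```python
import numpy as np

# Value at Risk (VaR) and Conditional Value at Risk (CVaR) from sampled costs,
# e.g., Monte Carlo rollouts of a recourse policy.
def value_at_risk(costs, alpha=0.9):
    # Cost threshold exceeded with probability at most (1 - alpha).
    return np.quantile(costs, alpha)

def conditional_value_at_risk(costs, alpha=0.9):
    # Expected cost in the worst (1 - alpha) fraction of outcomes.
    var = value_at_risk(costs, alpha)
    return costs[costs >= var].mean()

costs = np.random.gamma(shape=2.0, scale=3.0, size=10_000)   # skewed synthetic costs
print(value_at_risk(costs), conditional_value_at_risk(costs))
```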



Paperid:1764
Authors:Jing Wu, Suiyao Chen, Qi Zhao, Renat Sergazinov, Chen Li, Shengjie Liu, Chongchao Zhao, Tianpei Xie, Hanqing Guo, Cheng Ji, Daniel Cociorva, Hakan Brunzell
Amazon, Amazon, Amazon, Amazon, Amazon, Amazon, Amazon, Amazon, Amazon, Amazon, Amazon, Amazon
Abstract:
Self-supervised representation learning methods have achieved significant success in computer vision and natural language processing (NLP), where data samples exhibit explicit spatial or semantic dependencies. However, applying these methods to tabular data is challenging due to the less pronounced dependencies among data samples. In this paper, we address this limitation by introducing SwitchTab, a novel self-supervised method specifically designed to capture latent dependencies in tabular data. SwitchTab leverages an asymmetric encoder-decoder framework to decouple mutual and salient features among data pairs, resulting in more representative embeddings. These embeddings, in turn, contribute to better decision boundaries and lead to improved results in downstream tasks. To validate the effectiveness of SwitchTab, we conduct extensive experiments across various domains involving tabular data. The results show superior performance in end-to-end prediction tasks with fine-tuning. Moreover, we demonstrate that pre-trained salient embeddings can be used as plug-and-play features to enhance the performance of various traditional classification methods (e.g., Logistic Regression, XGBoost, etc.). Lastly, we highlight the capability of SwitchTab to create explainable representations through visualization of the decoupled mutual and salient features in the latent space.



Paperid:1765
Authors:Jizhou Wu, Jianye Hao, Tianpei Yang, Xiaotian Hao, Yan Zheng, Weixun Wang, Matthew E. Taylor
Tianjin University, Tianjin University, University of Alberta, Tianjin University, Tianjin University, Netease Fuxi AI Lab, University of Alberta
Abstract:
Despite many breakthroughs in recent years, it is still hard for Multi-Agent Reinforcement Learning (MARL) algorithms to directly solve complex tasks in Multi-Agent Systems (MASs) from scratch. In this work, we study how to use Automatic Curriculum Learning (ACL) to reduce the number of environmental interactions required to learn a good policy. In order to solve a difficult task, ACL methods automatically select a sequence of tasks (i.e., curricula). The idea is to obtain maximum learning progress toward the final task by continuously learning on tasks that match the current capabilities of the learners. The key question is how to measure the learning progress of the learner for better curriculum selection. We propose a novel ACL framework, PrOgRessive mulTiagent Automatic curricuLum (PORTAL), for MASs. PORTAL selects curricula according to two criteria: 1) How difficult is a task relative to the learners' current abilities? 2) How similar is a task to the final task? By learning a shared feature space between tasks, PORTAL is able to characterize different tasks based on the distribution of features and select those that are similar to the final task. The shared feature space also effectively facilitates policy transfer between curricula. Experimental results show that PORTAL can train agents to master extremely hard cooperative tasks, which cannot be achieved with previous state-of-the-art MARL algorithms.



Paperid:1766
Authors:Nannan Wu, Zhaobin Sun, Zengqiang Yan, Li Yu
School of Electronic Information and Communications, Huazhong University of Science and Technology, School of Electronic Information and Communications, Huazhong University of Science and Technology, School of Electronic Information and Communications, Huazhong University of Science and Technology, School of Electronic Information and Communications, Huazhong University of Science and Technology
Abstract:
Federated learning (FL) has emerged as a promising paradigm for training segmentation models on decentralized medical data, owing to its privacy-preserving property. However, existing research overlooks the prevalent annotation noise encountered in real-world medical datasets, which limits the performance ceiling of FL. In this paper, we identify and tackle this problem for the first time. For problem formulation, we propose a contour evolution for modeling non-independent and identically distributed (Non-IID) noise across pixels within each client, and then extend it to the case of multi-source data to form a heterogeneous noise model (i.e., Non-IID annotation noise across clients). For robust learning from annotations with such two-level Non-IID noise, we emphasize the importance of data quality in model aggregation, allowing high-quality clients to have a greater impact on FL. To achieve this, we propose Federated learning with Annotation quAlity-aware AggregatIon, named FedA3I, by introducing a quality factor based on client-wise noise estimation. Specifically, noise estimation at each client is accomplished through a Gaussian mixture model and then incorporated into model aggregation in a layer-wise manner to up-weight high-quality clients. Extensive experiments on two real-world medical image segmentation datasets demonstrate the superior performance of FedA3I against state-of-the-art approaches in dealing with cross-client annotation noise. The code is available at https://github.com/wnn2000/FedAAAI.
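
A minimal sketch of quality-aware aggregation, where client updates are averaged with weights combining data size and an estimated quality factor; note that FedA3I applies the factor layer-wise, whereas this toy version uses a single weight per client, and the quality values are placeholders.

```python
import torch

# Weighted model averaging: clients with larger datasets and higher estimated
# annotation quality contribute more to the global model.
def quality_weighted_average(client_states, data_sizes, quality):
    weights = torch.tensor(data_sizes, dtype=torch.float) * torch.tensor(quality)
    weights = weights / weights.sum()
    keys = client_states[0].keys()
    return {k: sum(w * s[k] for w, s in zip(weights, client_states)) for k in keys}

# Three clients with identical model structure but different annotation quality.
client_states = [{"w": torch.randn(4, 4), "b": torch.randn(4)} for _ in range(3)]
global_state = quality_weighted_average(
    client_states, data_sizes=[100, 80, 120], quality=[0.9, 0.5, 0.8])
```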



Paperid:1767
Authors:Tingting Wu, Songhe Feng, Jiazheng Yuan
Tangshan Research Institute, Beijing Jiaotong University, Beijing, China Key Laboratory of Big Data and Artificial Intelligence in Transportation, Ministry of Education, Beijing, China, Tangshan Research Institute, Beijing Jiaotong University, Beijing, China Key Laboratory of Big Data and Artificial Intelligence in Transportation, Ministry of Education, Beijing, China, College of Science and Technology, Beijing Open University, Beijing, China
Abstract:
Incomplete Multiple Kernel Clustering algorithms aim to learn a common latent representation from pre-constructed incomplete multiple kernels of the original data, followed by k-means for clustering. They have attracted intensive attention due to their high computational efficiency. However, our observation reveals that the imputation of these approaches for each kernel ignores the influence of other incomplete kernels. In light of this, we present a novel method called Low-Rank Kernel Tensor Learning for Incomplete Multiple Views Clustering (LRKT-IMVC) to address the above issue. Specifically, LRKT-IMVC first introduces the concept of kernel tensor to explore the inter-view correlations, and then the low-rank kernel tensor constraint is used to further capture the consistency information to impute missing kernel elements, thereby improving the quality of clustering. Moreover, we carefully design an alternative optimization method with promising convergence to solve the resulting optimization problem. The proposed method is compared with recent advances in experiments with different missing ratios on seven well-known datasets, demonstrating its effectiveness and the advantages of the proposed interpolation method.



Paperid:1768
Authors:Yanan Wu, Zhixiang Chi, Yang Wang, Konstantinos N. Plataniotis, Songhe Feng
Key Laboratory of Big Data & Artificial Intelligence in Transportation, Ministry of Education, Beijing Jiaotong University, Beijing, 100044, China School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, China, The Edward S Rogers Sr. ECE Department, University of Toronto, Toronto, M5S3G8, Canada, Department of Computer Science and Software Engineering, Concordia University, Montreal, H3G2J1, Canada, The Edward S Rogers Sr. ECE Department, University of Toronto, Toronto, M5S3G8, Canada, Key Laboratory of Big Data & Artificial Intelligence in Transportation, Ministry of Education, Beijing Jiaotong University, Beijing, 100044, China School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, China
Abstract:
Test-time domain adaptation aims to adapt the model trained on source domains to unseen target domains using a few unlabeled images. Emerging research has shown that the label and domain information is separately embedded in the weight matrix and batch normalization (BN) layer. Previous works normally update the whole network naively without explicitly decoupling the knowledge between label and domain. As a result, it leads to knowledge interference and defective distribution adaptation. In this work, we propose to reduce such learning interference and elevate the domain knowledge learning by only manipulating the BN layer. However, the normalization step in BN is intrinsically unstable when the statistics are re-estimated from a few samples. We find that ambiguities can be greatly reduced when only updating the two affine parameters in BN while keeping the source domain statistics. To further enhance the domain knowledge extraction from unlabeled data, we construct an auxiliary branch with label-independent self-supervised learning (SSL) to provide supervision. Moreover, we propose a bi-level optimization based on meta-learning to enforce the alignment of two learning objectives of auxiliary and main branches. The goal is to use the auxiliary branch to adapt the domain and benefit the main task for subsequent inference. Our method keeps the same computational cost at inference as the auxiliary branch can be thoroughly discarded after adaptation. Extensive experiments show that our method outperforms the prior works on five WILDS real-world domain shift datasets. Our method can also be integrated with methods with label-dependent optimization to further push the performance boundary. Our code is available at https://github.com/ynanwu/MABN.
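The core trick of keeping the source statistics while adapting only the BN affine parameters can be illustrated with a short PyTorch sketch; this is a generic illustration of that single idea, not the authors' full method, which additionally uses an SSL auxiliary branch and meta-learned bi-level optimization.

import torch
import torch.nn as nn

# a small stand-in network; any CNN with BatchNorm layers works the same way
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

for p in model.parameters():              # freeze all weights first
    p.requires_grad_(False)

trainable = []
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.eval()                          # keep the source-domain running statistics
        m.weight.requires_grad_(True)     # adapt only the two affine parameters
        m.bias.requires_grad_(True)
        trainable += [m.weight, m.bias]

optimizer = torch.optim.SGD(trainable, lr=1e-3)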



Paperid:1769
Authors:Yanru Wu, Jianning Wang, Weida Wang, Yang Li
Tsinghua University, Harbin Institute of Technology, Shenzhen, Tsinghua University, Tsinghua University
Abstract:
Multi-source transfer learning is an effective solution to data scarcity by utilizing multiple source tasks for the learning of the target task. However, access to source data and model details is limited in the era of commercial models, giving rise to the setting of multi-source-free (MSF) transfer learning that aims to leverage source domain knowledge without such access. As a newly defined problem paradigm, MSF transfer learning remains largely underexplored and not clearly formulated. In this work, we adopt an information theoretic perspective on it and propose a framework named H-ensemble, which dynamically learns the optimal linear combination, or ensemble, of source models for the target task, using a generalization of maximal correlation regression. The ensemble weights are optimized by maximizing an information theoretic metric for transferability. Compared to previous works, H-ensemble is characterized by: 1) its adaptability to a novel and realistic MSF setting for few-shot target tasks, 2) theoretical reliability, 3) a lightweight structure easy to interpret and adapt. Our method is empirically validated by ablation studies, along with extensive comparative analysis with other task ensemble and transfer learning methods. We show that the H-ensemble can successfully learn the optimal task ensemble, as well as outperform prior art.
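A minimal PyTorch sketch of the general idea of learning a linear ensemble of frozen source models on a few-shot target set is given below; the class name and the softmax parameterization of the weights are assumptions for illustration, and the paper optimizes an information-theoretic transferability metric rather than an ordinary supervised loss.

import torch
import torch.nn as nn

class LinearEnsemble(nn.Module):
    def __init__(self, source_models):
        super().__init__()
        self.sources = nn.ModuleList(source_models)
        for p in self.sources.parameters():
            p.requires_grad_(False)                   # source models stay frozen
        self.logits = nn.Parameter(torch.zeros(len(source_models)))

    def forward(self, x):
        w = torch.softmax(self.logits, dim=0)         # learnable ensemble weights
        outs = torch.stack([m(x) for m in self.sources], dim=0)   # [K, batch, classes]
        return torch.einsum('k,kbc->bc', w, outs)     # weighted sum of source predictions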



Paperid:1770
Authors:Young Wu, Jeremy McMahan, Xiaojin Zhu, Qiaomin Xie
University of Wisconsin - Madison, University of Wisconsin - Madison, University of Wisconsin - Madison, University of Wisconsin - Madison
Abstract:
We characterize offline data poisoning attacks on Multi-Agent Reinforcement Learning (MARL), where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Markov-perfect Nash equilibrium for a two-player zero-sum Markov game. We propose the unique Nash set, namely the set of games, specified by their Q functions, with a specific joint policy being the unique Nash equilibrium. The unique Nash set is central to poisoning attacks because the attack is successful if and only if data poisoning pushes all plausible games inside it. The unique Nash set generalizes the reward polytope commonly used in inverse reinforcement learning to MARL. For zero-sum Markov games, both the unique Nash set and the set of plausible games induced by data are polytopes in the Q function space. We exhibit a linear program to efficiently compute the optimal poisoning attack. Our work sheds light on the structure of data poisoning attacks on offline MARL, a necessary step before one can design more robust MARL algorithms.



Paperid:1771
Authors:Zongqian Wu, Yujie Mo, Peng Zhou, Shangbo Yuan, Xiaofeng Zhu
School of Computer Science and Engineering, University of Electronic Science and Technology of China, School of Computer Science and Engineering, University of Electronic Science and Technology of China, College of Computer Science and Electronic Engineering, Hunan University, School of Engineering and Design, Technical University of Munich, School of Computer Science and Engineering, University of Electronic Science and Technology of China Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China
Abstract:
Self-training based few-shot node classification (FSNC) methods have shown excellent performance in real applications, but they cannot make full use of the information in the base set and are easily affected by the quality of pseudo-labels. To address these issues, this paper proposes a new self-training FSNC method by involving the representation distillation and the pseudo-label distillation. Specifically, the representation distillation includes two knowledge distillation methods (i.e., the local representation distillation and the global representation distillation) to transfer the information in the base set to the novel set. The pseudo-label distillation is designed to conduct knowledge distillation on the pseudo-labels to improve their quality. Experimental results showed that our method achieves superior performance, compared with state-of-the-art methods. Our code and a comprehensive theoretical version are available at https://github.com/zongqianwu/KD-FSNC.



Paperid:1772
Authors:Haochong Xia, Shuo Sun, Xinrun Wang, Bo An
Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University
Abstract:
Financial simulators play an important role in enhancing forecasting accuracy, managing risks, and fostering strategic financial decision-making. Despite the development of financial market simulation methodologies, existing frameworks often struggle with adapting to specialized simulation contexts. We pinpoint the challenges as i) current financial datasets do not contain context labels; ii) current techniques are not designed to generate financial data with context as control, which demands greater precision compared to other modalities; iii) the inherent difficulties in generating context-aligned, high-fidelity data given the non-stationary, noisy nature of financial data. To address these challenges, our contributions are: i) we propose the Contextual Market Dataset with market dynamics, stock ticker, and history state as context, leveraging a market dynamics modeling method that combines linear regression and clustering to extract market dynamics; ii) we present Market-GAN, a novel architecture incorporating a Generative Adversarial Network (GAN) for the controllable generation with context, an autoencoder for learning low-dimension features, and supervisors for knowledge transfer; iii) we introduce a two-stage training scheme to ensure that Market-GAN captures the intrinsic market distribution with multiple objectives. In the pretraining stage, with the use of the autoencoder and supervisors, we prepare the generator with a better initialization for the adversarial training stage. We propose a set of holistic evaluation metrics that consider alignment, fidelity, data usability on downstream tasks, and market facts. We evaluate Market-GAN with the Dow Jones Industrial Average data from 2000 to 2023 and showcase superior performance in comparison to 4 state-of-the-art time-series generative models.



Paperid:1773
Authors:Mingxuan Xia, Junbo Zhao, Gengyu Lyu, Zenan Huang, Tianlei Hu, Gang Chen, Haobo Wang
School of Software Technology, Zhejiang University State Key Laboratory of Blockchain and Data Security, Zhejiang University, State Key Laboratory of Blockchain and Data Security, Zhejiang University, Faculty of Information Technology, Beijing University of Technology, State Key Laboratory of Blockchain and Data Security, Zhejiang University, State Key Laboratory of Blockchain and Data Security, Zhejiang University, State Key Laboratory of Blockchain and Data Security, Zhejiang University, School of Software Technology, Zhejiang University State Key Laboratory of Blockchain and Data Security, Zhejiang University
Abstract:
Black-box domain adaptation (BDA) aims to learn a classifier on an unsupervised target domain while assuming only access to black-box predictors trained from unseen source data. Although a few BDA approaches have demonstrated promise by manipulating the transferred labels, they largely overlook the rich underlying structure in the target domain. To address this problem, we introduce a novel separation and alignment framework for BDA. Firstly, we locate those well-adapted samples via loss ranking and a flexible confidence-thresholding procedure. Then, we introduce a novel graph contrastive learning objective that aligns under-adapted samples to their local neighbors and well-adapted samples. Lastly, the adaptation is achieved by a nearest-centroid-augmented objective that exploits the clustering effect in the feature space. Extensive experiments demonstrate that our proposed method outperforms the best baselines on benchmark datasets, e.g., improving the averaged per-class accuracy by 4.1% on the VisDA dataset. The source code is available at: https://github.com/MingxuanXia/SEAL.



Paperid:1774
Authors:Shiyu Xia, Miaosen Zhang, Xu Yang, Ruiming Chen, Haokun Chen, Xin Geng
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Abstract:
We propose expanding the shared Transformer module to produce and initialize Transformers of varying depths, enabling adaptation to diverse resource constraints. Drawing an analogy to genetic expansibility, we term such a module a learngene. To identify the expansion mechanism, we delve into the relationship between the layer's position and its corresponding weight value, and find that a linear function appropriately approximates this relationship. Building on this insight, we present Transformer as Linear Expansion of learnGene (TLEG), a novel approach for flexibly producing and initializing Transformers of diverse depths. Specifically, to learn learngene, we first construct an auxiliary Transformer linearly expanded from learngene, after which we train it by employing soft distillation. Subsequently, we can produce and initialize Transformers of varying depths via linearly expanding the well-trained learngene, thereby supporting diverse downstream scenarios. Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance in contrast to many individual models trained from scratch, while reducing around 2× training cost. When transferring to several downstream classification datasets, TLEG surpasses existing initialization methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100). Under the situation where we need to produce models of varying depths adapting for different resource constraints, TLEG achieves comparable results while reducing around 19× parameters stored to initialize these models and around 5× pre-training costs, in contrast to the pre-training and fine-tuning approach. When transferring a fixed set of parameters to initialize different models, TLEG presents better flexibility and competitive performance while reducing around 2.9× parameters stored to initialize, compared to the pre-training approach.
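The linear expansion itself is simple to illustrate: every layer's weights are generated from two shared learngene tensors as a linear function of the layer index. The sketch below is a toy numpy illustration with hypothetical names, not the paper's training code.

import numpy as np

def expand_learngene(theta_base, theta_slope, num_layers):
    """Return per-layer weights W_l = theta_base + l * theta_slope, l = 0..L-1."""
    return [theta_base + l * theta_slope for l in range(num_layers)]

d = 8
theta_base = np.random.randn(d, d)
theta_slope = 0.01 * np.random.randn(d, d)
shallow = expand_learngene(theta_base, theta_slope, num_layers=6)
deep = expand_learngene(theta_base, theta_slope, num_layers=12)   # same learngene, deeper model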



Paperid:1775
Authors:Jingge Xiao, Leonie Basso, Wolfgang Nejdl, Niloy Ganguly, Sandipan Sikdar
L3S Research Center, Leibniz University Hannover, L3S Research Center, Leibniz University Hannover, L3S Research Center, Leibniz University Hannover, Indian Institute of Technology Kharagpur, L3S Research Center, Leibniz University Hannover
Abstract:
Continuous-time models such as Neural ODEs and Neural Flows have shown promising results in analyzing irregularly sampled time series frequently encountered in electronic health records. Based on these models, time series are typically processed with a hybrid of an initial value problem (IVP) solver and a recurrent neural network within the variational autoencoder architecture. Sequentially solving IVPs makes such models computationally less efficient. In this paper, we propose to model time series purely with continuous processes whose state evolution can be approximated directly by IVPs. This eliminates the need for recurrent computation and enables multiple states to evolve in parallel. We further fuse the encoder and decoder with one IVP solver utilizing its invertibility, which leads to fewer parameters and faster convergence. Experiments on three real-world datasets show that the proposed method can systematically outperform its predecessors, achieve state-of-the-art results, and have significant advantages in terms of data efficiency.



Paperid:1776
Authors:Tingxiong Xiao, Runzhao Yang, Yuxiao Cheng, Jinli Suo
Tsinghua University, Tsinghua University, Tsinghua University, Tsinghua University
Abstract:
Solving partial differential equations (PDEs) has been a fundamental problem in computational science, with wide applications in both scientific and engineering research. Due to its universal approximation property, the neural network is widely used to approximate the solutions of PDEs. However, existing works are incapable of solving high-order PDEs due to insufficient calculation accuracy of higher-order derivatives, and the final network is a black box without explicit explanation. To address these issues, we propose a deep learning framework to solve high-order PDEs, named SHoP. Specifically, we derive the high-order derivative rule for the neural network, to get the derivatives quickly and accurately; moreover, we expand the network into a Taylor series, providing an explicit solution for the PDEs. We conduct experimental validations on four high-order PDEs with different dimensions, showing that we can solve high-order PDEs efficiently and accurately. The source code can be found at https://github.com/HarryPotterXTX/SHoP.git.
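For readers unfamiliar with how such derivatives are usually obtained, the sketch below computes high-order derivatives of a small network with PyTorch autograd and forms a toy PDE residual; SHoP instead derives an explicit high-order derivative rule, so this is only a baseline illustration of the quantities involved.

import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1).requires_grad_(True)
u = net(x)

# repeated autograd.grad calls give first-, second- and third-order derivatives
du = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u), create_graph=True)[0]
d2u = torch.autograd.grad(du, x, grad_outputs=torch.ones_like(du), create_graph=True)[0]
d3u = torch.autograd.grad(d2u, x, grad_outputs=torch.ones_like(d2u), create_graph=True)[0]

residual = d3u + u * du          # residual of a toy third-order PDE term
loss = residual.pow(2).mean()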



Paperid:1777
Authors:Binghui Xie, Yongqiang Chen, Jiaqi Wang, Kaiwen Zhou, Bo Han, Wei Meng, James Cheng
The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, Hong Kong Baptist University, The Chinese University of Hong Kong, The Chinese University of Hong Kong
Abstract:
Domain generalization is a critical challenge for machine learning systems. Prior domain generalization methods focus on extracting domain-invariant features across several stationary domains to enable generalization to new domains. However, in non-stationary tasks where new domains evolve in an underlying continuous structure, such as time, merely extracting the invariant features is insufficient for generalization to the evolving new domains. Nevertheless, it is non-trivial to learn both evolving and invariant features within a single model due to their conflicts. To bridge this gap, we build causal models to characterize the distribution shifts concerning the two patterns, and propose to learn both dynamic and invariant features via a new framework called Mutual Information-Based Sequential Autoencoders (MISTS). MISTS adopts information theoretic constraints onto sequential autoencoders to disentangle the dynamic and invariant features, and leverages an adaptive classifier to make predictions based on both evolving and invariant information. Our experimental results on both synthetic and real-world datasets demonstrate that MISTS succeeds in capturing both evolving and invariant information, and achieves promising results in evolving domain generalization tasks.



Paperid:1778
Authors:Chenghan Xie, Chenxi Li, Chuwen Zhang, Qi Deng, Dongdong Ge, Yinyu Ye
School of Information Management and Engineering, Shanghai University of Finance and Economics School of Mathematical Sciences, Fudan University, School of Information Management and Engineering, Shanghai University of Finance and Economics, School of Information Management and Engineering, Shanghai University of Finance and Economics, School of Information Management and Engineering, Shanghai University of Finance and Economics, School of Information Management and Engineering, Shanghai University of Finance and Economics, Department of Management Science and Engineering, Stanford University
Abstract:
In many important machine learning applications, the standard assumption of having a globally Lipschitz continuous gradient may fail to hold. This paper delves into a more general (L0, L1)-smoothness setting, which gains particular significance within the realms of deep neural networks and distributionally robust optimization (DRO). We demonstrate the significant advantage of trust region methods for stochastic nonconvex optimization under such a generalized smoothness assumption. We show that first-order trust region methods can recover the normalized and clipped stochastic gradient as special cases and then provide a unified analysis to show their convergence to first-order stationary conditions. Motivated by the important application of DRO, we propose a generalized high-order smoothness condition, under which second-order trust region methods can achieve a complexity of O(epsilon^(-3.5)) for convergence to second-order stationary points. By incorporating variance reduction, the second-order trust region method obtains an even better complexity of O(epsilon^(-3)), matching the optimal bound for standard smooth optimization. To the best of our knowledge, this is the first work to show convergence beyond the first-order stationary condition for generalized smooth optimization. Preliminary experiments show that our proposed algorithms perform favorably compared with existing methods.
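As a point of reference, the normalized and clipped stochastic gradient updates that the analysis recovers as special cases look as follows in a minimal numpy sketch (generic textbook forms, not the authors' implementation):

import numpy as np

def clipped_sgd_step(w, grad, lr=0.1, clip=1.0):
    g_norm = np.linalg.norm(grad)
    scale = min(1.0, clip / (g_norm + 1e-12))     # shrink the step when the gradient is large
    return w - lr * scale * grad

def normalized_sgd_step(w, grad, lr=0.1):
    # a fixed step length lr, i.e., a fixed trust-region radius in the first-order view
    return w - lr * grad / (np.linalg.norm(grad) + 1e-12)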



Paperid:1779
Authors:Zheng Xie, Yu Liu, Ming Li
Nanjing University, Nanjing University, Nanjing University
Abstract:
Weakly supervised learning aims to make machine learning more powerful when perfect supervision is unavailable, and has attracted much attention from researchers. Among the various scenarios of weak supervision, one of the most challenging cases is learning from multiple unlabeled (U) datasets with only a little knowledge of the class priors, or U^m learning for short. In this paper, we study the problem of building an AUC (area under ROC curve) optimal model from multiple unlabeled datasets, which maximizes the pairwise ranking ability of the classifier. We propose U^m-AUC, an AUC optimization approach that converts the U^m data into a multi-label AUC optimization problem, and can be trained efficiently. We show that the proposed U^m-AUC is effective theoretically and empirically.
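The underlying AUC-style objective can be sketched as a pairwise ranking loss over (pseudo-)positive and (pseudo-)negative scores, as below in PyTorch; U^m-AUC builds on such a loss through a multi-label reduction, and the tensor names here are purely illustrative.

import torch

def pairwise_auc_loss(pos_scores, neg_scores, margin=1.0):
    # all positive/negative score pairs; squared hinge on the ranking margin
    diff = pos_scores.unsqueeze(1) - neg_scores.unsqueeze(0)
    return torch.clamp(margin - diff, min=0).pow(2).mean()

loss = pairwise_auc_loss(torch.randn(16), torch.randn(32))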



Paperid:1780
Authors:Zhitian Xie, Yinger Zhang, Chenyi Zhuang, Qitao Shi, Zhining Liu, Jinjie Gu, Guannan Zhang
Ant Group, Zhejiang University, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group
Abstract:
The application of mixture-of-experts (MoE) is gaining popularity due to its ability to improve a model's performance. In an MoE structure, the gate layer plays a significant role in distinguishing and routing input features to different experts. This enables each expert to specialize in processing their corresponding sub-tasks. However, the gate's routing mechanism also gives rise to "narrow vision": the individual expert of an MoE fails to use more samples in learning the allocated sub-task, which in turn limits the MoE from further improving its generalization ability. To effectively address this, we propose a method called Mixture-of-Distilled-Expert (MoDE), which applies moderate mutual distillation among experts to enable each expert to pick up more features learned by other experts and gain more accurate perceptions on their allocated sub-tasks. We conduct extensive experiments on tabular, NLP and CV datasets, which show MoDE's effectiveness, universality and robustness. Furthermore, we develop a parallel study by innovatively constructing "expert probing" to experimentally prove why MoDE works: moderately distilling knowledge from other experts can improve each individual expert's test performance on their assigned tasks, leading to MoE's overall performance improvement.
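A minimal PyTorch sketch of mutual distillation among experts is shown below: each expert is softly pulled toward the detached average prediction of the other experts. The exact distillation target, temperature, and weighting in MoDE may differ; this only illustrates the general mechanism, and all names are assumptions.

import torch
import torch.nn.functional as F

def mutual_distillation_loss(expert_logits, alpha=0.1, tau=2.0):
    """expert_logits: list of [batch, classes] logits, one tensor per expert."""
    loss = 0.0
    for i, logit_i in enumerate(expert_logits):
        others = [l for j, l in enumerate(expert_logits) if j != i]
        target = torch.stack(others).mean(dim=0).detach()   # average of the other experts
        loss = loss + F.kl_div(F.log_softmax(logit_i / tau, dim=-1),
                               F.softmax(target / tau, dim=-1),
                               reduction='batchmean') * (tau ** 2)
    return alpha * loss / len(expert_logits)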



Paperid:1781
Authors:Yi Xin, Junlong Du, Qiang Wang, Ke Yan, Shouhong Ding
Nanjing University Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab
Abstract:
Multi-Task Learning (MTL) is designed to train multiple correlated tasks simultaneously, thereby enhancing the performance of individual tasks. Typically, a multi-task network structure consists of a shared backbone and task-specific decoders. However, the complexity of the decoders increases with the number of tasks. To tackle this challenge, we integrate the decoder-free vision-language model CLIP, which exhibits robust zero-shot generalization capability. Recently, parameter-efficient transfer learning methods have been extensively explored with CLIP for adapting to downstream tasks, where prompt tuning showcases strong potential. Nevertheless, these methods solely fine-tune a single modality (text or visual), disrupting the modality structure of CLIP. In this paper, we first propose Multi-modal Alignment Prompt (MmAP) for CLIP, which aligns text and visual modalities during the fine-tuning process. Building upon MmAP, we develop an innovative multi-task prompt learning framework. On the one hand, to maximize the complementarity of tasks with high similarity, we utilize a gradient-driven task grouping method that partitions tasks into several disjoint groups and assigns a group-shared MmAP to each group. On the other hand, to preserve the unique characteristics of each task, we assign a task-specific MmAP to each task. Comprehensive experiments on two large multi-task learning datasets demonstrate that our method achieves significant performance improvements compared to full fine-tuning while only utilizing approximately 0.09% of trainable parameters.



Paperid:1782
Authors:Yi Xin, Junlong Du, Qiang Wang, Zhiwen Lin, Ke Yan
Nanjing University Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab, Tencent Youtu Lab
Abstract:
Large-scale pre-trained models have achieved remarkable success in various computer vision tasks. A standard approach to leverage these models is to fine-tune all model parameters for downstream tasks, which poses challenges in terms of computational and storage costs. Recently, inspired by Natural Language Processing (NLP), parameter-efficient transfer learning has been successfully applied to vision tasks. However, most existing techniques primarily focus on single-task adaptation, and despite limited research on multi-task adaptation, these methods often exhibit suboptimal training/inference efficiency. In this paper, we first propose a once-for-all Vision Multi-Task Adapter (VMT-Adapter), which achieves approximately O(1) training and inference efficiency w.r.t. the number of tasks. Concretely, VMT-Adapter shares the knowledge from multiple tasks to enhance cross-task interaction while preserving task-specific knowledge via independent knowledge extraction modules. Notably, since task-specific modules require few parameters, VMT-Adapter can handle an arbitrary number of tasks with a negligible increase of trainable parameters. We also propose VMT-Adapter-Lite, which further reduces the trainable parameters by learning shared parameters between down- and up-projections. Extensive experiments on four dense scene understanding tasks demonstrate the superiority of VMT-Adapter(-Lite), achieving a 3.96% (1.34%) relative improvement compared to single-task full fine-tuning, while utilizing merely ~1% (0.36%) trainable parameters of the pre-trained model.



Paperid:1783
Authors:Xingrun Xing, Li Du, Xinyuan Wang, Xianlin Zeng, Yequan Wang, Zheng Zhang, Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences Beijing Academy of Artificial Intelligence, Beijing Academy of Artificial Intelligence, Beihang University, Beihang University, Beijing Academy of Artificial Intelligence, Beijing Academy of Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences
Abstract:
Pre-trained foundation models offer substantial benefits for a wide range of downstream tasks, which makes them one of the most promising techniques for approaching artificial general intelligence. However, scaling up foundation transformers for maximal task-agnostic knowledge has brought about computational challenges, especially on resource-limited devices such as mobiles. This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks, which remarkably saves 56x operations and 28x memory. In contrast to previous task-specific binary transformers, BiPFT exhibits a substantial enhancement in the learning capabilities of binary neural networks (BNNs), promoting BNNs into the era of pre-training. Benefiting from extensive pretraining data, we further propose a data-driven binarization method. Specifically, we first analyze the binarization error in self-attention operations and derive the polynomials of binarization error. To simulate full-precision self-attention, we define binarization error as binarization residual polynomials, and then introduce low-rank estimators to model these polynomials. Extensive experiments validate the effectiveness of BiPFTs, surpassing the task-specific baseline by 15.4% average performance on the GLUE benchmark. BiPFT also demonstrates improved robustness to hyperparameter changes, improved optimization efficiency, and reduced reliance on downstream distillation; as a result, it generalizes to various NLU tasks and simplifies the downstream pipeline of BNNs. Our code and pretrained models are publicly available at https://github.com/Xingrun-Xing/BiPFT.
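The basic 1-bit weight representation that binary transformers rely on can be sketched in a few lines of PyTorch: weights are replaced by their sign scaled by a per-tensor factor. BiPFT's actual contribution (pre-training plus low-rank estimators of the self-attention binarization residual) goes well beyond this toy illustration.

import torch

def binarize(w):
    alpha = w.abs().mean()           # per-tensor scaling factor minimizing the L1 error
    return alpha * torch.sign(w)     # 1-bit weights, rescaled back to the original range

w = torch.randn(256, 256)
w_bin = binarize(w)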



Paperid:1784
Authors:Guojun Xiong, Gang Yan, Shiqiang Wang, Jian Li
Stony Brook University, Binghamton University, IBM T. J. Watson Research Center, Stony Brook University
Abstract:
Decentralized learning has emerged as an alternative method to the popular parameter-server framework, which suffers from a high communication burden, single-point failure and scalability issues due to the need for a central server. However, most existing works focus on a single shared model for all workers regardless of the data heterogeneity problem, causing the resulting model to perform poorly on individual workers. In this work, we propose a novel personalized decentralized learning algorithm named DePRL via shared representations. Our algorithm relies on ideas from representation learning theory to learn a low-dimensional global representation collaboratively among all workers in a fully decentralized manner, as well as a user-specific low-dimensional local head leading to a personalized solution for each worker. We show that DePRL achieves, for the first time, a provable linear speedup for convergence with general non-linear representations (i.e., the convergence rate is improved linearly with respect to the number of workers). Experimental results support our theoretical findings, showing the superiority of our method in data heterogeneous environments.
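The separation between a gossiped shared representation and a purely local head can be illustrated with a toy numpy round of decentralized averaging; the mixing matrix and variable names are assumptions for illustration, not the algorithm's exact update.

import numpy as np

def decentralized_round(reps, heads, mixing):
    """reps: [n_workers, d_rep] shared-representation params per worker.
    heads: [n_workers, d_head] personalized head params (never averaged).
    mixing: doubly-stochastic [n_workers, n_workers] gossip matrix."""
    new_reps = mixing @ reps        # neighbors' representations are averaged
    return new_reps, heads          # heads stay personal to each worker

n, d_rep, d_head = 4, 10, 3
mixing = np.full((n, n), 1.0 / n)   # fully connected uniform gossip, for illustration
reps, heads = np.random.randn(n, d_rep), np.random.randn(n, d_head)
reps, heads = decentralized_round(reps, heads, mixing)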



Paperid:1785
Authors:Siheng Xiong, Yuan Yang, Ali Payani, James C Kerce, Faramarz Fekri
Georgia Institute of Technology, Georgia Institute of Technology, Cisco Systems Inc., Georgia Institute of Technology, Georgia Institute of Technology
Abstract:
Conventional embedding-based models approach event time prediction in temporal knowledge graphs (TKGs) as a ranking problem. However, they often fall short in capturing essential temporal relationships such as order and distance. In this paper, we propose TEILP, a logical reasoning framework that naturally integrates such temporal elements into knowledge graph predictions. We first convert TKGs into a temporal event knowledge graph (TEKG) which has a more explicit representation of time in terms of nodes of the graph. The TEKG equips us to develop a differentiable random walk approach to time prediction. Finally, we introduce conditional probability density functions, associated with the logical rules involving the query interval, using which we arrive at the time prediction. We compare TEILP with state-of-the-art methods on five benchmark datasets. We show that our model achieves a significant improvement over baselines while providing interpretable explanations. In particular, we consider several scenarios where training samples are limited, event types are imbalanced, and forecasting the time of future events based on only past events is desired. In all these cases, TEILP outperforms state-of-the-art methods in terms of robustness.



Paperid:1786
Authors:Zikai Xiong, Niccolò Dalmasso, Alan Mishler, Vamsi K. Potluru, Tucker Balch, Manuela Veloso
Massachusetts Institute of Technology, J.P. Morgan AI Research, J.P. Morgan AI Research, J.P. Morgan AI Research, J.P. Morgan AI Research, J.P. Morgan AI Research
Abstract:
Recent years have seen a surge of machine learning approaches aimed at reducing disparities in model outputs across different subgroups. In many settings, training data may be used in multiple downstream applications by different users, which means it may be most effective to intervene on the training data itself. In this work, we present FairWASP, a novel pre-processing approach designed to reduce disparities in classification datasets without modifying the original data. FairWASP returns sample-level weights such that the reweighted dataset minimizes the Wasserstein distance to the original dataset while satisfying (an empirical version of) demographic parity, a popular fairness criterion. We show theoretically that integer weights are optimal, which means our method can be equivalently understood as duplicating or eliminating samples. FairWASP can therefore be used to construct datasets which can be fed into any classification method, not just methods which accept sample weights. Our work is based on reformulating the pre-processing task as a large-scale mixed-integer program (MIP), for which we propose a highly efficient algorithm based on the cutting plane method. Experiments demonstrate that our proposed optimization algorithm significantly outperforms state-of-the-art commercial solvers in solving both the MIP and its linear program relaxation. Further experiments highlight the competitive performance of FairWASP in reducing disparities while preserving accuracy in downstream classification settings.



Paperid:1787
Authors:Cai Xu, Jiajun Si, Ziyu Guan, Wei Zhao, Yue Wu, Xiyue Gao
Xidian University, Xidian University, Xidian University, Xidian University, Xidian University, Xidian University
Abstract:
Multi-view learning aims to combine multiple features to achieve more comprehensive descriptions of data. Most previous works assume that multiple views are strictly aligned. However, real-world multi-view data may contain low-quality conflictive instances, which show conflictive information in different views. Previous methods for this problem mainly focus on eliminating the conflictive data instances by removing them or replacing conflictive views. Nevertheless, real-world applications usually require making decisions for conflictive instances rather than only eliminating them. To solve this, we point out a new Reliable Conflictive Multi-view Learning (RCML) problem, which requires the model to provide decision results and attached reliabilities for conflictive multi-view data. We develop an Evidential Conflictive Multi-view Learning (ECML) method for this problem. ECML first learns view-specific evidence, which could be termed as the amount of support to each category collected from data. Then, we can construct view-specific opinions consisting of decision results and reliability. In the multi-view fusion stage, we propose a conflictive opinion aggregation strategy and theoretically prove this strategy can exactly model the relation of multi-view common and view-specific reliabilities. Experiments performed on 6 datasets verify the effectiveness of ECML. The code is released at https://github.com/jiajunsi/RCML.



Paperid:1788
Authors:Fan Xu, Yu Zhao, Bingzhe Wu, Yueshan Huang, Qin Ren, Yang Xiao, Bing He, Jie Zheng, Jianhua Yao
ShanghaiTech University, Tencent AI Lab, Tencent AI Lab, Shanghai Jiao Tong University, Tencent AI Lab, Tsinghua University, Tencent AI Lab, ShanghaiTech University, Tencent AI Lab
Abstract:
One individual human’s immune repertoire consists of a huge set of adaptive immune receptors at a certain time point, representing the individual's adaptive immune state. Immune repertoire classification and associated receptor identification have the potential to make a transformative contribution to the development of novel vaccines and therapies. The vast number of instances and exceedingly low witness rate pose a great challenge to the immune repertoire classification, which can be formulated as a Massive Multiple Instance Learning (MMIL) problem. Traditional MIL methods, at both bag-level and instance-level, confront the issues of substantial computational burden or supervision ambiguity when handling massive instances. To address these issues, we propose a novel label disambiguation-based multimodal massive multiple instance learning approach (LaDM³IL) for immune repertoire classification. LaDM³IL adapts the instance-level MIL paradigm to deal with the issue of high computational cost and employs a specially-designed label disambiguation module for label correction, mitigating the impact of misleading supervision. To achieve a more comprehensive representation of each receptor, LaDM³IL leverages a multimodal fusion module with gating-based attention and tensor-fusion to integrate the information from gene segments and amino acid (AA) sequences of each immune receptor. Extensive experiments on the Cytomegalovirus (CMV) and Cancer datasets demonstrate the superior performance of the proposed LaDM³IL for both immune repertoire classification and associated receptor identification tasks. The code is publicly available at https://github.com/Josie-xufan/LaDM3IL.



Paperid:1789
Authors:Gehui Xu, Jie Wen, Chengliang Liu, Bing Hu, Yicheng Liu, Lunke Fei, Wei Wang
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Guangdong University of Technology, Guangzhou, Harbin Institute of Technology, Shenzhen
Abstract:
Incomplete multi-view clustering (IMVC) aims to reveal shared clustering structures within multi-view data, where only partial views of the samples are available. Existing IMVC methods primarily suffer from two issues: 1) Imputation-based methods inevitably introduce inaccurate imputations, which in turn degrade clustering performance; 2) Imputation-free methods are susceptible to unbalanced information among views and fail to fully exploit shared information. To address these issues, we propose a novel method based on variational autoencoders. Specifically, we adopt multiple view-specific encoders to extract information from each view and utilize the Product-of-Experts approach to efficiently aggregate information to obtain the common representation. To enhance the shared information in the common representation, we introduce a coherence objective to mitigate the influence of information imbalance. By incorporating the Mixture-of-Gaussians prior information into the latent representation, our proposed method is able to learn the common representation with clustering-friendly structures. Extensive experiments on four datasets show that our method achieves competitive clustering performance compared with state-of-the-art methods.
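The Product-of-Experts aggregation of view-specific Gaussian posteriors is a standard precision-weighted combination, sketched below in numpy; the paper embeds this inside a variational autoencoder and additionally uses a coherence objective and a Mixture-of-Gaussians prior, which are not shown.

import numpy as np

def product_of_experts(mus, logvars):
    """mus, logvars: [n_views, latent_dim] parameters of per-view Gaussians."""
    precisions = np.exp(-np.asarray(logvars))        # 1 / sigma^2 for each view
    joint_var = 1.0 / precisions.sum(axis=0)
    joint_mu = joint_var * (precisions * np.asarray(mus)).sum(axis=0)
    return joint_mu, np.log(joint_var)

mu, logvar = product_of_experts(np.random.randn(3, 8), np.zeros((3, 8)))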



Paperid:1790
Authors:Jian Xu, Delu Zeng
School of Mathematics, South China University of Technology (SCUT), South China University of Technology
Abstract:
The theory of Bayesian learning incorporates the use of Student-t Processes to model heavy-tailed distributions and datasets with outliers. However, despite Student-t Processes having a similar computational complexity as Gaussian Processes, there has been limited emphasis on the sparse representation of this model. This is mainly due to the increased difficulty in modeling and computation compared to previous sparse Gaussian Processes. Our motivation is to address the need for a sparse representation framework that reduces computational complexity, allowing Student-t Processes to be more flexible for real-world datasets. To achieve this, we leverage the conditional distribution of Student-t Processes to introduce sparse inducing points. Bayesian methods and variational inference are then utilized to derive a well-defined lower bound, facilitating more efficient optimization of our model through stochastic gradient descent. We propose two methods for computing the variational lower bound, one utilizing Monte Carlo sampling and the other employing Jensen's inequality to compute the KL regularization term in the loss function. We propose adopting these approaches as viable alternatives to Gaussian processes when the data might contain outliers or exhibit heavy-tailed behavior, and we provide specific recommendations for their applicability. We evaluate the two proposed approaches on various synthetic and real-world datasets from UCI and Kaggle, demonstrating their effectiveness compared to baseline methods in terms of computational complexity and accuracy, as well as their robustness to outliers.



Paperid:1791
Authors:Jiawei Xu, Cheng Zhou, Yizheng Zhang, Baoxiang Wang, Lei Han
Tencent Robotics X The Chinese University of Hong Kong, Shenzhen, Tencent Robotics X, Tencent Robotics X, The Chinese University of Hong Kong, Shenzhen, Tencent Robotics X
Abstract:
We consider the problem of policy transfer between two Markov Decision Processes (MDPs). We introduce a lemma based on existing theoretical results in reinforcement learning to measure the relativity gap between two arbitrary MDPs, that is the difference between any two cumulative expected returns defined on different policies and environment dynamics. Based on this lemma, we propose two new algorithms referred to as Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which offer fast policy transfer and dynamics modelling, respectively. RPO transfers the policy evaluated in one environment to maximize the return in another, while RTO updates the parameterized dynamics model to reduce the gap between the dynamics of the two environments. Integrating the two algorithms results in the complete Relative Policy-Transition Optimization (RPTO) algorithm, in which the policy interacts with the two environments simultaneously, such that data collections from two environments, policy and transition updates are completed in one closed loop to form a principled learning framework for policy transfer. We demonstrate the effectiveness of RPTO on a set of MuJoCo continuous control tasks by creating policy transfer problems with varying dynamics.



Paperid:1792
Authors:Jiaxing Xu, Aihu Zhang, Qingtian Bian, Vijay Prakash Dwivedi, Yiping Ke
Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University
Abstract:
Graph Neural Networks (GNNs) are widely used for graph representation learning in many application domains. The expressiveness of vanilla GNNs is upper-bounded by the 1-dimensional Weisfeiler-Leman (1-WL) test as they operate on rooted subtrees through iterative message passing. In this paper, we empower GNNs by injecting neighbor-connectivity information extracted from a new type of substructure. We first investigate different kinds of connectivities existing in a local neighborhood and identify a substructure called union subgraph, which is able to capture the complete picture of the 1-hop neighborhood of an edge. We then design a shortest-path-based substructure descriptor that possesses three nice properties and can effectively encode the high-order connectivities in union subgraphs. By infusing the encoded neighbor connectivities, we propose a novel model, namely Union Subgraph Neural Network (UnionSNN), which is proven to be strictly more powerful than 1-WL in distinguishing non-isomorphic graphs. Additionally, the local encoding from union subgraphs can also be injected into arbitrary message-passing neural networks (MPNNs) and Transformer-based models as a plugin. Extensive experiments on 18 benchmarks of both graph-level and node-level tasks demonstrate that UnionSNN outperforms state-of-the-art baseline models, with competitive computational efficiency. The injection of our local encoding to existing models is able to boost the performance by up to 11.09%. Our code is available at https://github.com/AngusMonroe/UnionSNN.
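The union subgraph of an edge (u, v) is simply the subgraph induced by u, v and the union of their 1-hop neighborhoods; a short networkx sketch is given below. This is illustrative only, and the shortest-path-based descriptor and UnionSNN layers are not reproduced.

import networkx as nx

def union_subgraph(G, u, v):
    # nodes of the union subgraph: the edge endpoints plus both 1-hop neighborhoods
    nodes = {u, v} | set(G.neighbors(u)) | set(G.neighbors(v))
    return G.subgraph(nodes).copy()

G = nx.karate_club_graph()
sub = union_subgraph(G, 0, 1)
# a shortest-path-based descriptor of this substructure could then be computed,
# e.g., from nx.shortest_path_length(sub)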



Paperid:1793
Authors:Jiaxuan Xu, Taiyong Li, Lei Duan
School of Computer Science, Sichuan University, Chengdu, China, School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China, School of Computer Science, Sichuan University, Chengdu, China
Abstract:
Ensemble clustering learns more accurate consensus results from a set of weak base clustering results. This task is more challenging than other clustering problems due to the randomness of the base clustering result set and the inaccessibility of data features. Existing ensemble clustering methods rely on the quality of the Co-association (CA) matrix but lack the capability to handle missing connections in base clusterings. Inspired by the neighborhood high-order and topological similarity theories, this paper proposes a topological ensemble model based on high-order information. Specifically, this paper compensates for missing connections by mining neighborhood high-order connection information in the CA matrix and learning optimal connections with adaptive weights. Afterward, the learned high-quality connections are embedded into topology learning to capture the topology of the base clusterings. Finally, we incorporate adaptive high-order connection representation and topology learning into a unified learning framework. To our knowledge, this is the first ensemble clustering work based on topological similarity and high-order connectivity relations. Extensive experiments on multiple datasets demonstrate the effectiveness of the proposed method. The source code of the proposed approach is available at https://github.com/ltyong/awec.
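As background, the co-association matrix that ensemble clustering starts from can be computed in a few lines of numpy, as sketched below; the paper's high-order connection mining and topology learning operate on top of such a matrix and are not reproduced here.

import numpy as np

def co_association(base_labels):
    """base_labels: [n_clusterings, n_samples] integer cluster assignments.
    CA[i, j] is the fraction of base clusterings placing samples i and j together."""
    base_labels = np.asarray(base_labels)
    m, n = base_labels.shape
    ca = np.zeros((n, n))
    for labels in base_labels:
        ca += (labels[:, None] == labels[None, :]).astype(float)
    return ca / m

ca = co_association([[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]])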



Paperid:1794
Authors:Ke Xu, Zhongcheng Li, Shanshan Wang, Xingyi Zhang
Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University School of Artificial Intelligence, Anhui University, School of Artificial Intelligence, Anhui University, Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University School of Computer Science and Technology, Anhui University
Abstract:
The ability of model quantization with arbitrary bit-width to dynamically meet diverse bit-width requirements at runtime has attracted significant attention. Recent research has focused on optimizing large-scale training methods to achieve robust bit-width adaptation, which is a time-consuming process requiring hundreds of GPU hours. Furthermore, converting bit-widths requires recalculating the statistical parameters of the norm layers, thereby impeding real-time switching of the bit-width. To overcome these challenges, we propose an efficient Post-Training Multi-bit Quantization (PTMQ) scheme that requires only a small amount of calibration data to perform block-wise reconstruction of multi-bit quantization errors. It eliminates the influence of statistical parameters by fusing norm layers, and supports real-time switching of bit-widths in uniform quantization and mixed-precision quantization. To improve quantization accuracy and robustness, we propose a Multi-bit Feature Mixer technique (MFM) for fusing features of different bit-widths to enhance robustness across varying bit-widths. Moreover, we introduce the Group-wise Distillation Loss (GD-Loss) to enhance the correlation between different bit-width groups and further improve the overall performance of PTMQ. Extensive experiments demonstrate that PTMQ achieves comparable performance to existing state-of-the-art post-training quantization methods, while its optimization is 100x faster than recent multi-bit quantization works. Code is available at https://github.com/xuke225/PTMQ.
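The basic operation a multi-bit scheme switches between at runtime is plain uniform quantization at a chosen bit-width, sketched below in numpy; PTMQ's block-wise reconstruction, norm-layer fusion, MFM, and GD-Loss are not shown.

import numpy as np

def uniform_quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1                     # e.g., 127 for 8 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                               # dequantized ("fake-quant") weights

w = np.random.randn(64, 64)
w8, w4, w2 = uniform_quantize(w, 8), uniform_quantize(w, 4), uniform_quantize(w, 2)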



Paperid:1795
Authors:Kunlun Xu, Xu Zou, Jiahuan Zhou
Peking University, Huazhong University of Science and Technology, Peking University
Abstract:
Lifelong person re-identification (LReID) aims to train a unified model from diverse data sources step by step. The severe domain gaps between different training steps result in catastrophic forgetting in LReID, and existing methods mainly rely on data replay and knowledge distillation techniques to handle this issue. However, the former solution needs to store historical exemplars, which inevitably impedes data privacy. The existing knowledge distillation-based models usually retain all the knowledge of the learned old models without any selection, which will inevitably include erroneous and detrimental knowledge that severely impacts the learning performance of the new model. To address these issues, we propose an exemplar-free LReID method named Long-Short Term Knowledge Consolidation (LSTKC) that contains a Rectification-based Short-Term Knowledge Transfer module (R-STKT) and an Estimation-based Long-Term Knowledge Consolidation module (E-LTKC). For each learning iteration within one training step, R-STKT aims to filter and rectify the erroneous knowledge contained in the old model and transfer the rectified knowledge to facilitate the short-term learning of the new model. Meanwhile, once one training step is finished, E-LTKC proposes to further consolidate the learned long-term knowledge via adaptively fusing the parameters of models from different steps. Experimental results show that our LSTKC exceeds the state-of-the-art methods by 6.3%/9.4% and 7.9%/4.5%, 6.4%/8.0% and 9.0%/5.5% average mAP/R@1 on seen and unseen domains under two different training orders of the challenging LReID benchmark respectively.



Paperid:1796
Authors:Shixiong Xu, Gaofeng Meng, Xing Nie, Bolin Ni, Bin Fan, Shiming Xiang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences Centre for Artificial Intelligence and Robotics, HK Institute of Science & Innovation, Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, School of Intelligence Science and Technology, University of Science and Technology, Beijing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences
Abstract:
We observe a high level of imbalance in the accuracy of different learned classes in the same old task for the first time. This intriguing phenomenon, discovered in replay-based Class Incremental Learning (CIL), highlights the imbalanced forgetting of learned classes, as their accuracy is similar before the occurrence of catastrophic forgetting. This discovery remains previously unidentified due to the reliance on average incremental accuracy as the measurement for CIL, which assumes that the accuracy of classes within the same task is similar. However, this assumption is invalid in the face of catastrophic forgetting. Further empirical studies indicate that this imbalanced forgetting is caused by conflicts in representation between semantically similar old and new classes. These conflicts are rooted in the data imbalance present in replay-based CIL methods. Building on these insights, we propose CLass-Aware Disentanglement (CLAD) as a means to predict the old classes that are more likely to be forgotten and enhance their accuracy. Importantly, CLAD can be seamlessly integrated into existing CIL methods. Extensive experiments demonstrate that CLAD consistently improves current replay-based methods, resulting in performance gains of up to 2.56%.



Paperid:1797
Authors:Wangkun Xu, Jianhong Wang, Fei Teng
Department of EEE, Imperial College London, UK, Center for AI Fundamentals, University of Manchester, UK, Department of EEE, Imperial College London, UK
Abstract:
Successful machine learning involves a complete pipeline of data, model, and downstream applications. Instead of treating them separately, there has been a prominent increase in attention within the constrained optimization (CO) and machine learning (ML) communities towards combining prediction and optimization models. The so-called end-to-end (E2E) learning captures the task-based objective for which the predictions will be used in decision making. Although a large variety of E2E algorithms have been presented, it has not been fully investigated how to systematically address uncertainties involved in such models. Most of the existing work considers the uncertainties of ML in the input space and improves robustness through adversarial training. We extend this idea to E2E learning and prove that there is a robustness certification procedure by solving augmented integer programming. Furthermore, we show that neglecting the uncertainty of COs during training causes a new trigger for generalization errors. To include all these components, we propose a unified framework that covers the uncertainties emerging in both the input feature space of the ML models and the COs. The framework is described as a robust optimization problem and is practically solved via end-to-end adversarial training (E2E-AT). Finally, the performance of E2E-AT is evaluated by a real-world end-to-end power system operation problem, including load forecasting and sequential scheduling tasks.
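The input-space adversarial training that E2E-AT builds on can be sketched with a single FGSM-style perturbation step in PyTorch; this is a generic illustration with hypothetical names, and the paper's framework additionally covers uncertainty in the downstream constrained optimization.

import torch

def fgsm_perturb(model, loss_fn, x, y, eps=0.01):
    # one-step adversarial perturbation of the input features
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

# training step (sketch): fit the forecaster on the worst-case inputs
# x_adv = fgsm_perturb(forecaster, task_loss, x, y)
# loss = task_loss(forecaster(x_adv), y)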



Paperid:1798
Authors:Zhengqin Xu, Yulun Zhang, Chao Ma, Yichao Yan, Zelin Peng, Shoulie Xie, Shiqian Wu, Xiaokang Yang
Shanghai Jiao Tong University, ETH Zurich, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Institute for Infocomm Research, Singapore 138632, School of Information Science and Engineering, Wuhan University of Science and Technology, Shanghai Jiao Tong University of China
Abstract:
A fundamental task in computer vision, Low-Rank Matrix Recovery (LRMR) focuses on precisely recovering the inherent low-rank structure from incomplete data and/or corrupted measurements, given that the rank is known a priori or accurately estimated. However, it remains challenging for existing rank estimation methods to accurately estimate the rank of an ill-conditioned matrix. Also, existing LRMR optimization methods are heavily dependent on the chosen parameters, and are therefore difficult to adapt to different situations. Addressing these issues, a novel LEarning-based low-rank matrix recovery with Rank Estimation (LERE) method is proposed. More specifically, by considering the characteristics of the Gerschgorin disk's center and radius, a new heuristic decision rule based on the Gerschgorin Disk Theorem is significantly enhanced so that the low-rank boundary can be exactly located, which leads to a marked improvement in the accuracy of rank estimation. According to the estimated rank, we select row and column sub-matrices from the observation matrix by uniformly random sampling. A 17-iteration feedforward-recurrent-mixed neural network is then adapted to learn the parameters in the sub-matrix recovery processing. Finally, by the correlation of the row sub-matrix and column sub-matrix, LERE successfully recovers the underlying low-rank matrix. Overall, LERE is more efficient and robust than existing LRMR methods. Experimental results demonstrate that LERE surpasses state-of-the-art (SOTA) methods. The code for this work is accessible at https://github.com/zhengqinxu/LERE.
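The Gerschgorin disk quantities that the rank-estimation step reasons about are straightforward to compute, as in the numpy sketch below; the enhanced decision rule for locating the low-rank boundary is the paper's contribution and is not reproduced here.

import numpy as np

def gerschgorin_disks(A):
    A = np.asarray(A)
    centers = np.diag(A)
    radii = np.abs(A).sum(axis=1) - np.abs(centers)   # off-diagonal absolute row sums
    return centers, radii

# every eigenvalue of A lies in at least one disk |z - centers[i]| <= radii[i]
A = np.random.randn(5, 5)
centers, radii = gerschgorin_disks(A)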



Paperid:1799
Authors:Bo Xue, Ji Cheng, Fei Liu, Yimu Wang, Qingfu Zhang
City University of Hong Kong, Hong Kong, China The City University of Hong Kong Shenzhen Research Institute, Shenzhen, China, City University of Hong Kong, Hong Kong, China The City University of Hong Kong Shenzhen Research Institute, Shenzhen, China, City University of Hong Kong, Hong Kong, China The City University of Hong Kong Shenzhen Research Institute, Shenzhen, China, University of Waterloo, Waterloo, Canada, City University of Hong Kong, Hong Kong, China The City University of Hong Kong Shenzhen Research Institute, Shenzhen, China
Abstract:
This paper studies the multiobjective bandit problem under lexicographic ordering, wherein the learner aims to simultaneously maximize m objectives hierarchically. The only existing algorithm for this problem considers the multi-armed bandit model, and its regret bound is O((KT)^(2/3)) under a metric called priority-based regret. However, this bound is suboptimal, as the lower bound for single objective multi-armed bandits is Omega(K log T). Moreover, this bound becomes vacuous when the arm number K is infinite. To address these limitations, we investigate the multiobjective Lipschitz bandit model, which allows for an infinite arm set. Utilizing a newly designed multi-stage decision-making strategy, we develop an improved algorithm that achieves a general regret bound of O(T^((d_z^i+1)/(d_z^i+2))) for the i-th objective, where d_z^i is the zooming dimension for the i-th objective, with i in {1,2,...,m}. This bound matches the lower bound of the single objective Lipschitz bandit problem in terms of T, indicating that our algorithm is almost optimal. Numerical experiments confirm the effectiveness of our algorithm.



Paperid:1800
Authors:Yangkai Xue, Jindou Dai, Zhipeng Lu, Yuwei Wu, Yunde Jia
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology, China, Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology, China, Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, China, Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology, China Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, China, Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, China Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology, China
Abstract:
Hyperbolic graph convolutional networks (HGCNs) have demonstrated representational capabilities for modeling hierarchically structured graphs. However, as in general GCNs, over-smoothing may occur as the number of model layers increases, limiting the representation capabilities of most current HGCN models. In this paper, we propose residual hyperbolic graph convolutional networks (R-HGCNs) to address the over-smoothing problem. We introduce a hyperbolic residual connection function to overcome the over-smoothing problem, and also theoretically prove the effectiveness of the hyperbolic residual function. Moreover, we use product manifolds and HyperDrop to facilitate the R-HGCNs. The distinctive features of the R-HGCNs are as follows: (1) The hyperbolic residual connection preserves the initial node information in each layer and adds a hyperbolic identity mapping to prevent node features from being indistinguishable. (2) Product manifolds in R-HGCNs have been set up with different origin points in different components to facilitate the extraction of feature information from a wider range of perspectives, which enhances the representation capability of R-HGCNs. (3) HyperDrop adds multiplicative Gaussian noise into hyperbolic representations, such that perturbations can be added to alleviate the over-fitting problem without deconstructing the hyperbolic geometry. Experimental results demonstrate the effectiveness of R-HGCNs under various graph convolution layers and different structures of product manifolds.
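For concreteness, a hyperbolic residual connection can be built from Mobius addition on the Poincare ball, which combines two points without leaving the manifold. The sketch below is only an illustrative construction under that assumption (the curvature c and the helper names mobius_add and hyperbolic_residual are ours), not the exact residual function proposed in the paper.

import numpy as np

def mobius_add(x, y, c=1.0):
    # Mobius addition of two points x, y on the Poincare ball with curvature -c.
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + (c ** 2) * x2 * y2
    return num / den

def hyperbolic_residual(h_prev, h_new, c=1.0):
    # Combine the previous-layer representation with the new one via Mobius
    # addition, so initial node information is carried forward residual-style.
    return mobius_add(h_prev, h_new, c)

h0 = np.array([0.10, -0.20, 0.05])  # two toy points inside the unit ball
h1 = np.array([0.05, 0.10, -0.10])
print(hyperbolic_residual(h0, h1))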



Paperid:1801
Authors:Vijaya Krishna Yalavarthi, Kiran Madhusudhanan, Randolf Scholz, Nourhan Ahmed, Johannes Burchert, Shayan Jawed, Stefan Born, Lars Schmidt-Thieme
ISMLL, University of Hildesheim, ISMLL, University of Hildesheim, ISMLL, University of Hildesheim, ISMLL, University of Hildesheim, ISMLL, University of Hildesheim, ISMLL, University of Hildesheim, Technische Universität Berlin, ISMLL, University of Hildesheim
Abstract:
Forecasting irregularly sampled time series with missing values is a crucial task for numerous real-world applications such as healthcare, astronomy, and climate sciences. State-of-the-art approaches to this problem rely on Ordinary Differential Equations (ODEs), which are known to be slow and often require additional features to handle missing values. To address this issue, we propose a novel model using Graphs for Forecasting Irregularly Sampled Time Series with missing values, which we call GraFITi. GraFITi first converts the time series to a Sparsity Structure Graph, which is a sparse bipartite graph, and then reformulates the forecasting problem as the edge weight prediction task in the graph. It uses the power of Graph Neural Networks to learn the graph and predict the target edge weights. GraFITi has been tested on 3 real-world and 1 synthetic irregularly sampled time series datasets with missing values and compared with various state-of-the-art models. The experimental results demonstrate that GraFITi improves the forecasting accuracy by up to 17% and reduces the run time by up to a factor of 5 compared to the state-of-the-art forecasting models.
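To make the graph construction concrete, the sketch below shows one simple way an irregularly sampled multivariate series could be encoded as a sparse bipartite graph, with timestamp nodes on one side, channel nodes on the other, and each observed value as an edge weight. The data layout and the helper to_sparse_bipartite_graph are illustrative assumptions, not the authors' implementation.

def to_sparse_bipartite_graph(times, channels, values):
    # One node per distinct timestamp, one node per channel; every observed
    # (time, channel, value) triple becomes an edge whose weight is the value.
    # Missing (time, channel) pairs simply have no edge, so the graph is sparse.
    time_nodes = {t: i for i, t in enumerate(sorted(set(times)))}
    chan_nodes = {c: j for j, c in enumerate(sorted(set(channels)))}
    edges = [(time_nodes[t], chan_nodes[c], v)
             for t, c, v in zip(times, channels, values)]
    return time_nodes, chan_nodes, edges

# Toy irregular series: channel 0 observed at t=0.0 and t=2.5, channel 1 at t=1.3.
times = [0.0, 1.3, 2.5]
channels = [0, 1, 0]
values = [0.7, -1.2, 0.4]
print(to_sparse_bipartite_graph(times, channels, values))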



Paperid:1802
Authors:Xiaoqiang Yan, Yingtao Gan, Yiqiao Mao, Yangdong Ye, Hui Yu
Zhengzhou University, Zhengzhou University, Zhengzhou University, Zhengzhou University, University of Portsmouth
Abstract:
Multi-view action clustering leverages the complementary information from different camera views to enhance the clustering performance. Although existing approaches have achieved significant progress, they assume all camera views are available in advance, which is impractical when the camera views arrive incrementally over time. Besides, learning the invariant information among multiple camera views is still a challenging issue, especially in the continual learning scenario. Aiming at these problems, we propose a novel continual action clustering (CAC) method, which is capable of learning action categories in a continual learning manner. To be specific, we first devise a category memory library, which captures and stores the learned categories from historical views. Then, as a new camera view arrives, we only need to maintain a consensus partition matrix, which can be updated by leveraging the incoming new camera view rather than keeping all of them. Finally, a three-step alternate optimization is proposed, in which the category memory library and consensus partition matrix are optimized. The empirical results on 6 realistic multi-view action collections demonstrate the excellent clustering performance and time/space efficiency of CAC compared with 15 state-of-the-art baselines.



Paperid:1803
Authors:Yan Yan, Yuhong Guo
Carleton University, Carleton University Canada CIFAR AI Chair, Amii
Abstract:
Partial label learning (PLL) expands the applicability of supervised machine learning models by enabling effective learning from weakly annotated overcomplete labels. Existing PLL methods, however, focus on the standard centralized learning scenario. In this paper, we expand PLL into the distributed computation setting by formalizing a new learning scenario named federated partial label learning (FedPLL), where the training data with partial labels are distributed across multiple local clients with privacy constraints. To address this challenging problem, we propose a novel Federated PLL method with Local-Adaptive Augmentation and Regularization (FedPLL-LAAR). In addition to alleviating the partial label noise with moving-average label disambiguation, the proposed method performs MixUp-based local-adaptive data augmentation to mitigate the challenge posed by insufficient and imprecisely annotated local data, and dynamically incorporates the guidance of the global model to minimize client drift through adaptive gradient alignment regularization between the global and local models. Extensive experiments conducted on multiple datasets under the FedPLL setting demonstrate the effectiveness of the proposed FedPLL-LAAR method for federated partial label learning.



Paperid:1804
Authors:Yuguang Yan, Zhihao Xu, Canlin Yang, Jie Zhang, Ruichu Cai, Michael Kwok-Po Ng
School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, Department of Mathematics, The University of Hong Kong, Hong Kong, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China Peng Cheng Laboratory, Shenzhen, China, Department of Mathematics, Hong Kong Baptist University, Hong Kong, China
Abstract:
Clustering is one of the most fundamental problems in machine learning and data mining, and many algorithms have been proposed in the past decades. Among them, subspace clustering and spectral clustering are the most famous approaches. In this paper, we provide an explanation for subspace clustering and spectral clustering from the perspective of optimal transport. Optimal transport studies how to move samples from one distribution to another distribution with minimal transport cost, and has shown a powerful ability to extract geometric information. By considering a self optimal transport model with only one group of samples, we observe that both subspace clustering and spectral clustering can be explained in the framework of optimal transport, and the optimal transport matrix bridges the spaces of features and spectral embeddings. Inspired by this connection, we propose a spectral optimal transport barycenter model, which learns spectral embeddings by solving a barycenter problem equipped with an optimal transport discrepancy and guidance of data. Based on our proposed model, we take advantage of optimal transport to exploit both feature and metric information involved in data for learning coupled spectral embeddings and affinity matrix in a unified model. We develop an alternating optimization algorithm to solve the resultant problems, and conduct experiments in different settings to evaluate the performance of our proposed methods.
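As background for the optimal transport view described above, the sketch below computes an entropically regularized transport plan between two small empirical distributions with standard Sinkhorn iterations. It is generic Sinkhorn under our own cost and regularization choices, not the paper's spectral barycenter model.

import numpy as np

def sinkhorn(C, a, b, reg=0.1, n_iters=200):
    # Entropic optimal transport: find a coupling P with row marginals a and
    # column marginals b that (approximately) minimizes <P, C> - reg * H(P).
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

# Toy example: transport between two 1-D point clouds with a squared-distance cost.
X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([[0.5], [1.5]])
C = (X - Y.T) ** 2
a = np.full(3, 1.0 / 3.0)
b = np.full(2, 1.0 / 2.0)
P = sinkhorn(C, a, b)
print(P)                              # transport coupling
print(P.sum(axis=1), P.sum(axis=0))   # recovered marginals, close to a and b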



Paperid:1805
Authors:Yuguang Yan, Zeqin Yang, Weilin Chen, Ruichu Cai, Zhifeng Hao, Michael KwokPo Ng
School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China Peng Cheng Laboratory, Shenzhen, China, College of Science, Shantou University, Shantou, China, Department of Mathematics, Hong Kong Baptist University, Hong Kong, China
Abstract:
Estimating treatment effects from observational data suffers from the issue of confounding bias, which is induced by the imbalanced confounder distributions between the treated and control groups. As an effective approach, re-weighting learns a group of sample weights to balance the confounder distributions. Existing re-weighting methods rely heavily on a propensity score model or moment alignment. However, for complex real-world data, it is difficult to obtain an accurate propensity score prediction. Although moment alignment does not require learning a propensity score model, accurate estimation of high-order moments is computationally difficult and remains an open challenge, while first- and second-order moments are insufficient to align the distributions and are easily misled by outliers. In this paper, we exploit geometry to capture the intrinsic structure involved in data for balancing the confounder distributions, so that confounding bias can be reduced even with outliers. To achieve this, we construct a connection between treatment effect estimation and optimal transport, a powerful tool to capture geometric information. After that, we propose an optimal transport model to learn sample weights by extracting geometry from confounders, in which geometric information between groups and within groups is leveraged for better confounder balancing. A projected mirror descent algorithm is employed to solve the derived optimization problem. Experimental studies on both synthetic and real-world datasets demonstrate the effectiveness of our proposed method.
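The re-weighting idea can be illustrated with a basic mirror descent (exponentiated gradient) update on the probability simplex that learns treated-sample weights so that the weighted confounder mean matches the control-group mean. The mean-matching loss, step size, and helper name below are our own simplifications for illustration, not the paper's optimal transport objective.

import numpy as np

def balance_weights(X_treated, target_mean, lr=0.05, n_iters=2000):
    # Mirror descent on the simplex: multiplicative updates followed by
    # renormalization, driving the weighted treated mean towards target_mean.
    n = X_treated.shape[0]
    w = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        gap = X_treated.T @ w - target_mean   # imbalance in confounder means
        grad = X_treated @ gap                # gradient of 0.5 * ||gap||^2 w.r.t. w
        w = w * np.exp(-lr * grad)            # exponentiated-gradient step
        w /= w.sum()                          # stay on the probability simplex
    return w

rng = np.random.default_rng(0)
X_t = rng.normal(1.0, 1.0, size=(50, 3))      # treated confounders (shifted)
X_c = rng.normal(0.0, 1.0, size=(80, 3))      # control confounders
w = balance_weights(X_t, X_c.mean(axis=0))
print(X_t.T @ w)          # weighted treated mean, close to the control mean
print(X_c.mean(axis=0))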



Paperid:1806
Authors:Chengyi Yang, Jiayin Qi, Aimin Zhou
East China Normal University, Guangzhou University, East China Normal University
Abstract:
Differential privacy (DP) has achieved remarkable results in the field of privacy-preserving machine learning. However, existing DP frameworks do not satisfy all the conditions for becoming metrics, which prevents them from deriving better basic privacy properties and leads to exaggerated values on privacy budgets. We propose Wasserstein differential privacy (WDP), an alternative DP framework to measure the risk of privacy leakage, which satisfies the properties of symmetry and triangle inequality. We show and prove that WDP has 13 excellent properties, which provide theoretical support for the better performance of WDP compared with other DP frameworks. In addition, we derive a general privacy accounting method called the Wasserstein accountant, which enables WDP to be applied in stochastic gradient descent (SGD) scenarios containing subsampling. Experiments on basic mechanisms, compositions, and deep learning show that the privacy budgets obtained by the Wasserstein accountant are relatively stable and less influenced by order. Moreover, the overestimation of privacy budgets can be effectively alleviated. The code is available at https://github.com/Hifipsysta/WDP.
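To make the central quantity concrete: a Wasserstein-based privacy notion compares a mechanism's output distributions on neighboring datasets via a transport distance. The sketch below only estimates the empirical 1-Wasserstein distance between the outputs of a Laplace mechanism on two neighboring inputs (sensitivity 1); it is an assumption-laden illustration of that distance, not the WDP definition or the Wasserstein accountant.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def laplace_mechanism(true_value, scale, n_samples=100000):
    # Samples from the output distribution of a Laplace mechanism.
    return true_value + rng.laplace(0.0, scale, size=n_samples)

# Neighboring datasets whose query answers differ by the sensitivity (here 1.0).
out_d = laplace_mechanism(10.0, scale=1.0)
out_d_prime = laplace_mechanism(11.0, scale=1.0)

# Empirical 1-Wasserstein distance between the two output distributions.
print(wasserstein_distance(out_d, out_d_prime))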



Paperid:1807
Authors:Dezhi Yang, Xintong He, Jun Wang, Guoxian Yu, Carlotta Domeniconi, Jinglin Zhang
Shandong University, National University of Singapore, Shandong University, Shandong University, George Mason University, Shandong University
Abstract:
Discovering the causality from observational data is a crucial task in various scientific domains. With increasing awareness of privacy, raw data cannot be exposed, and it is very hard to learn causal graphs from dispersed data, since these data may have different distributions. In this paper, we propose a federated causal discovery strategy (FedCausal) to learn the unified global causal graph from decentralized heterogeneous data. We design a global optimization formula to naturally aggregate the causal graphs from client data and constrain the acyclicity of the global graph without exposing local data. Unlike other federated causal learning algorithms, FedCausal unifies the local and global optimizations into a complete directed acyclic graph (DAG) learning process with a flexible optimization objective. We prove that this optimization objective has high interpretability and can adaptively handle homogeneous and heterogeneous data. Experimental results on synthetic and real datasets show that FedCausal can effectively deal with non-independently and identically distributed (non-iid) data and achieves superior performance.
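Continuous DAG learning of this kind typically enforces acyclicity with a smooth constraint such as the NOTEARS-style penalty h(W) = tr(exp(W o W)) - d, which is zero exactly when the weighted adjacency matrix W is acyclic. The sketch below implements that standard constraint as an assumed ingredient; the abstract does not state that FedCausal uses this exact formulation.

import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    # NOTEARS-style constraint: h(W) = tr(exp(W * W)) - d, where * is the
    # elementwise product. h(W) == 0 if and only if W encodes a DAG.
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

W_dag = np.array([[0.0, 1.2], [0.0, 0.0]])    # edge 1 -> 2 only: acyclic
W_cycle = np.array([[0.0, 1.2], [0.8, 0.0]])  # edges 1 <-> 2: contains a cycle
print(acyclicity(W_dag), acyclicity(W_cycle)) # ~0.0 vs. a strictly positive value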



Paperid:1808
Authors:Fan Yang, Wei Li, Menglong Yang, Binbin Liang, Jianwei Zhang
Sichuan University, Sichuan University, Sichuan University, Sichuan University, Sichuan University
Abstract:
Description-based person search aims to retrieve images of the target identity via textual descriptions. One of the challenges for this task is to extract discriminative representation from images and descriptions. Most existing methods apply the part-based split method or external models to explore the fine-grained details of local features, which ignore the global relationship between partial information and cause network instability. To overcome these issues, we propose a Multi-modal Disordered Representation Learning Network (MDRL) for description-based person search to fully extract the visual and textual representations. Specifically, we design a Cross-modality Global Feature Learning Architecture to learn the global features from the two modalities and meet the demand of the task. Based on our global network, we introduce a Disorder Local Learning Module to explore local features by a disordered reorganization strategy from both visual and textual aspects and enhance the robustness of the whole network. Besides, we introduce a Cross-modality Interaction Module to guide the two streams to extract visual or textual representations considering the correlation between modalities. Extensive experiments are conducted on two public datasets, and the results show that our method outperforms the state-of-the-art methods on the CUHK-PEDES and ICFG-PEDES datasets and achieves superior performance.



Paperid:1809
Authors:Mingzhao Yang, Shangchao Su, Bin Li, Xiangyang Xue
Fudan University, Fudan University, Fudan University, Fudan University
Abstract:
Recently, semi-supervised federated learning (semi-FL) has been proposed to handle the commonly seen real-world scenarios with labeled data on the server and unlabeled data on the clients. However, existing methods face several challenges such as communication costs, data heterogeneity, and training pressure on client devices. To address these challenges, we introduce the powerful diffusion models (DM) into semi-FL and propose FedDISC, a Federated Diffusion-Inspired Semi-supervised Co-training method. Specifically, we first extract prototypes of the labeled server data and use these prototypes to predict pseudo-labels of the client data. For each category, we compute the cluster centroids and domain-specific representations to signify the semantic and stylistic information of their distributions. After adding noise, these representations are sent back to the server, which uses the pre-trained DM to generate synthetic datasets complying with the client distributions and trains a global model on them. With the assistance of vast knowledge within DM, the synthetic datasets have comparable quality and diversity to the client images, subsequently enabling the training of global models that achieve performance equivalent to or even surpassing the ceiling of supervised centralized training. FedDISC works within one communication round, does not require any local training, and involves very minimal information uploading, greatly enhancing its practicality. Extensive experiments on three large-scale datasets demonstrate that FedDISC effectively addresses the semi-FL problem on non-IID clients and outperforms the compared SOTA methods. Sufficient visualization experiments also illustrate that the synthetic dataset generated by FedDISC exhibits comparable diversity and quality to the original client dataset, with a negligible possibility of leaking privacy-sensitive information of the clients.



Paperid:1810
Authors:Senqiao Yang, Jiarui Wu, Jiaming Liu, Xiaoqi Li, Qizhe Zhang, Mingjie Pan, Yulu Gan, Zehui Chen, Shanghang Zhang
Peking University, Peking University, Peking University, Peking University, Peking University, Peking University, Peking University, University of Science and Technology of China, Peking University
Abstract:
Visual prompts provide an efficient means of addressing visual cross-domain problems. Previous works introduce domain prompts to tackle the classification Test-Time Adaptation (TTA) problem by placing image-level prompts on the input and fine-tuning prompts for each target domain. However, since the image-level prompts mask out continuous spatial details in the prompt-allocated region, they suffer from inaccurate contextual information and limited domain knowledge extraction, particularly when dealing with dense prediction TTA problems. To overcome these challenges, we propose a novel Sparse Visual Domain Prompts (SVDP) approach, which applies minimal trainable parameters (e.g., 0.1%) to pixels across the entire image and reserves more spatial information of the input. To better apply SVDP in extracting domain-specific knowledge, we introduce the Domain Prompt Placement (DPP) method to adaptively allocate trainable parameters of SVDP on the pixels with large distribution shifts. Furthermore, recognizing that each target domain sample exhibits a unique domain shift, we design a Domain Prompt Updating (DPU) strategy to optimize prompt parameters differently for each sample, facilitating efficient adaptation to the target domain. Extensive experiments were conducted on widely-used TTA and continual TTA benchmarks, and our proposed method achieves state-of-the-art performance in both semantic segmentation and depth estimation tasks.



Paperid:1811
Authors:Sikun Yang, Hongyuan Zha
School of Computing and Information Technology, Great Bay University, 523000 Dongguan, China Great Bay Institute for Advanced Study, Great Bay University Dongguan Key Laboratory for Data Science and Intelligent Medicine, Great Bay University, Shenzhen Institute of Artificial Intelligence and Robotics for Society The Chinese University of Hong Kong, Shenzhen
Abstract:
Continuously observed event occurrences often exhibit self-exciting and mutually exciting effects, which can be well modeled using temporal point processes. Beyond that, these event dynamics may also change over time, with certain periodic trends. We propose a novel variational autoencoder to capture such a mixture of temporal dynamics. More specifically, the whole time interval of the input sequence is partitioned into a set of subintervals. The event dynamics are assumed to be stationary within each subinterval, but could be changing across those subintervals. In particular, we use a sequential latent variable model to learn a dependency graph between the observed dimensions, for each subinterval. The model predicts the future event times by using the learned dependency graph to remove the non-contributing influences of past events. By doing so, the proposed model demonstrates its higher accuracy in predicting inter-event times and event types for several real-world event sequences, compared with existing state-of-the-art neural point processes.
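For reference, the classical self-exciting (Hawkes) intensity that such point process models build on has the form lambda(t) = mu + sum over t_i < t of alpha * exp(-beta * (t - t_i)). The sketch below simply evaluates this textbook intensity; the parameter names mu, alpha, and beta are our own assumptions, and the paper's variational model is not shown here.

import numpy as np

def hawkes_intensity(t, event_times, mu=0.2, alpha=0.8, beta=1.5):
    # Classical self-exciting intensity: a base rate mu plus an exponentially
    # decaying excitation of size alpha contributed by every past event.
    past = np.asarray([s for s in event_times if s < t])
    return mu + alpha * np.exp(-beta * (t - past)).sum()

events = [0.5, 1.1, 1.2, 3.0]
for t in [1.0, 1.3, 4.0]:
    print(t, hawkes_intensity(t, events))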



Paperid:1812
Authors:Tianpei Yang, Heng You, Jianye Hao, Yan Zheng, Matthew E. Taylor
College of Intelligence and Computing, Tianjin University University of Alberta, College of Intelligence and Computing, Tianjin University, College of Intelligence and Computing, Tianjin University, College of Intelligence and Computing, Tianjin University, University of Alberta Alberta Machine Intelligence Institute (Amii)
Abstract:
Transfer learning (TL) has shown great potential to improve Reinforcement Learning (RL) efficiency by leveraging prior knowledge in new tasks. However, much of the existing TL research focuses on transferring knowledge between tasks that share the same state-action spaces. Further, transfer from multiple source tasks that have different state-action spaces is more challenging and needs to be solved urgently to improve the generalization and practicality of the method in real-world scenarios. This paper proposes TURRET (Transfer Using gRaph neuRal nETworks) to utilize the generalization capabilities of Graph Neural Networks (GNNs) to facilitate efficient and effective multi-source policy transfer learning in the state-action mismatch setting. TURRET learns a semantic representation by accounting for the intrinsic property of the agent through GNNs, which leads to a unified state embedding space for all tasks. As a result, TURRET achieves more efficient transfer with strong generalization ability between different tasks and can be easily combined with existing Deep RL algorithms. Experimental results show that TURRET significantly outperforms other TL methods on multiple continuous action control tasks, successfully transferring across robots with different state-action spaces.



Paperid:1813
Authors:Xiao-Wen Yang, Jie-Jing Shao, Wei-Wei Tu, Yu-Feng Li, Wang-Zhou Dai, Zhi-Hua Zhou
Nanjing University, Nanjing University, 4Paradigm Inc., Nanjing University, Nanjing University, Nanjing University
Abstract:
Integrating complementary strengths of raw data and logical rules to improve learning generalization has recently been shown to be promising and effective; e.g., abductive learning is one generic framework that can learn the perception model from data and reason over logical rules simultaneously. However, the performance would be seriously decreased when inaccurate logical rules appear, which may be even worse than baselines using only raw data. Efforts on this issue are highly desired but remain limited. This paper proposes a simple and effective safe abductive learning method to alleviate the harm caused by inaccurate rules. Unlike the existing methods which directly use all rules without correctness checks, it utilizes them selectively by constructing a graphical model with an adaptive reasoning process to prevent performance hazards. Theoretically, we show that induction and abduction are mutually beneficial, and can be rigorously justified from a classical maximum likelihood estimation perspective. Experiments on diverse tasks show that our method can tolerate at least twice as many inaccurate rules as accurate ones and achieve highly competitive performance while other methods cannot. Moreover, the proposal can refine inaccurate rules and works well in extended weakly supervised scenarios.



Paperid:1814
Authors:YongJin Yang, Taehyeon Kim, Se-Young Yun
KAIST AI, KAIST AI, KAIST AI
Abstract:
Cross-domain few-shot learning presents a formidable challenge, as models must be trained on base classes and then tested on novel classes from various domains with only a few samples at hand. While prior approaches have primarily focused on parameter-efficient methods of using adapters, they often overlook two critical issues: shifts in batch statistics and noisy sample statistics arising from domain discrepancy variations. In this paper, we introduce Leveraging Normalization Layer in Adapters with Progressive Learning and Adaptive Distillation (ProLAD), marking two principal contributions. First, our methodology utilizes two separate adapters: one devoid of a normalization layer, which is more effective for similar domains, and another embedded with a normalization layer, designed to leverage the batch statistics of the target domain, thus proving effective for dissimilar domains. Second, to address the pitfalls of noisy statistics, we deploy two strategies: a progressive training of the two adapters and an adaptive distillation technique derived from features determined by the model solely with the adapter devoid of a normalization layer. Through this adaptive distillation, our approach functions as a modulator, controlling the primary adapter for adaptation, based on each domain. Evaluations on standard cross-domain few-shot learning benchmarks confirm that our technique outperforms existing state-of-the-art methodologies.



Paperid:1815
Authors:Zhaoyuan Yang, Zhiwei Xu, Jing Zhang, Richard Hartley, Peter Tu
GE Research, Australian National University, Australian National University, Australian National University, GE Research
Abstract:
In this work, we formulate a novel framework for adversarial robustness using the manifold hypothesis. This framework provides sufficient conditions for defending against adversarial examples. We develop an adversarial purification method with this framework. Our method combines manifold learning with variational inference to provide adversarial robustness without the need for expensive adversarial training. Experimentally, our approach can provide adversarial robustness even if attackers are aware of the existence of the defense. In addition, our method can also serve as a test-time defense mechanism for variational autoencoders.



Paperid:1816
Authors:Samuel Yang-Zhao, Kee Siong Ng, Marcus Hutter
Australian National University, Australian National University, Google DeepMind Australian National University
Abstract:
Prior approximations of AIXI, a Bayesian optimality notion for general reinforcement learning, can only approximate AIXI's Bayesian environment model using an a priori defined set of models. This is a fundamental source of epistemic uncertainty for the agent in settings where the existence of systematic bias in the predefined model class cannot be resolved by simply collecting more data from the environment. We address this issue in the context of Human-AI teaming by considering a setup where additional knowledge for the agent in the form of new candidate models arrives from a human operator in an online fashion. We introduce a new agent called DynamicHedgeAIXI that maintains an exact Bayesian mixture over dynamically changing sets of models via a time-adaptive prior constructed from a variant of the Hedge algorithm. The DynamicHedgeAIXI agent is the richest direct approximation of AIXI known to date and comes with good performance guarantees. Experimental results on epidemic control on contact networks validate the agent's practical utility.



Paperid:1817
Authors:Dixi Yao, Baochun Li
University of Toronto, University of Toronto
Abstract:
Personalized federated learning is a new paradigm to address heterogeneity problems (e.g., issues with non-i.i.d. data) in federated learning. However, existing personalized federated learning methods lack standards for how the personalized and shared parts of the models are designed. Sometimes, manual design can even lead to worse performance than non-personalization. As a result, we propose a new algorithm for personalized federated neural architecture search, called PerFedRLNAS, to automatically personalize the architectures and weights of models on each client. With such an algorithm, we can solve the issues of low efficiency as well as failure to adapt to new search spaces in previous federated neural architecture search work. We further show that automatically assigning different architectures to clients can address the heterogeneity of data distribution, efficiency, and memory in federated learning. In our experiments, we empirically show that our framework achieves much better performance with respect to personalized accuracy and overall time compared to state-of-the-art methods. Furthermore, PerFedRLNAS has a good generalization ability to new clients, and is easy to deploy in practice.



Paperid:1818
Authors:Mingshuai Yao, Yabo Zhang, Xianhui Lin, Xiaoming Li, Wangmeng Zuo
Harbin Institute of Technology, Harbin Institute of Technology, Institute for Intelligent Computing, Harbin Institute of Technology, Harbin Institute of Technology Peng Cheng Laboratory
Abstract:
Few-shot font generation is challenging, as it needs to capture the fine-grained stroke styles from a limited set of reference glyphs, and then transfer them to other characters, which are expected to have similar styles. However, due to the diversity and complexity of Chinese font styles, the synthesized glyphs of existing methods usually exhibit visible artifacts, such as missing details and distorted strokes. In this paper, we propose a VQGAN-based framework (i.e., VQ-Font) to enhance glyph fidelity through token prior refinement and structure-aware enhancement. Specifically, we pre-train a VQGAN to encapsulate a font token prior within a codebook. Subsequently, VQ-Font refines the synthesized glyphs with the codebook to eliminate the domain gap between synthesized and real-world strokes. Furthermore, our VQ-Font leverages the inherent design of Chinese characters, where structure components such as radicals and character components are combined in specific arrangements, to recalibrate fine-grained styles based on references. This process improves the matching and fusion of styles at the structure level. Both modules collaborate to enhance the fidelity of the generated fonts. Experiments on a collected font dataset show that our VQ-Font outperforms the competing methods both quantitatively and qualitatively, especially in generating challenging styles. Our code is available at https://github.com/Yaomingshuai/VQ-Font.



Paperid:1819
Authors:Wenfang Yao, Kejing Yin, William K. Cheung, Jia Liu, Jing Qin
The Hong Kong Polytechnic University, Hong Kong Baptist University, Hong Kong Baptist University, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, The Hong Kong Polytechnic University
Abstract:
The combination of electronic health records (EHR) and medical images is crucial for clinicians in making diagnoses and forecasting prognoses. Strategically fusing these two data modalities has great potential to improve the accuracy of machine learning models in clinical prediction tasks. However, the asynchronous and complementary nature of EHR and medical images presents unique challenges. Missing modalities due to clinical and administrative factors are inevitable in practice, and the significance of each data modality varies depending on the patient and the prediction target, resulting in inconsistent predictions and suboptimal model performance. To address these challenges, we propose DrFuse to achieve effective clinical multimodal fusion. It tackles the missing modality issue by disentangling the features shared across modalities and those unique within each modality. Furthermore, we address the modal inconsistency issue via a disease-wise attention layer that produces the patient- and disease-wise weighting for each modality to make the final prediction. We validate the proposed method using real-world large-scale datasets, MIMIC-IV and MIMIC-CXR. Experimental results show that the proposed method significantly outperforms the state-of-the-art models.



Paperid:1820
Authors:Xufeng Yao, Fanbin Lu, Yuechen Zhang, Xinyun Zhang, Wenqian Zhao, Bei Yu
The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong, The Chinese University of Hong Kong
Abstract:
Knowledge distillation aims at transferring knowledge from the teacher model to the student one by aligning their distributions. Feature-level distillation often uses the L2 distance or its variants as the loss function, based on the assumption that outputs follow normal distributions. This poses a significant challenge when distribution gaps are substantial, since this loss function ignores the variance term. To address the problem, we propose to decompose the transfer objective into small parts and optimize it progressively. This process is inspired by diffusion models, in which the noise distribution is mapped to the target distribution step by step. However, directly employing diffusion models is impractical in the distillation scenario due to their heavy reverse process. To overcome this challenge, we adopt the structural re-parameterization technique to generate multiple student features to approximate the teacher features sequentially. The multiple student features are combined linearly at inference time without extra cost. We present extensive experiments performed on various transfer scenarios, such as CNN-to-CNN and Transformer-to-CNN, that validate the effectiveness of our approach.



Paperid:1821
Authors:Yang Yao, Xin Wang, Yijian Qin, Ziwei Zhang, Wenwu Zhu, Hong Mei
Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University Beijing National Research Center for Information Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University, Department of Computer Science and Technology, Tsinghua University Beijing National Research Center for Information Science and Technology, Tsinghua University, MoE Key Lab of High Confidence Software Technologies, Peking University
Abstract:
Graph neural architecture search (NAS) has achieved great success in designing architectures for graph data processing. However, distribution shifts pose great challenges for graph NAS, since the optimal searched architectures for the training graph data may fail to generalize to the unseen test graph data. The sole prior work tackles this problem by customizing architectures for each graph instance through learning graph structural information, but fails to consider data augmentation during training, which has been proven by existing works to be able to improve generalization. In this paper, we propose Data-augmented Curriculum Graph Neural Architecture Search (DCGAS), which learns an architecture customizer with good generalizability to data under distribution shifts. Specifically, we design an embedding-guided data generator, which can generate sufficient graphs for training to help the model better capture graph structural information. In addition, we design a two-factor uncertainty-based curriculum weighting strategy, which can evaluate the importance of data in enabling the model to learn key information in real-world distributions and reweight them during training. Experimental results on synthetic datasets and real datasets with distribution shifts demonstrate that our proposed method learns generalizable mappings and outperforms existing methods.



Paperid:1822
Authors:Fei Ye, Adrian G. Bors
University of York Mohamed bin Zayed University of Artificial Intelligence, University of York Mohamed bin Zayed University of Artificial Intelligence
Abstract:
Vision Transformers (ViTs) represent self-attention-based network backbones shown to be efficient in many individual tasks, but which have not been explored in Task-Free Continual Learning (TFCL) so far. Most existing ViT-based approaches for Continual Learning (CL) rely on task information. In this study, we explore the advantages of the ViT in a more challenging CL scenario where the task boundaries are unavailable during training. To address this learning paradigm, we propose the Task-Free Dynamic Sparse Vision Transformer (TFDSViT), which can dynamically build new sparse experts, where each expert leverages sparsity to allocate the model's capacity for capturing different information categories over time. To avoid forgetting and ensure efficiency in reusing the previously learned knowledge in subsequent learning, we propose a new dynamic dual attention mechanism consisting of the Sparse Attention (SA') and Knowledge Transfer Attention (KTA) modules. The SA' refrains from updating some previously learned attention blocks for preserving prior knowledge. The KTA uses and regulates the information flow of all previously learned experts for learning new patterns. The proposed dual attention mechanism can simultaneously relieve forgetting and promote knowledge transfer for a dynamic expansion model in a task-free manner. We also propose an energy-based dynamic expansion mechanism using the energy as a measure of novelty for the incoming samples, which provides appropriate expansion signals leading to a compact network architecture for TFDSViT. Extensive empirical studies demonstrate the effectiveness of TFDSViT. The code and supplementary material (SM) are available at https://github.com/dtuzi123/TFDSViT.



Paperid:1823
Authors:Fei Ye, Adrian G. Bors
Department of Computer Science, University of York, York YO10 5GH, UK Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE, Department of Computer Science, University of York, York YO10 5GH, UK Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Abstract:
Human brains can continually acquire and learn new skills and knowledge over time from a dynamically changing environment without forgetting previously learnt information. Such a capacity can selectively transfer some important and recently seen information to the persistent knowledge regions of the brain. Inspired by this intuition, we propose a new memory-based approach for image reconstruction and generation in continual learning, consisting of a temporary and an evolving memory, with two different storage strategies, corresponding to temporary and permanent memorisation. The temporary memory aims to preserve up-to-date information while the evolving memory can dynamically increase its capacity in order to preserve permanent knowledge information. This is achieved by the proposed memory expansion mechanism that selectively transfers those data samples deemed important from the temporary memory to new clusters defined within the evolved memory according to an information novelty criterion. Such a mechanism promotes the knowledge diversity among clusters in the evolved memory, resulting in capturing more diverse information with a compact memory capacity. Furthermore, we propose a two-step optimization strategy for training a Variational Autoencoder (VAE) to implement generation and representation learning tasks, which updates the generator and inference models separately using two optimisation paths. This approach leads to a better trade-off between generation and reconstruction performance. We show empirically and theoretically that the proposed approach can learn meaningful latent representations while generating diverse images from different domains. The source code and supplementary material (SM) are available at https://github.com/dtuzi123/DEMC.



Paperid:1824
Authors:Kai Ye, Tiejin Chen, Hua Wei, Liang Zhan
University of Pittsburgh, Arizona State University, Arizona State University, University of Pittsburgh
Abstract:
The Evidential Regression Network (ERN) represents a novel approach that integrates deep learning with Dempster-Shafer theory to predict a target and quantify the associated uncertainty. Guided by the underlying theory, specific activation functions must be employed to enforce non-negative values, which is a constraint that compromises model performance by limiting its ability to learn from all samples. This paper provides a theoretical analysis of this limitation and introduces an improvement to overcome it. Initially, we define the region where the models cannot effectively learn from the samples. Following this, we thoroughly analyze the ERN and investigate this constraint. Leveraging the insights from our analysis, we address the limitation by introducing a novel regularization term that empowers the ERN to learn from the whole training set. Our extensive experiments substantiate our theoretical findings and demonstrate the effectiveness of the proposed solution.
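For context, evidential regression heads typically map unconstrained network outputs to Normal-Inverse-Gamma parameters, enforcing the non-negativity constraints (nu > 0, alpha > 1, beta > 0) with a softplus activation. The sketch below shows that standard activation pattern, which is the kind of constraint the abstract refers to, with our own helper names; it does not include the paper's proposed regularizer.

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def evidential_head(raw):
    # Map four unconstrained outputs to Normal-Inverse-Gamma parameters.
    # Softplus enforces nu > 0, alpha > 1, beta > 0 as required by the
    # evidential formulation; gamma (the predicted mean) stays unconstrained.
    gamma, nu_raw, alpha_raw, beta_raw = raw
    return {
        "gamma": gamma,
        "nu": softplus(nu_raw),
        "alpha": softplus(alpha_raw) + 1.0,
        "beta": softplus(beta_raw),
    }

print(evidential_head(np.array([0.3, -2.0, 0.5, -1.0])))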



Paperid:1825
Authors:Yuhao Yi, Ronghui You, Hong Liu, Changxin Liu, Yuan Wang, Jiancheng Lv
Sichuan University, Nankai University, Sichuan University, KTH Royal Institute of Technology, Hunan University, Sichuan University
Abstract:
Byzantine machine learning has garnered considerable attention in light of the unpredictable faults that can occur in large-scale distributed learning systems. The key to securing resilience against Byzantine machines in distributed learning lies in resilient aggregation mechanisms. Although abundant resilient aggregation rules have been proposed, they are designed in an ad-hoc manner, imposing extra barriers on comparing, analyzing, and improving the rules across performance criteria. This paper studies near-optimal aggregation rules using clustering in the presence of outliers. Our outlier-robust clustering approach utilizes geometric properties of the update vectors provided by workers. Our analysis shows that constant approximations to the 1-center and 1-mean clustering problems with outliers provide near-optimal resilient aggregators for metric-based criteria, which have been proven to be crucial in the homogeneous and heterogeneous cases, respectively. In addition, we discuss two contradicting types of attacks under which no single aggregation rule is guaranteed to improve upon the naive average. Based on the discussion, we propose a two-phase resilient aggregation framework. We run experiments for image classification using a non-convex loss function. The proposed algorithms outperform previously known aggregation rules by a large margin with both homogeneous and heterogeneous data distributions among non-faulty workers. Code and appendix are available at https://github.com/jerry907/AAAI24-RASHB.
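A simple robust aggregator in the same spirit, though not the clustering-based rule proposed above, is the geometric median of worker updates. The sketch below computes it with Weiszfeld's algorithm and contrasts it with the naive average under a toy Byzantine corruption; the worker counts and corruption values are illustrative assumptions.

import numpy as np

def geometric_median(updates, n_iters=100, eps=1e-8):
    # Weiszfeld's algorithm: iteratively re-weight points by the inverse of
    # their distance to the current estimate. Far less sensitive to a minority
    # of arbitrarily corrupted (Byzantine) updates than the plain mean.
    z = updates.mean(axis=0)
    for _ in range(n_iters):
        d = np.linalg.norm(updates - z, axis=1) + eps
        w = 1.0 / d
        z = (w[:, None] * updates).sum(axis=0) / w.sum()
    return z

rng = np.random.default_rng(0)
honest = rng.normal(0.0, 0.1, size=(8, 4))   # 8 honest gradients near zero
byzantine = np.full((2, 4), 50.0)            # 2 corrupted gradients
updates = np.vstack([honest, byzantine])
print("naive average:   ", updates.mean(axis=0))
print("geometric median:", geometric_median(updates))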



Paperid:1826
Authors:Jun Yin, Shiliang Sun, Lai Wei, Pei Wang
Shanghai Maritime University, Shanghai Jiao Tong University, Shanghai Maritime University, Shanghai Maritime University
Abstract:
Multi-view K-means clustering successfully generalizes K-means from the single-view to the multi-view setting and obtains excellent clustering performance. In every view, it makes each data point close to the center of the corresponding cluster. However, multi-view K-means only considers the compactness of each cluster but ignores the separability of different clusters, which is of great importance to producing a good clustering result. In this paper, we propose Discriminatively Fuzzy Multi-view K-means clustering with Local Structure Preserving (DFMKLS). On the basis of minimizing the distance between each data point and the center of the corresponding cluster, DFMKLS separates clusters by maximizing the distance between the centers of pairwise clusters. DFMKLS also relaxes its objective by introducing the idea of fuzzy clustering, which calculates the probability that a data point belongs to each cluster. Considering that multi-view K-means mainly focuses on the global information of the data, to efficiently use the local information we integrate local structure preserving into the framework of DFMKLS. The effectiveness of DFMKLS is evaluated on benchmark multi-view datasets. It obtains superior performance compared with state-of-the-art multi-view clustering methods, including multi-view K-means.
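The fuzzy relaxation mentioned above assigns each point a membership probability for every cluster. As a point of reference only, a plain single-view fuzzy c-means update (membership u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1)) followed by membership-weighted center updates) is sketched below; it is not the multi-view DFMKLS objective, and the fuzzifier m and iteration budget are our own choices.

import numpy as np

def fuzzy_c_means(X, k=2, m=2.0, n_iters=50, seed=0):
    # Standard fuzzy c-means: alternate soft membership updates and
    # membership-weighted center updates.
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)   # each row is a probability vector
    return U, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
U, centers = fuzzy_c_means(X)
print(centers)   # two centers, near (0, 0) and (3, 3)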



Paperid:1827
Authors:Naiyu Yin, Tian Gao, Yue Yu, Qiang Ji
Rensselaer Polytechnic Institute, IBM Research, Lehigh University, Renselaer Polytechnic Institute
Abstract:
Capturing the underlying structural causal relations represented by Directed Acyclic Graphs (DAGs) has been a fundamental task in various AI disciplines. Causal DAG learning via the continuous optimization framework has recently achieved promising performance in terms of accuracy and efficiency. However, most methods make strong assumptions of homoscedastic noise, i.e., exogenous noises have equal variances across variables, observations, or even both. The noises in real data usually violate both assumptions due to the biases introduced by different data collection processes. To address the heteroscedastic noise issue, we introduce relaxed, implementable sufficient conditions and prove the identifiability of a general class of SEMs subject to those conditions. Based on the identifiable general SEM, we propose a novel formulation for DAG learning that accounts for the noise variance variation across variables and observations. We then propose an effective two-phase iterative DAG learning algorithm to address the increased optimization difficulty and learn a causal DAG from data with heteroscedastic noise whose variance varies across variables and observations. We show significant empirical gains of the proposed approaches over state-of-the-art methods on both synthetic data and real data.



Paperid:1828
Authors:Nan Yin, Mengzhu Wang, Zhenghan Chen, Giulia De Masi, Huan Xiong, Bin Gu
Mohamed bin Zayed University of Artificial Intelligence, School of Artificial Intelligence, Hebei University of Technology, Microsoft Corporation, Technology Innovation Institute, Jilin University Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Harbin Institute of Technology Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Abstract:
The integration of Spiking Neural Networks (SNNs) and Graph Neural Networks (GNNs) is gradually attracting attention due to the low power consumption and high efficiency in processing the non-Euclidean data represented by graphs. However, as a common problem, dynamic graph representation learning faces challenges such as high complexity and large memory overheads. Current work often uses SNNs instead of Recurrent Neural Networks (RNNs) by using binary features instead of continuous ones for efficient training, which overlooks graph structure information and leads to the loss of details during propagation. Additionally, optimizing dynamic spiking models typically requires the propagation of information across time steps, which increases memory requirements. To address these challenges, we present a framework named Dynamic Spiking Graph Neural Networks (Dy-SIGN). To mitigate the information loss problem, Dy-SIGN propagates early-layer information directly to the last layer for information compensation. To accommodate the memory requirements, we apply implicit differentiation at the equilibrium state, which does not rely on the exact reverse of the forward computation. While traditional implicit differentiation methods are usually used for static situations, Dy-SIGN extends them to the dynamic graph setting. Extensive experiments on three large-scale real-world dynamic graph datasets validate the effectiveness of Dy-SIGN on dynamic node classification tasks with lower computational costs.



Paperid:1829
Authors:Zhihui Yin, Jiexi Yan, Chenghao Xu, Cheng Deng
Xidian University, Xidian University, Xidian University, Xidian University
Abstract:
In recent years, many methods have been proposed to address the zero-shot sketch-based image retrieval (ZS-SBIR) task, which is a practical problem in many applications. However, in real-world scenarios, on the one hand, we cannot obtain training data with the same distribution as the test data, and on the other hand, the labels of training data are not available as usual. To tackle this issue, we focus on a new problem, namely unsupervised zero-shot sketch-based image retrieval (UZS-SBIR), where the available training data does not have labels while the training and testing categories are not overlapping. In this paper, we introduce a new asymmetric mutual alignment method (AMA) including a self-distillation module and a cross-modality mutual alignment module. First, we conduct self-distillation to extract the feature embeddings from unlabeled data. Due to the lack of available information in an unsupervised manner, we employ the cross-modality mutual alignment module to further excavate underlying intra-modality and inter-modality relationships from unlabeled data, and take full advantage of these correlations to align the feature embeddings in the image and sketch domains. Meanwhile, the feature representations are enhanced by the intra-modality clustering relations, leading to better generalization ability to unseen classes. Moreover, we adopt an asymmetric strategy to update the teacher and student networks, respectively. Extensive experimental results on several benchmark datasets demonstrate the superiority of our method.



Paperid:1830
Authors:Gwangpyo Yoo, Jinwoo Park, Honguk Woo
Sungkyunkwan University, Sungkyunkwan University, Sungkyunkwan University
Abstract:
In application domains requiring mission-critical decision making, such as finance and robotics, the optimal policy derived by reinforcement learning (RL) often hinges on a preference for risk management. Yet, the dynamic nature of risk measures poses considerable challenges to achieving generalization and adaptation of risk-sensitive policies in the context of RL. In this paper, we propose a risk-conditioned RL model that enables rapid policy adaptation to varying risk measures via a unified risk representation, the Weighted Value-at-Risk (WV@R). To sample risk measures that avoid undue optimism, we construct a risk proposal network employing a conditional adversarial auto-encoder and a normalizing flow. This network establishes coherent representations for risk measures, preserving continuity in terms of the Wasserstein distance on the risk measures. The normalizing flow is used to support non-crossing quantile regression that obtains valid samples for risk measures, and it is also applied to the agent's critic to ascertain the preservation of monotonicity in quantile estimations. Through experiments with locomotion, finance, and self-driving scenarios, we show that our model is capable of adapting to a range of risk measures, achieving comparable performance to the baseline models individually trained for each measure. Our model often outperforms the baselines, especially in cases where exploration is required during training but risk-aversion is favored during evaluation.



Paperid:1831
Authors:En Yu, Jie Lu, Bin Zhang, Guangquan Zhang
University of Technology Sydney, University of Technology Sydney, University of Technology Sydney, University of Technology Sydney
Abstract:
Multi-stream classification poses significant challenges due to the necessity for rapid adaptation in dynamic streaming processes with concept drift. Despite the growing research outcomes in this area, there has been a notable oversight regarding the temporal dynamic relationships between these streams, leading to the issue of negative transfer arising from irrelevant data. In this paper, we propose a novel Online Boosting Adaptive Learning (OBAL) method that effectively addresses this limitation by adaptively learning the dynamic correlation among different streams. Specifically, OBAL operates via a dual-phase mechanism: in the first phase, we design an Adaptive COvariate Shift Adaptation (AdaCOSA) algorithm to construct an initialized ensemble model using archived data from various source streams, thus mitigating the covariate shift while learning the dynamic correlations via an adaptive re-weighting strategy. During the online process, we employ a Gaussian Mixture Model-based weighting mechanism, which is seamlessly integrated with the correlations acquired via AdaCOSA to effectively handle asynchronous drift. This approach significantly improves the predictive performance and stability of the target stream. We conduct comprehensive experiments on several synthetic and real-world data streams, encompassing various drifting scenarios and types. The results clearly demonstrate that OBAL achieves remarkable advancements in addressing multi-stream classification problems by effectively leveraging positive knowledge derived from multiple sources.



Paperid:1832
Authors:Fangchao Yu, Bo Zeng, Kai Zhao, Zhi Pang, Lina Wang
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University
Abstract:
Split learning is a computing-resource-friendly distributed learning framework that protects client training data by splitting the model between the client and server. Previous work has proved that split learning faces a severe risk of privacy leakage, as a malicious server can recover the client's private data by hijacking the training process. In this paper, we first explore the vulnerability of split learning to server-side backdoor attacks, where our goal is to compromise the model's integrity. Since the server-side attacker cannot access the training data and client model in split learning, the traditional poisoning-based backdoor attack methods are no longer applicable. Therefore, constructing backdoor attacks in split learning poses significant challenges. Our strategy involves the attacker establishing a shadow model on the server side that can encode backdoor samples and guiding the client model to learn from this model during the training process, thereby enabling the client to acquire the same capability. Based on these insights, we propose a three-stage backdoor attack framework named SFI. Our attack framework minimizes assumptions about the attacker's background knowledge and ensures that the attack process remains imperceptible to the client. We implement SFI on various benchmark datasets, and extensive experimental results demonstrate its effectiveness and generality. For example, the success rates of our attack on the MNIST, Fashion, and CIFAR10 datasets all exceed 90%, with limited impact on the main task.



Paperid:1833
Authors:Hanfei Yu, Jian Li, Yang Hua, Xu Yuan, Hao Wang
Louisiana State University, Stony Brook University, Queen's University Belfast, University of Delaware, Louisiana State University
Abstract:
Deep reinforcement learning (DRL) has gained immense success in many applications, including gaming AI, robotics, and system scheduling. Distributed algorithms and architectures have been widely proposed (e.g., the actor-learner architecture) to accelerate DRL training with large-scale server-based clusters. However, training on-policy algorithms with the actor-learner architecture unavoidably induces resource wasting due to synchronization between learners and actors, thus resulting in significant extra billing. As a promising alternative, serverless computing naturally fits on-policy synchronization and alleviates resource wasting in distributed DRL training with pay-as-you-go pricing. Yet, no prior work has leveraged serverless computing to facilitate DRL training. This paper proposes MinionsRL, the first serverless distributed DRL training framework that aims to improve DRL training efficiency and cost-efficiency with dynamic actor scaling. We prototype MinionsRL on top of Microsoft Azure Container Instances and evaluate it with popular DRL tasks from OpenAI Gym. Extensive experiments show that MinionsRL reduces total training time by up to 52% and training cost by 86% compared to the latest solutions.



Paperid:1834
Authors:Hang Yu, Zhengyang Liu, Xiangfeng Luo
Shanghai University, Shanghai University, Shanghai University
Abstract:
In recent years, graph-based fraud detection methods have garnered increasing attention for their superior ability to tackle the issue of camouflage in fraudulent scenarios. However, these methods often rely on a substantial proportion of samples as the training set, disregarding the reality of scarce annotated samples in real-life scenarios. As a theoretical framework within semi-supervised learning, the principle of consistency regularization posits that unlabeled samples should be classified into the same category as their own perturbations. Inspired by this principle, this study incorporates unlabeled samples as auxiliary data during model training and designs a novel barely supervised learning method to address the challenge of limited annotated samples in fraud detection. Specifically, to tackle the issue of camouflage in fraudulent scenarios, we employ disentangled representation learning based on edge information for a small subset of annotated nodes. This approach partitions node features into three distinct components representing different connected edges, providing a foundation for the subsequent augmentation of unlabeled samples. For the unlabeled nodes used in auxiliary training, we apply both strong and weak augmentation and design regularization losses to enhance the detection performance of the model in the context of extremely limited labeled samples. Across five publicly available datasets, the proposed model showcases its superior detection capability over baseline models.



Paperid:1835
Authors:Shengju Yu, Siwei Wang, Zhibin Dong, Wenxuan Tu, Suyuan Liu, Zhao Lv, Pan Li, Miao Wang, En Zhu
School of Computer, National University of Defense Technology, Intelligent Game and Decision Lab, School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, School of Computer, National University of Defense Technology, Intelligent Game and Decision Lab, Intelligent Game and Decision Lab, Intelligent Game and Decision Lab, School of Computer, National University of Defense Technology
Abstract:
Multi-view graph clustering (MVGC) derives encouraging grouping results by seamlessly integrating abundant information inside heterogeneous data, and has attracted surging attention recently. Nevertheless, the majority of current MVGC works involve at least one hyper-parameter, which not only requires additional effort for tuning, but also leads to a complicated solving procedure, largely harming the flexibility and scalability of the corresponding algorithms. To this end, in this article we are devoted to getting rid of hyper-parameters and devise a non-parametric graph clustering (NpGC) framework to more practically partition multi-view data. To be specific, we hold that hyper-parameters play a role in balancing the error term and the regularization term so as to form high-quality clustering representations. Therefore, without the assistance of hyper-parameters, how to acquire high-quality representations becomes the key. Inspired by this, we adopt two types of anchors, view-related and view-unrelated, to concurrently mine exclusive characteristics and common characteristics among views. Then, all anchors' information is gathered together via a consensus bipartite graph. In this way, NpGC extracts both complementary and consistent multi-view features, thereby obtaining superior clustering results. Also, its linear complexity enables it to handle datasets with over 120,000 samples. Numerous experiments reveal NpGC's strong points compared to many classical approaches.



Paperid:1836
Authors:Shengju Yu, Siwei Wang, Pei Zhang, Miao Wang, Ziming Wang, Zhe Liu, Liming Fang, En Zhu, Xinwang Liu
School of Computer,National University of Defense Technology, Intelligent Game and Decision Lab, School of Computer,National University of Defense Technology, Intelligent Game and Decision Lab, China Academy of Aerospace Science and Innovation, Zhejiang Lab, Nanjing University of Aeronautics and Astronautics, School of Computer,National University of Defense Technology, School of Computer,National University of Defense Technology
Abstract:
In numerous real-world applications, it is quite common that sample information is partially available for some views due to machine breakdown or sensor failure, causing the problem of incomplete multi-view clustering (IMVC). While several IMVC approaches using view-shared anchors have successfully achieved pleasing performance improvements, (1) they generally construct anchors with only one dimension, which could deteriorate the multi-view diversity, bringing about serious information loss; (2) the constructed anchors typically have a single size, which could not sufficiently characterize the distribution of all samples, leading to limited clustering performance. To generate view-shared anchors with multiple dimensions and sizes for IMVC, we design a novel framework called Diverse View-Shared Anchors based Incomplete multi-view clustering (DVSAI). Concretely, we associate each partial view with several potential spaces. In each space, we enable anchors to communicate among views and generate the view-shared anchors with space-specific dimension and size. Consequently, spaces with various scales make the generated view-shared anchors enjoy diverse dimensions and sizes. Subsequently, we devise an integration scheme with linear computational and memory expenditures to combine the output multi-scale unified anchor graphs such that running a spectral algorithm generates the spectral embedding. Afterwards, we theoretically demonstrate that DVSAI has linear time and space costs and is thus well-suited for tackling large-scale datasets. Finally, comprehensive experiments confirm the effectiveness and advantages of DVSAI.



Paperid:1837
Authors:Xingtong Yu, Yuan Fang, Zemin Liu, Xinming Zhang
University of Science and Technology of China, China, Singapore Management University, Singapore, National University of Singapore, Singapore, University of Science and Technology of China, China
Abstract:
Graph neural networks (GNNs) and heterogeneous graph neural networks (HGNNs) are prominent techniques for homogeneous and heterogeneous graph representation learning, yet their performance in an end-to-end supervised framework greatly depends on the availability of task-specific supervision. To reduce the labeling cost, pre-training on self-supervised pretext tasks has become a popular paradigm, but there is often a gap between the pre-trained model and downstream tasks, stemming from the divergence in their objectives. To bridge the gap, prompt learning has risen as a promising direction, especially in few-shot settings, without the need to fully fine-tune the pre-trained model. While there has been some early exploration of prompt-based learning on graphs, such works primarily deal with homogeneous graphs, ignoring the heterogeneous graphs that are prevalent in downstream applications. In this paper, we propose HGPROMPT, a novel pre-training and prompting framework to unify not only pre-training and downstream tasks but also homogeneous and heterogeneous graphs via a dual-template design. Moreover, we propose dual-prompt in HGPROMPT to assist a downstream task in locating the most relevant prior, bridging the gaps caused by not only feature variations but also heterogeneity differences across tasks. Finally, we thoroughly evaluate and analyze HGPROMPT through extensive experiments on three public datasets.



Paperid:1838
Authors:Yang Yu, Danruo Deng, Furui Liu, Qi Dou, Yueming Jin, Guangyong Chen, Pheng Ann Heng
The Chinese University of Hong Kong, The Chinese University of Hong Kong, Zhejiang Lab, The Chinese University of Hong Kong, National University of Singapore, Zhejiang Lab, The Chinese University of Hong Kong
Abstract:
Semi-supervised learning (SSL) methods assume that labeled data, unlabeled data and test data are from the same distribution. Open-set semi-supervised learning (Open-set SSL) considers a more practical scenario, where unlabeled data and test data contain new categories (outliers) not observed in labeled data (inliers). Most previous works focused on outlier detection via binary classifiers, which suffer from insufficient scalability and inability to distinguish different types of uncertainty. In this paper, we propose a novel framework, Adaptive Negative Evidential Deep Learning (ANEDL), to tackle these limitations. Concretely, we first introduce evidential deep learning (EDL) as an outlier detector to quantify different types of uncertainty, and design different uncertainty metrics for self-training and inference. Furthermore, we propose a novel adaptive negative optimization strategy, making EDL more tailored to the unlabeled dataset containing both inliers and outliers. As demonstrated empirically, our proposed method outperforms existing state-of-the-art methods across four datasets.



Paperid:1839
Authors:Zhidong Yu, Wei Yang, Xike Xie, Zhenbo Shi
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China Hefei National Laboratory, Hefei 230088, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China
Abstract:
Continual Semantic Segmentation (CSS) is an emerging trend, where catastrophic forgetting has been a perplexing problem. In this paper, we propose a Text-to-Image Knowledge Preservation (TIKP) framework to address this issue. TIKP applies Text-to-Image techniques to CSS by automatically generating prompts and performing content adaptation. It extracts associations between the labels of seen data and constructs text-level prompts based on these associations, which are preserved and maintained at each incremental step. During training, these prompts generate correlated images to mitigate catastrophic forgetting. In particular, as the generated images may have different distributions from the original data, TIKP transfers the knowledge by a content adaptation loss, which determines the role played by the generated images in incremental training based on their similarity. In addition, for the classifier, we use the previous model from a different perspective: misclassifying new classes into old objects instead of the background. We propose a knowledge distillation loss based on wrong labels, enabling us to attribute varying weights to individual objects during the distillation process. Extensive experiments conducted in the same setting show that TIKP outperforms state-of-the-art methods by a large margin on benchmark datasets.



Paperid:1840
Authors:Zihao Yu, Haoyang Li, Fangcheng Fu, Xupeng Miao, Bin Cui
School of Computer Science & Key Lab of High Confidence Software Technologies (MOE), Peking University, School of Computer Science & Key Lab of High Confidence Software Technologies (MOE), Peking University, School of Computer Science & Key Lab of High Confidence Software Technologies (MOE), Peking University, Carnegie Mellon University, School of Computer Science & Key Lab of High Confidence Software Technologies (MOE), Peking University Institute of Computational Social Science, Peking University (Qingdao), China
Abstract:
Due to the recent success of diffusion models, text-to-image generation is becoming increasingly popular and achieves a wide range of applications. Among them, text-to-image editing, or continuous text-to-image generation, attracts lots of attention and can potentially improve the quality of generated images. Users often want to slightly edit the generated image by making minor modifications to their input textual descriptions for several rounds of diffusion inference. However, such an image editing process suffers from the low inference efficiency of many existing diffusion models, even using GPU accelerators. To solve this problem, we introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for efficient text-to-image editing. The key intuition behind our approach is to utilize the semantic mapping between the minor modifications on the input text and the affected regions on the output image. For each text editing step, FISEdit can 1) automatically identify the affected image regions and 2) utilize the cached unchanged regions' feature map to accelerate the inference process. For the former, we measure the differences between cached and ad hoc feature maps given the modified textual description, extract the region with significant differences, and capture the affected region by masks. For the latter, we develop an efficient sparse diffusion inference engine that only computes the feature maps for the affected region while reusing the cached statistics for the rest of the image. Finally, extensive empirical results show that FISEdit can be 3.4 times and 4.4 times faster than existing methods on NVIDIA TITAN RTX and A100 GPUs respectively, and even generates more satisfactory images.
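The cache-and-mask idea can be illustrated with a short sketch: compare the cached and newly computed feature maps, threshold their difference into a binary mask, and recompute only inside the mask while reusing the cache elsewhere. The names, the fixed threshold, and the mask dilation are assumptions for illustration, not FISEdit's actual implementation.

```python
import torch
import torch.nn.functional as F

def affected_region_mask(cached_feat, new_feat, threshold=0.1, dilate=1):
    # cached_feat, new_feat: (B, C, H, W) feature maps before and after the text edit.
    diff = (new_feat - cached_feat).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    mask = (diff > threshold).float()
    if dilate > 0:  # grow the mask slightly so the edited region is not clipped
        mask = F.max_pool2d(mask, kernel_size=2 * dilate + 1, stride=1, padding=dilate)
    return mask

def sparse_update(cached_out, recomputed_out, mask):
    # Recompute inside the mask, reuse the cached result outside it.
    return mask * recomputed_out + (1 - mask) * cached_out
```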



Paperid:1841
Authors:Yige Yuan, Bingbing Xu, Bo Lin, Liang Hou, Fei Sun, Huawei Shen, Xueqi Cheng
CAS Key Laboratory of AI Safety & Security, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, CAS Key Laboratory of AI Safety & Security, Institute of Computing Technology, Chinese Academy of Sciences, Department of Mathematics, National University of Singapore, CAS Key Laboratory of AI Safety & Security, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, CAS Key Laboratory of AI Safety & Security, Institute of Computing Technology, Chinese Academy of Sciences, CAS Key Laboratory of AI Safety & Security, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, CAS Key Laboratory of AI Safety & Security, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences
Abstract:
The generalization of neural networks is a central challenge in machine learning, especially concerning the performance under distributions that differ from training ones. Current methods, mainly based on the data-driven paradigm such as data augmentation, adversarial training, and noise injection, may encounter limited generalization due to model non-smoothness. In this paper, we propose to investigate generalization from a Partial Differential Equation (PDE) perspective, aiming to enhance it directly through the underlying function of neural networks, rather than focusing on adjusting input data. Specifically, we first establish the connection between neural network generalization and the smoothness of the solution to a specific PDE, namely the transport equation. Building upon this, we propose a general framework that introduces adaptive distributional diffusion into the transport equation to enhance the smoothness of its solution, thereby improving generalization. In the context of neural networks, we put this theoretical framework into practice as PDE+ (PDE with Adaptive Distributional Diffusion), which diffuses each sample into a distribution covering semantically similar inputs. This enables better coverage of potentially unobserved distributions in training, thus improving generalization beyond merely data-driven methods. The effectiveness of PDE+ is validated through extensive experimental settings, demonstrating its superior performance compared to state-of-the-art methods. Our code is available at https://github.com/yuanyige/pde-add.



Paperid:1842
Authors:Zixuan Yuan, Hao Liu, Haoyi Zhou, Denghui Zhang, Xiao Zhang, Hao Wang, Hui Xiong
Hong Kong University of Science and Technology (Guangzhou), Hong Kong University of Science and Technology (Guangzhou), Beihang University, Stevens Institute of Technology, Shandong University, Alibaba Group, Hong Kong University of Science and Technology (Guangzhou)
Abstract:
Hierarchical Multi-Label Classification (HMLC) is a well-established problem that aims at assigning data instances to multiple classes stored in a hierarchical structure. Despite its importance, existing approaches often face two key limitations: (i) They employ dense networks to solely explore the class hierarchy as a hard criterion for maintaining taxonomic consistency among predicted classes, yet without leveraging rich semantic relationships between instances and classes; (ii) They struggle to generalize in settings with deep class levels, since the mini-batches uniformly sampled from different levels ignore the varying complexities of data and result in a non-smooth model adaptation to sparse data. To mitigate these issues, we present a Self-Paced Unified Representation (SPUR) learning framework, which focuses on the interplay between instances and classes to flexibly organize the training process of HMLC algorithms. Our framework consists of two lightweight encoders designed to capture the semantics of input features and the topological information of the class hierarchy. These encoders generate unified embeddings of instances and the class hierarchy, which enable SPUR to exploit semantic dependencies between them and produce predictions in line with taxonomic constraints. Furthermore, we introduce a dynamic hardness measurement strategy that considers both the class hierarchy and instance features to estimate the learning difficulty of each instance. This strategy is achieved by incorporating the propagation loss obtained at each hierarchical level, allowing for a more comprehensive assessment of learning complexity. Extensive experiments on several empirical benchmarks demonstrate the effectiveness and efficiency of SPUR compared to state-of-the-art methods, especially in scenarios with missing features.



Paperid:1843
Authors:Angxiao Yue, Dixin Luo, Hongteng Xu
Renmin University of China, Beijing, China, Beijing Institute of Technology, Beijing, China, Renmin University of China, Beijing, China Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
Abstract:
Graph neural networks have been widely used to represent 3D molecules, which capture molecular attributes and geometric information through various message-passing mechanisms. This study proposes a novel quaternion message-passing (QMP) module that can be plugged into many existing 3D molecular representation models and enhance their power for distinguishing molecular conformations. In particular, our QMP module represents the 3D rotations between one chemical bond and its neighbor bonds as a quaternion sequence. Then, it aggregates the rotations by the chained Hamilton product of the quaternions. The real part of the output quaternion is invariant to the global 3D rotations of molecules but sensitive to the local torsions caused by twisting bonds, providing discriminative information for training molecular conformation representation models. In theory, we prove that considering these features enables invariant GNNs to distinguish the conformations caused by bond torsions. We encapsulate the QMP module with acceleration, so combining existing models with the QMP requires merely one line of code and little computational cost. Experiments on various molecular datasets show that plugging our QMP module into existing invariant GNNs leads to consistent and significant improvements in molecular conformation representation and downstream tasks.
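A minimal sketch of the chained Hamilton product described above, assuming the bond-to-neighbor rotations are already given as unit quaternions; the function names are hypothetical and the sketch omits the learned message-passing around it.

```python
import numpy as np

def hamilton_product(q1, q2):
    # Quaternions as (w, x, y, z); standard Hamilton product.
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def chained_torsion_feature(rotation_quaternions):
    # Chain the rotations between a bond and its neighbor bonds; the real
    # (scalar) part of the result is invariant to global rotations but
    # sensitive to local torsions caused by twisting bonds.
    out = np.array([1.0, 0.0, 0.0, 0.0])  # identity quaternion
    for q in rotation_quaternions:
        out = hamilton_product(out, q)
    return out[0]
```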



Paperid:1844
Authors:Yu Zang, Zhe Xue, Shilong Ou, Lingyang Chu, Junping Du, Yunfei Long
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, McMaster University, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Asynchronous federated learning (AFL) is a distributed machine learning technique that allows multiple devices to collaboratively train deep learning models without sharing local data. However, AFL suffers from low efficiency due to poor client model training quality and slow server model convergence speed, which are a result of the heterogeneous nature of both data and devices. To address these issues, we propose Efficient Asynchronous Federated Learning with Prospective Momentum Aggregation and Fine-Grained Correction (FedAC). Our framework consists of three key components. The first component is client weight evaluation based on the temporal gradient, which evaluates the client weight based on the similarity between the client and server update directions. The second component is adaptive server update with prospective weighted momentum, which uses an asynchronous buffered update strategy and a prospective weighted momentum with an adaptive learning rate to update the global model on the server. The last component is client update with fine-grained gradient correction, which introduces a fine-grained gradient correction term to mitigate client drift and correct the client stochastic gradient. We conduct experiments on real and synthetic datasets, and compare FedAC with existing federated learning methods. Experimental results demonstrate that our framework effectively improves model training efficiency and AFL performance.
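The server-side update can be sketched as follows: buffered client deltas are aggregated with similarity-derived weights and folded into a momentum term before updating the global model. All names, the weighting source, and the momentum form are assumptions used only to illustrate the idea.

```python
import torch

def server_update(global_state, buffered_updates, momentum, lr=1.0, beta=0.9):
    # global_state / momentum: dicts of tensors (a model state_dict and its momentum buffer).
    # buffered_updates: list of (client_delta, weight) pairs, where weight could come
    # from the similarity between client and server update directions.
    total_w = sum(w for _, w in buffered_updates)
    agg = {k: torch.zeros_like(v) for k, v in global_state.items()}
    for delta, w in buffered_updates:
        for k in agg:
            agg[k] += (w / total_w) * delta[k]
    for k in global_state:
        momentum[k] = beta * momentum[k] + (1.0 - beta) * agg[k]   # weighted momentum
        global_state[k] = global_state[k] + lr * momentum[k]
    return global_state, momentum
```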



Paperid:1845
Authors:Qiuhao Zeng, Wei Wang, Fan Zhou, Gezheng Xu, Ruizhi Pu, Changjian Shui, Christian Gagné, Shichun Yang, Charles X. Ling, Boyu Wang
University of Western Ontario, University of Western Ontario, Beihang University, University of Western Ontario, University of Western Ontario, Vector Institute, Université Laval, Beihang University, University of Western Ontario, University of Western Ontario
Abstract:
In the field of domain generalization, the task of constructing a predictive model capable of generalizing to a target domain without access to target data remains challenging. This problem becomes further complicated when considering evolving dynamics between domains. While various approaches have been proposed to address this issue, a comprehensive understanding of the underlying generalization theory is still lacking. In this study, we contribute novel theoretical results showing that aligning conditional distributions leads to a reduction of the generalization bound. Our analysis serves as a key motivation for solving the Temporal Domain Generalization (TDG) problem through the application of Koopman Neural Operators, resulting in Temporal Koopman Networks (TKNets). By employing Koopman Neural Operators, we effectively address the time-evolving distributions encountered in TDG using the principles of Koopman theory, where measurement functions are sought to establish linear transition relations between evolving domains. Through empirical evaluations conducted on synthetic and real-world datasets, we validate the effectiveness of our proposed approach.



Paperid:1846
Authors:Zhichen Zeng, Boxin Du, Si Zhang, Yinglong Xia, Zhining Liu, Hanghang Tong
University of Illinois at Urbana-Champaign, Amazon, Meta, Meta, University of Illinois at Urbana-Champaign, University of Illinois at Urbana-Champaign
Abstract:
Finding node correspondence across networks, namely multi-network alignment, is an essential prerequisite for joint learning on multiple networks. Despite great success in aligning networks in pairs, the literature on multi-network alignment is sparse due to the exponentially growing solution space and the lack of high-order discrepancy measures. To fill this gap, we propose a hierarchical multi-marginal optimal transport framework named HOT for multi-network alignment. To handle the large solution space, multiple networks are decomposed into smaller aligned clusters via the fused Gromov-Wasserstein (FGW) barycenter. To depict high-order relationships across multiple networks, the FGW distance is generalized to the multi-marginal setting, based on which networks can be aligned jointly. A fast proximal point method is further developed with guaranteed convergence to a local optimum. Extensive experiments and analysis show that our proposed HOT achieves significant improvements over the state-of-the-art in both effectiveness and scalability.



Paperid:1847
Authors:Lei Zhai, Shuyuan Yang, Yitong Li, Zhixi Feng, Zhihao Chang, Quanwei Gao
Xidian University, Xidian University, Xi’an Jiaotong University, Xidian University, Xidian University, Xidian University
Abstract:
Deep learning methods have achieved outstanding performance in various signal tasks. However, due to degraded signals in real electromagnetic environments, it is crucial to seek methods that can improve the representation of signal features. In this paper, a Singular Value Decomposition-based Attention (SVA) is proposed to explore the structure of signal data and adaptively enhance intrinsic features. Using a deep neural network as a base model, SVA performs feature semantic subspace learning through a decomposition layer and combines it with an attention layer to achieve adaptive enhancement of signal features. Moreover, we consider the gradient explosion problem brought by SVA and optimize SVA to improve the stability of training. Extensive experimental results demonstrate that applying SVA to a generalized classification model can significantly improve its representation ability, making its recognition performance competitive with, or even better than, state-of-the-art task-specific models.
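One way to read the decomposition-plus-attention idea is to decompose the feature matrix, re-weight its singular values with a small learned gate, and reconstruct the feature; the gate parameterization below is a guess for illustration, not the paper's actual layer.

```python
import torch
import torch.nn as nn

class SVDAttention(nn.Module):
    # Illustrative sketch: gate each singular value of the feature matrix and
    # reconstruct, so dominant semantic subspaces can be adaptively enhanced.
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, x):                                   # x: (batch, length, channels)
        u, s, vh = torch.linalg.svd(x, full_matrices=False)
        gate = torch.sigmoid(self.alpha * s + self.beta)    # attention over singular values
        return u @ torch.diag_embed(s * gate) @ vh
```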



Paperid:1848
Authors:Yuanzhao Zhai, Yiying Li, Zijian Gao, Xudong Gong, Kele Xu, Dawei Feng, Ding Bo, Huaimin Wang
National University of Defense Technology, Changsha, China State Key Laboratory of Complex & Critical Software Environment, Changsha, China, Artificial Intelligence Research Center, DII, Beijing, China, National University of Defense Technology, Changsha, China State Key Laboratory of Complex & Critical Software Environment, Changsha, China, National University of Defense Technology, Changsha, China State Key Laboratory of Complex & Critical Software Environment, Changsha, China, National University of Defense Technology, Changsha, China State Key Laboratory of Complex & Critical Software Environment, Changsha, China, National University of Defense Technology, Changsha, China State Key Laboratory of Complex & Critical Software Environment, Changsha, China, National University of Defense Technology, Changsha, China State Key Laboratory of Complex & Critical Software Environment, Changsha, China, National University of Defense Technology, Changsha, China State Key Laboratory of Complex & Critical Software Environment, Changsha, China
Abstract:
Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism for policy optimization, usually via constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages the policies from learning in out-of-distribution (OOD) regions beyond the support of offline datasets, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We initially observed the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. Then we relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely-used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
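The relabeling step can be sketched in a few lines: rollouts sampled by the optimistic policy are re-scored with an uncertainty penalty before being handed to pessimistic policy optimization. The penalty coefficient and the ensemble-disagreement uncertainty are assumptions, not ORPO's exact choices.

```python
def relabel_with_penalty(rollouts, uncertainty_fn, penalty_coef=1.0):
    # rollouts: iterable of (state, action, reward) tuples sampled in the
    # learned dynamics model by the optimistic rollout policy.
    relabeled = []
    for state, action, reward in rollouts:
        penalty = penalty_coef * uncertainty_fn(state, action)  # e.g. model-ensemble disagreement
        relabeled.append((state, action, reward - penalty))     # pessimistic (penalized) reward
    return relabeled
```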



Paperid:1849
Authors:Baoquan Zhang, Chuyao Luo, Demin Yu, Xutao Li, Huiwei Lin, Yunming Ye, Bowen Zhang
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology Shenzhen Graduate School, Harbin Institute of Technology, Harbin Institute of Technology Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen Technology University
Abstract:
Equipping a deep model with the ability of few-shot learning (FSL) is a core challenge for artificial intelligence. Gradient-based meta-learning effectively addresses the challenge by learning how to learn novel tasks. Its key idea is learning a deep model in a bi-level optimization manner, where the outer-loop process learns a shared gradient descent algorithm (called meta-optimizer), while the inner-loop process leverages it to optimize a task-specific base learner with few examples. Although these methods have shown superior performance on FSL, the outer-loop process requires calculating second-order derivatives along the inner-loop path, which imposes considerable memory burdens and the risk of vanishing gradients. This degrades meta-learning performance. Inspired by recent diffusion models, we find that the inner-loop gradient descent process can be viewed as a reverse process (i.e., denoising) of diffusion, where the target of denoising is the weights of the base learner rather than the original data. Based on this fact, we propose to model the gradient descent algorithm as a diffusion model and then present a novel conditional diffusion-based meta-learning, called MetaDiff, which effectively models the optimization process of base learner weights from Gaussian initialization to target weights in a denoising manner. Thanks to the training efficiency of diffusion models, our MetaDiff does not need to differentiate through the inner-loop path, so the memory burdens and the risk of vanishing gradients can be effectively alleviated, improving FSL. Experimental results show that our MetaDiff outperforms the state-of-the-art gradient-based meta-learning family on FSL tasks.



Paperid:1850
Authors:Chao Zhang, Xiuyi Jia, Zechao Li, Chunlin Chen, Huaxiong Li
Nanjing University, Nanjing University of Science and Technology, Nanjing University of Science and Technology, Nanjing University, Nanjing University
Abstract:
Due to its effectiveness and efficiency, anchor-based multi-view clustering (MVC) has recently attracted much attention. Most existing approaches try to adaptively learn anchors to construct an anchor graph for clustering. However, they generally focus on improving the diversity among anchors by using an orthogonal constraint and ignore the underlying semantic relations, which may make the anchors not representative and discriminative enough. To address this problem, we propose an adaptive Cluster-wise Anchor learning based MVC method, CAMVC for short. We first make an anchor cluster assumption that supposes the prior cluster structure of target anchors by pre-defining a consensus cluster indicator matrix. Based on the prior knowledge, an explicit cluster structure of latent anchors is enforced by learning diverse cluster centroids, which can explore both inter-cluster diversity and intra-cluster consistency of anchors, and improve the subspace representation discrimination. Extensive results demonstrate the effectiveness and superiority of our proposed method compared with some state-of-the-art MVC approaches.



Paperid:1851
Authors:Dekai Zhang, Matt Williams, Francesca Toni
Department of Computing, Imperial College London, Department of Radiotherapy, Charing Cross Hospital Institute of Global Health Innovation, Imperial College London, Department of Computing, Imperial College London
Abstract:
Neural networks (NNs) can learn to rely on spurious signals in the training data, leading to poor generalisation. Recent methods tackle this problem by training NNs with additional ground-truth annotations of such signals. These methods may, however, let spurious signals re-emerge in deep convolutional NNs (CNNs). We propose Targeted Activation Penalty (TAP), a new method tackling the same problem by penalising activations to control the re-emergence of spurious signals in deep CNNs, while also lowering training times and memory usage. In addition, ground-truth annotations can be expensive to obtain. We show that TAP still works well with annotations generated by pre-trained models as effective substitutes of ground-truth annotations. We demonstrate the power of TAP against two state-of-the-art baselines on the MNIST benchmark and on two clinical image datasets, using four different CNN architectures.
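One plausible form of such an activation penalty is sketched below: feature-map activations inside annotated spurious regions are penalised and the term is added to the task loss. The function name, the mask convention, and the absolute-value penalty are assumptions for illustration.

```python
import torch

def targeted_activation_penalty(activations, spurious_mask, weight=1.0):
    # activations: (B, C, H, W) feature maps from a chosen CNN layer.
    # spurious_mask: (B, 1, H, W) binary mask marking spurious regions, from
    # ground-truth annotations or a pre-trained model's saliency maps.
    return weight * (activations.abs() * spurious_mask).mean()

# Used as: loss = task_loss + targeted_activation_penalty(feats, mask)
```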



Paperid:1852
Authors:Ding-Chu Zhang, Zhi Zhou, Yu-Feng Li
Nanjing University, Nanjing University, Nanjing University
Abstract:
CLIP has demonstrated remarkable generalization across diverse downstream tasks. By aligning images and texts in a shared feature space, it enables zero-shot classification via hand-crafted prompts. However, recent studies have shown that hand-crafted prompts may be unsuitable in practical applications. Specifically, choosing an appropriate prompt for a given task requires accurate data and knowledge, which may not be obtainable in practical situations. An inappropriate prompt can result in poor performance. Moreover, if there is no training data, tuning prompts arbitrarily through unlabeled test data may lead to serious performance degradation from the given hand-crafted prompts. Our study reveals that the aforementioned problems are mainly due to biases in the testing data (Data Bias) and the pre-trained CLIP model (Model Bias). The Data Bias makes it challenging to choose an appropriate prompt, while Model Bias renders some predictions inaccurate and biased, which leads to error accumulation. To address these biases, we propose robust test-time Adaptation for zero-shot Prompt tuning (ADAPROMPT). Specifically, we ensemble multiple prompts to avoid the worst-case results and dynamically tune prompts to adapt to Data Bias during testing. Furthermore, we adopt a confidence-aware buffer to store balanced and confident unlabeled test data to tune prompts in order to overcome Model Bias. Our extensive experiments on several benchmarks demonstrate that ADAPROMPT alleviates model bias, adapts to data bias and mostly outperforms the state-of-the-art methods at a small time cost. Moreover, our experimental results reveal that ADAPROMPT hardly encounters any performance degradation on these datasets.



Paperid:1853
Authors:Dongmei Zhang, Chang Li, Renrui Zhang, Shenghao Xie, Wei Xue, Xiaodong Xie, Shanghang Zhang
Peking University, Peking University, The Chinese University of Hong Kong, Wuhan University, Hong Kong Univerisity of Science and Technology, Peking University, Peking University
Abstract:
The superior performances of pre-trained foundation models in various visual tasks underscore their potential to enhance the 2D models' open-vocabulary ability. Existing methods explore analogous applications in the 3D space. However, most of them only center around knowledge extraction from singular foundation models, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from various foundation models can improve knowledge transfer from 2D pre-trained visual language models to the 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of the 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary without facing constraints from original 3D datasets. Specifically, to learn the open-vocabulary 3D localization ability, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For the open-vocabulary 3D recognition ability, we leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP. The experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and successfully achieves state-of-the-art performance in open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git.



Paperid:1854
Authors:Hansong Zhang, Shikun Li, Dan Zeng, Chenggang Yan, Shiming Ge
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100092, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100092, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China, Department of Communication Engineering, Shanghai University, Shanghai 200040, China, Hangzhou Dianzi University, Hangzhou 310018, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100092, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
Abstract:
As datasets grow larger, accurately annotating them becomes increasingly impractical due to the cost in both time and money. Therefore, crowd-sourcing has been widely adopted to alleviate the cost of collecting labels, which also inevitably introduces label noise and eventually degrades the performance of the model. To learn from crowd-sourcing annotations, modeling the expertise of each annotator is a common but challenging paradigm, because the annotations collected by crowd-sourcing are usually highly sparse. To alleviate this problem, we propose Coupled Confusion Correction (CCC), where two models are simultaneously trained to correct the confusion matrices learned by each other. Via bi-level optimization, the confusion matrices learned by one model can be corrected by the distilled data from the other. Moreover, we cluster the "annotator groups" who share similar expertise so that their confusion matrices can be corrected together. In this way, the expertise of the annotators, especially of those who provide few labels, can be better captured. Remarkably, we point out that annotation sparsity not only means that the average number of labels is low, but also that there are always some annotators who provide very few labels, which is neglected by previous works when constructing synthetic crowd-sourcing annotations. Based on that, we propose to use the Beta distribution to control the generation of the crowd-sourcing labels so that the synthetic annotations can be more consistent with real-world ones. Extensive experiments are conducted on two types of synthetic datasets and three real-world datasets, the results of which demonstrate that CCC significantly outperforms state-of-the-art approaches. Source codes are available at: https://github.com/Hansong-Zhang/CCC.
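The Beta-distribution idea for building more realistic synthetic annotations can be sketched as follows; the specific Beta parameters are assumptions chosen only to produce a long-tailed pattern where a few annotators label many samples and most label very few.

```python
import numpy as np

def simulate_crowd_labels(num_samples, num_annotators, a=0.5, b=3.0, seed=0):
    # Draw each annotator's labeling rate from Beta(a, b), then decide which
    # samples each annotator labels; this mimics the sparsity of real
    # crowd-sourced annotation better than a uniform labeling rate.
    rng = np.random.default_rng(seed)
    rates = rng.beta(a, b, size=num_annotators)                   # per-annotator labeling rate
    label_mask = rng.random((num_samples, num_annotators)) < rates
    return label_mask, rates
```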



Paperid:1855
Authors:Hao-Kai Zhang, Chengkai Zhu, Geng Liu, Xin Wang
Institute for Advanced Study, Tsinghua University, Beijing 100084, China Institute for Quantum Computing, Baidu Research, Beijing 100193, China, Thrust of Artificial Intelligence, Information Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, China Institute for Quantum Computing, Baidu Research, Beijing 100193, China, Institute for Quantum Computing, Baidu Research, Beijing 100193, China, Thrust of Artificial Intelligence, Information Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, China Institute for Quantum Computing, Baidu Research, Beijing 100193, China
Abstract:
Quantum neural networks (QNNs) have become a leading paradigm for establishing near-term quantum applications in recent years. The trainability issue of QNNs has garnered extensive attention, spurring demand for a comprehensive analysis of QNNs in order to identify viable solutions. In this work, we propose a perspective that characterizes the trainability of QNNs based on their locality. We prove that the entire variation range of the loss function via adjusting any local quantum gate vanishes exponentially in the number of qubits with high probability for a broad class of QNNs. This result reveals extra harsh constraints independent of gradients and unifies the restrictions on gradient-based and gradient-free optimizations naturally. We showcase the validity of our results with numerical simulations of representative models and examples. Our findings, as a fundamental property of random quantum circuits, deepen the understanding of the role of locality in QNNs and serve as a guideline for assessing the effectiveness of diverse training strategies for quantum neural networks.



Paperid:1856
Authors:Heng-Kai Zhang, Yi-Ge Zhang, Zhi Zhou, Yu-Feng Li
Nanjing University, Nanjing University, Nanjing University, Nanjing University
Abstract:
Graph Attention Networks (GATs), which compute a node's representation from its lower-order neighbors, are a state-of-the-art architecture for representation learning with graphs. In practice, however, the high-order neighbors that turn out to be useful remain largely unemployed in GATs. Efforts on this issue remain limited. This paper proposes a simple and effective high-order neighbor GAT (HONGAT) model to both effectively exploit informative high-order neighbors and address over-smoothing at the decision boundary of nodes. Two tightly coupled novel technologies, namely common-neighbor similarity and a new masking matrix, are introduced. Specifically, high-order neighbors are fully explored by a generic high-order common-neighbor-based similarity; to prevent severe over-smoothing, where the typical averaging range no longer works well, a new masking mechanism is employed without any extra hyperparameter. Extensive empirical evaluation on real-world datasets clearly shows the value of the new algorithm in exploring high-order neighbors, which promisingly achieves significant gains over previous state-of-the-art graph attention methods.
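The common-neighbor-based similarity can be sketched with plain adjacency algebra: the entries of the squared adjacency matrix count shared neighbors, and a degree normalization turns the counts into a similarity score usable for attending to high-order neighbors. The normalization below is a standard choice, not necessarily the one used in HONGAT.

```python
import numpy as np

def common_neighbor_similarity(adj):
    # adj: (N, N) binary, symmetric adjacency matrix.
    common = adj @ adj                              # (i, j) entry counts common neighbors
    deg = adj.sum(axis=1, keepdims=True)            # node degrees
    denom = np.sqrt(deg @ deg.T) + 1e-8             # cosine-style normalization
    return common / denom
```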



Paperid:1857
Authors:Hong Zhang, Yu Zhang
State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University, Hangzhou, China, State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University, Hangzhou, China Key Laboratory of Collaborative Sensing and Autonomous Unmanned Systems of Zhejiang Province, Hangzhou, China
Abstract:
Spiking neural networks (SNNs) are potential competitors to artificial neural networks (ANNs) due to their high energy efficiency on neuromorphic hardware. However, SNNs are unfolded over simulation time steps during the training process. Thus, SNNs require much more memory than ANNs, which impedes the training of deeper SNN models. In this paper, we propose the reversible spiking neural network to reduce the memory cost of intermediate activations and membrane potentials during training. Firstly, we extend the reversible architecture along the temporal dimension and propose the reversible spiking block, which can reconstruct the computational graph and recompute all intermediate variables in the forward pass with a reverse process. On this basis, we adapt state-of-the-art SNN models to reversible variants, namely the reversible spiking ResNet (RevSResNet) and the reversible spiking transformer (RevSFormer). Through experiments on static and neuromorphic datasets, we demonstrate that the memory cost per image of our reversible SNNs does not increase with the network depth. On the CIFAR10 and CIFAR100 datasets, our RevSResNet37 and RevSFormer-4-384 achieve comparable accuracies and consume 3.79x and 3.00x lower GPU memory per image than their counterparts with roughly identical model complexity and parameters. We believe that this work can unleash the memory constraints in SNN training and pave the way for training extremely large and deep SNNs.
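The reversible-block idea that the abstract builds on can be sketched generically: outputs are computed so that the inputs can be reconstructed exactly, so intermediate activations (and, per timestep, membrane potentials) need not be cached. This is the standard two-stream reversible residual pattern, not the paper's exact spiking block.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    # Two-stream reversible block: (x1, x2) -> (y1, y2) with an exact inverse,
    # so forward activations can be recomputed instead of stored.
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1, y2):
        # Reconstruct the inputs from the outputs during the backward sweep.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```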



Paperid:1858
Authors:Jianqing Zhang, Yang Liu, Yang Hua, Jian Cao
Shanghai Jiao Tong University, Institute for AI Industry Research, Tsinghua University, Queen's University Belfast, Shanghai Jiao Tong University
Abstract:
Recently, Heterogeneous Federated Learning (HtFL) has attracted attention due to its ability to support heterogeneous models and data. To reduce the high communication cost of transmitting model parameters, a major challenge in HtFL, prototype-based HtFL methods are proposed to solely share class representatives, a.k.a. prototypes, among heterogeneous clients while maintaining the privacy of clients' models. However, these prototypes are naively aggregated into global prototypes on the server using weighted averaging, resulting in suboptimal global knowledge, which negatively impacts the performance of clients. To overcome this challenge, we introduce a novel HtFL approach called FedTGP, which leverages our Adaptive-margin-enhanced Contrastive Learning (ACL) to learn Trainable Global Prototypes (TGP) on the server. By incorporating ACL, our approach enhances prototype separability while preserving semantic meaning. Extensive experiments with twelve heterogeneous models demonstrate that our FedTGP surpasses state-of-the-art methods by up to 9.08% in accuracy while maintaining the communication and privacy advantages of prototype-based HtFL. Our code is available at https://github.com/TsingZ0/FedTGP.



Paperid:1859
Authors:Litian Zhang, Xiaoming Zhang, Ziyi Zhou, Feiran Huang, Chaozhuo Li
Beihang University, Beihang University, Beihang University, Jinan University, Beijing University of Posts and Telecommunications
Abstract:
Nowadays, detecting multimodal fake news has emerged as a foremost concern, since the widespread dissemination of fake news may incur adverse societal impact. Conventional methods generally focus on capturing the linguistic and visual semantics within the multimodal content, which fall short in effectively distinguishing the heightened level of meticulous fabrications. Recently, external knowledge has been introduced to provide valuable background facts as a complement to facilitate news detection. Nevertheless, existing knowledge-enhanced endeavors directly incorporate all knowledge contexts through static entity embeddings, resulting in potentially noisy and content-irrelevant knowledge. Moreover, the integration of knowledge entities makes it intractable to model the sophisticated correlations between multimodal semantics and knowledge entities. In light of these limitations, we propose a novel Adaptive Knowledge-Aware Fake News Detection model, dubbed AKA-Fake. For each news item, AKA-Fake learns a compact knowledge subgraph under a reinforcement learning paradigm, which consists of a subset of entities and contextual neighbors in the knowledge graph, retaining the most informative knowledge facts. A novel heterogeneous graph learning module is further proposed to capture the reliable cross-modality correlations via topology refinement and modality-attentive pooling. Our proposal is extensively evaluated over three popular datasets, and experimental results demonstrate the superiority of AKA-Fake.



Paperid:1860
Authors:Pingyue Zhang, Mengyue Wu
MoE Key Lab of Artificial Intelligence, AI Institute X-LANCE Lab, Department of Computer Science and Engineering Shanghai Jiao Tong University, Shanghai, China, MoE Key Lab of Artificial Intelligence, AI Institute X-LANCE Lab, Department of Computer Science and Engineering Shanghai Jiao Tong University, Shanghai, China
Abstract:
Multi-label classification is an arduous problem given the complication in label correlation. Whilst sharing a common goal with contrastive learning in utilizing correlations for representation learning, how to better leverage label information remains challenging. Previous endeavors include extracting label-level representations or mapping labels to an embedding space, overlooking the correlation between multiple labels. This leaves great ambiguity in determining positive samples with different extents of label overlap between samples and integrating such relations in loss functions. In our work, we propose Multi-Label Supervised Contrastive learning (MulSupCon) with a novel contrastive loss function that adjusts weights based on how much overlap one sample shares with the anchor. By analyzing gradients, we explain why our method performs better under multi-label circumstances. To evaluate, we conduct direct classification and transfer learning on several multi-label datasets, including widely-used image datasets such as MS-COCO and NUS-WIDE. Validation indicates that our method outperforms the traditional multi-label classification method and shows competitive performance when compared to other existing approaches.
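A rough sketch of an overlap-weighted supervised contrastive loss is given below; weighting positives by the fraction of the anchor's labels they share is one plausible reading of the idea, and the exact weighting and temperature in MulSupCon may differ.

```python
import torch

def overlap_weighted_contrastive_loss(embeddings, labels, temperature=0.1):
    # embeddings: (N, D) L2-normalized features; labels: (N, C) multi-hot targets.
    labels = labels.float()
    sim = embeddings @ embeddings.t() / temperature
    not_self = 1.0 - torch.eye(len(embeddings), device=sim.device)
    # Weight each pair by the label overlap |y_i ∩ y_j| / |y_i| of the anchor i.
    overlap = (labels @ labels.t()) / labels.sum(dim=1, keepdim=True).clamp(min=1.0)
    overlap = overlap * not_self
    log_prob = sim - torch.log((not_self * sim.exp()).sum(dim=1, keepdim=True) + 1e-12)
    loss = -(overlap * log_prob).sum(dim=1) / overlap.sum(dim=1).clamp(min=1e-8)
    return loss.mean()
```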



Paperid:1861
Authors:Qiao Zhang, Tao Xiang, Chunsheng Xin, Hongyi Wu
Chongqing University, Chongqing University, Old Dominion University, The University of Arizona
Abstract:
Privacy-preserving Machine Learning as a Service (MLaaS) enables a powerful cloud server to run its well-trained neural model upon the input from a resource-limited client, with both the server's model parameters and the client's input data protected. While computation efficiency is critical for the practical implementation of privacy-preserving MLaaS, and it is inspiring to witness recent advances towards efficiency improvement, there still exists a significant performance gap to real-world applications. In general, state-of-the-art frameworks perform function-wise efficiency optimization based on specific cryptographic primitives. Although this is logical, such independent optimization for each function leaves a noticeable amount of expensive operations unremovable and misses the opportunity to further accelerate performance by jointly considering privacy-preserving computation among adjacent functions. As such, we propose COIN: Conjunctive Optimization with Interleaved Nexus, which remodels the mainstream computation for each function into a conjunctive counterpart for composite functions, with a series of united optimization strategies. Specifically, COIN jointly computes a pair of consecutive nonlinear-linear functions in the neural model by reconstructing the intermediates throughout the whole procedure, which not only eliminates the most expensive crypto operations without invoking an extra encryption enabler, but also makes the online crypto complexity independent of filter size. Experimentally, COIN demonstrates 11.2x to 29.6x speedup over various function dimensions from modern networks, and 6.4x to 12x speedup on the total computation time when applied in networks with model input from small-scale CIFAR10 to large-scale ImageNet.



Paperid:1862
Authors:Rongchao Zhang, Yiwei Lou, Dexuan Xu, Yongzhi Cao, Hanpin Wang, Yu Huang
Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China, School of Software & Microelectronics, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China National Engineering Research Center for Software Engineering, Peking University, Beijing, China
Abstract:
The actual collection of tabular data for sharing involves confidentiality and privacy constraints, leaving the potential risks of machine learning for interventional data analysis unsafely averted. Synthetic data has emerged recently as a privacy-protecting solution to address this challenge. However, existing approaches regard discrete and continuous modal features as separate entities, thus falling short in properly capturing their inherent correlations. In this paper, we propose a novel contrastive learning guided Gaussian Transformer autoencoder, termed GTCoder, to synthesize photo-realistic multimodal tabular data for scientific research. Our approach introduces a transformer-based fusion module that seamlessly integrates multimodal features, permitting the mining of more informative latent representations. The attention within the fusion module directs the integrated output features to focus on critical components that facilitate the task of generating latent embeddings. Moreover, we formulate a contrastive learning strategy to implicitly constrain the embeddings from discrete features in the latent feature space by pulling similar discrete feature distributions closer while pushing dissimilar ones further away, in order to better enhance the representation of the latent embedding. Experimental results indicate that GTCoder is effective in generating photo-realistic synthetic data, with interactive interpretation of the latent embedding, and performs favorably against some baselines on most real-world and simulated datasets.



Paperid:1863
Authors:Rongyu Zhang, Yulin Luo, Jiaming Liu, Huanrui Yang, Zhen Dong, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Yuan Du, Shanghang Zhang
Nanjing University National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, University of California, Berkeley, University of California, Berkeley, Panasonic, Panasonic, Panasonic, University of California, Berkeley, Nanjing University, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Abstract:
The Mixture-of-Experts (MoE) approach has demonstrated outstanding scalability in multi-task learning, including low-level upstream tasks such as the concurrent removal of multiple adverse weather effects. However, the conventional MoE architecture with parallel Feed-Forward Network (FFN) experts leads to significant parameter and computational overheads that hinder its efficient deployment. In addition, the naive MoE linear router is suboptimal in assigning task-specific features to multiple experts, which limits its further scalability. In this work, we propose an efficient MoE architecture with weight sharing across the experts. Inspired by the idea of linear feature modulation (FM), our architecture implicitly instantiates multiple experts via learnable activation modulations on a single shared expert block. The proposed Feature Modulated Expert (FME) serves as a building block for the novel Mixture-of-Feature-Modulation-Experts (MoFME) architecture, which can scale up the number of experts with low overhead. We further propose an Uncertainty-aware Router (UaR) to assign task-specific features to different FM modules with well-calibrated weights. This enables MoFME to effectively learn diverse expert functions for multiple tasks. The conducted experiments on the multi-deweather task show that our MoFME outperforms the state-of-the-art in image restoration quality by 0.1-0.2 dB while saving more than 74% of parameters and 20% of inference time over the conventional MoE counterpart. Experiments on the downstream segmentation and classification tasks further demonstrate the generalizability of MoFME to real open-world applications.
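The weight-sharing idea can be sketched as a single shared FFN whose output is modulated by per-expert learnable scale and shift parameters, with router weights mixing the modulated copies; the placement of the modulation and the simple router interface are simplifying assumptions, not MoFME's exact design.

```python
import torch
import torch.nn as nn

class FeatureModulatedExperts(nn.Module):
    # One shared expert FFN; each logical expert is a learnable scale/shift
    # (feature modulation) applied to the shared activations.
    def __init__(self, dim, hidden, num_experts):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.scale = nn.Parameter(torch.ones(num_experts, dim))
        self.shift = nn.Parameter(torch.zeros(num_experts, dim))

    def forward(self, x, router_weights):
        # x: (batch, dim); router_weights: (batch, num_experts), rows summing to 1.
        h = self.shared(x)                                   # shared computation
        experts = h.unsqueeze(1) * self.scale + self.shift   # (batch, num_experts, dim)
        return (router_weights.unsqueeze(-1) * experts).sum(dim=1)
```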



Paperid:1864
Authors:Ruining Zhang, Haoran Han, Maolong Lv, Qisong Yang, Jian Cheng
School of Information and Communication Engineering, University of Electronic Science and Technology of China Glasgow School, University of Electronic Science and Technology of China, School of Information and Communication Engineering, University of Electronic Science and Technology of China, Air Traffic Control and Navigation College, Air Force Engineering University, Xi'an Institute of High-Tech, School of Information and Communication Engineering, University of Electronic Science and Technology of China
Abstract:
Extensive utilization of deep reinforcement learning (DRL) policy networks in diverse continuous control tasks has raised questions regarding performance degradation in expansive state spaces where the input state norm is larger than that in the training environment. This paper aims to uncover the underlying factors contributing to such performance deterioration when dealing with expanded state spaces, using a novel analysis technique known as state division. In contrast to prior approaches that employ state division merely as a post-hoc explanatory tool, our methodology delves into the intrinsic characteristics of DRL policy networks. Specifically, we demonstrate that the expansion of state space induces the tanh activation function to exhibit saturability, resulting in the transformation of the state division boundary from nonlinear to linear. Our analysis centers on the paradigm of the double-integrator system, revealing that this gradual shift towards linearity imparts a control behavior reminiscent of bang-bang control. However, the inherent linearity of the division boundary prevents the attainment of an ideal bang-bang control, thereby introducing unavoidable overshooting. Our experimental investigations, employing diverse RL algorithms, establish that this performance phenomenon stems from inherent attributes of the DRL policy network, remaining consistent across various optimization algorithms.
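The saturation effect described above can be illustrated with a toy numerical experiment (not the paper's setup): scaling the input state norm pushes the pre-activations of a random tanh layer into the flat region, so an increasing fraction of units saturates.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))          # first layer of a hypothetical tanh policy network
state = rng.normal(size=4)

for scale in [1.0, 5.0, 25.0]:        # expand the state-space norm
    pre = W @ (scale * state)
    act = np.tanh(pre)
    print(scale, np.mean(np.abs(act) > 0.99))  # fraction of saturated units grows with the scale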



Paperid:1865
Authors:Shaojie Zhang, Chun Shen, Shuai Lü, Zeyu Zhang
Jilin University, Jilin University, Jilin University, Jilin University
Abstract:
To address the data privacy and portability issues of domain adaptation, Domain Adaptation of Black-box Predictors (DABP) aims to adapt a black-box source model to an unlabeled target domain without accessing either the source-domain data or the details of the source model. Although existing DABP approaches based on knowledge distillation (KD) have achieved promising results, we experimentally find that these methods all suffer from the minority class forgetting issue, which means that the trained model completely forgets some minority classes. To address this issue, we propose a method called Reviewing the Forgotten Classes (RFC), which includes two main modules. First, we propose a simple but effective component called selection training (ST). ST selects classes that the model tends to forget according to the learning status of the model and obtains clean samples of the selected classes with the small-loss criterion for enhanced training. ST is orthogonal to previous methods and can effectively alleviate their minority class forgetting issue. Second, we find that neighborhood clustering (NC) can help the model learn in a more balanced way than KD, thereby further alleviating the minority class forgetting issue. However, NC relies on the fact that target features from the source model already form some semantic structure, while DABP is unable to access the source model. Thus, we use KD and ST to warm up the target model to form a certain semantic structure. Overall, our method inherits the merits of both ST and NC, and achieves state-of-the-art performance on three DABP benchmarks.



Paperid:1866
Authors:Shimin Zhang, Qu Yang, Chenxiang Ma, Jibin Wu, Haizhou Li, Kay Chen Tan
The Hong Kong Polytechnic University, National University of Singapore, The Hong Kong Polytechnic University, The Hong Kong Polytechnic University, The Chinese University of Hong Kong, Shenzhen National University of Singapore, The Hong Kong Polytechnic University
Abstract:
The identification of sensory cues associated with potential opportunities and dangers is frequently complicated by unrelated events that separate useful cues by long delays. As a result, it remains a challenging task for state-of-the-art spiking neural networks (SNNs) to establish long-term temporal dependency between distant cues. To address this challenge, we propose a novel biologically inspired Two-Compartment Leaky Integrate-and-Fire spiking neuron model, dubbed TC-LIF. The proposed model incorporates carefully designed somatic and dendritic compartments that are tailored to facilitate learning long-term temporal dependencies. Furthermore, a theoretical analysis is provided to validate the effectiveness of TC-LIF in propagating error gradients over an extended temporal duration. Our experimental results, on a diverse range of temporal classification tasks, demonstrate superior temporal classification capability, rapid training convergence, and high energy efficiency of the proposed TC-LIF model. Therefore, this work opens up a myriad of opportunities for solving challenging temporal processing tasks on emerging neuromorphic computing systems. Our code is publicly available at https://github.com/ZhangShimin1/TC-LIF.
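As a rough illustration of the two-compartment idea, the sketch below implements a generic two-compartment leaky integrate-and-fire update with a dendritic and a somatic potential; the decay and coupling constants are arbitrary placeholders, and the actual TC-LIF parameterization is in the authors' repository.

import torch

def tc_lif_step(x, v_d, v_s, beta_d=0.9, beta_s=0.9, g=0.5, v_th=1.0):
    """One step of a generic two-compartment LIF neuron (illustrative constants)."""
    v_d = beta_d * v_d + x                 # dendritic compartment integrates the input
    v_s = beta_s * v_s + g * v_d           # somatic compartment is driven by the dendrite
    spike = (v_s >= v_th).float()          # fire when the somatic potential crosses threshold
    v_s = v_s - spike * v_th               # soft reset after a spike
    return spike, v_d, v_s

T, N = 50, 8
v_d = v_s = torch.zeros(N)
for t in range(T):
    spike, v_d, v_s = tc_lif_step(torch.rand(N) * 0.3, v_d, v_s)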



Paperid:1867
Authors:Shuo Zhang, Yuwen Li, Zhongyu Wang, Jianqing Li, Chengyu Liu
Southeast University, Southeast University, Southeast University, Southeast University, Southeast University
Abstract:
Datasets often include noisy labels, but learning from them is difficult. Since mislabeled examples usually have larger loss values during training, the small-loss trick is regarded as a standard metric to identify clean examples from the training set for better performance. Nonetheless, this criterion ignores that some clean but hard-to-learn examples also generate large losses and could therefore be misidentified. In this paper, we propose a new metric called the Integrated Area Margin (IAM), which is superior to the traditional small-loss trick, particularly in recognizing clean but hard-to-learn examples. Based on the IAM, we further offer the Hyperspherical Margin Weighting (HMW) approach. It is a new sample weighting strategy that restructures the importance of each example. It should be highlighted that our approach is universal and can strengthen various methods in this field. Experiments on both benchmark and real-world datasets indicate that our HMW outperforms many state-of-the-art approaches on learning-with-noisy-labels tasks. Codes are available at https://github.com/Zhangshuojackpot/HMW.
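For context, the small-loss trick that the abstract critiques can be written in a few lines: examples are ranked by their current loss and only the smallest-loss fraction is kept as "clean". The sketch below shows this baseline criterion (the IAM/HMW method itself is in the linked repository); the clean ratio is an illustrative assumption.

import numpy as np

def small_loss_selection(losses, clean_ratio=0.7):
    """Standard small-loss trick: keep the examples with the smallest loss."""
    k = int(len(losses) * clean_ratio)
    return np.argsort(losses)[:k]          # indices treated as "clean"

losses = np.array([0.1, 2.3, 0.4, 1.8, 0.2, 0.9])
print(small_loss_selection(losses))        # hard-but-clean examples with large loss get dropped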



Paperid:1868
Authors:Wang Zhang, Ziwen Martin Ma, Subhro Das, Tsui-Wei Lily Weng, Alexandre Megretski, Luca Daniel, Lam M. Nguyen
Massachusetts Institute of Technology, Harvard University, MIT-IBM Watson AI Lab, IBM Research, University of California San Diego, Massachusetts Institute of Technology, Massachusetts Institute of Technology, IBM Research, Thomas J. Watson Research Center
Abstract:
Neural networks are powerful tools in various applications, and quantifying their uncertainty is crucial for reliable decision-making. In the deep learning field, uncertainties are usually categorized into aleatoric (data) and epistemic (model) uncertainty. In this paper, we point out that the existing popular variance attenuation method highly overestimates aleatoric uncertainty. To address this issue, we propose a new estimation method that actively de-noises the observed data. By conducting a broad range of experiments, we demonstrate that our proposed approach provides a much closer approximation to the actual data uncertainty than the standard method.
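The variance attenuation method referenced above is commonly implemented as a heteroscedastic Gaussian negative log-likelihood, where the network predicts a mean and a log-variance per input; a minimal sketch of that standard baseline follows (it is not the paper's de-noising estimator).

import torch

def gaussian_nll(mean, log_var, target):
    """Variance-attenuation loss: the predicted variance down-weights hard examples."""
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()

mean = torch.randn(32, requires_grad=True)      # stand-ins for network outputs
log_var = torch.zeros(32, requires_grad=True)
target = torch.randn(32)
loss = gaussian_nll(mean, log_var, target)
loss.backward()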



Paperid:1869
Authors:Wei Zhang, Brian Barr, John Paisley
Columbia University, Capital One, Columbia University
Abstract:
Deep neural networks have revolutionized many fields, but their black-box nature also occasionally prevents their wider adoption in fields such as healthcare and finance, where interpretable and explainable models are required. The recent development of Neural Additive Models (NAMs) represents a major step in the direction of interpretable deep learning for tabular datasets. In this paper, we propose a new subclass of NAMs that utilizes a single-layer neural network construction of the Gaussian process via random Fourier features, which we call Gaussian Process Neural Additive Models (GP-NAM). GP-NAM has the advantage of a convex objective function and a number of trainable parameters that grows linearly with the feature dimension. It suffers no loss in performance compared with deeper NAM approaches because GPs are well-suited to learning complex non-parametric univariate functions. We demonstrate the performance of GP-NAM on several tabular datasets, showing that it achieves comparable performance in both classification and regression tasks with a massive reduction in the number of parameters.
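A rough sketch of the random Fourier feature construction mentioned above: each scalar feature is mapped to RFF features approximating an RBF-kernel GP, the per-feature blocks are concatenated into an additive design matrix, and the model is fit by a convex least-squares problem. The lengthscale, feature count, and plain least-squares fit are illustrative assumptions rather than the exact GP-NAM recipe.

import numpy as np

def rff_features(x, num_features=100, lengthscale=1.0, seed=0):
    """Random Fourier features approximating an RBF-kernel GP for one scalar feature."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0 / lengthscale, size=num_features)
    b = rng.uniform(0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(np.outer(x, w) + b)   # (n, num_features)

X = np.random.randn(200, 5)                          # 5 tabular features
Phi = np.hstack([rff_features(X[:, j], seed=j) for j in range(X.shape[1])])  # additive design
y = np.random.randn(200)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # convex fit: the model is linear in theta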



Paperid:1870
Authors:Xianda Zhang, Baolin Zheng, Jianbao Hu, Chengyang Li, Xiaoying Bai
Department of Computer Science and Technology, Peking University Advanced Institute of Big Data, Alibaba Group, University of Glasgow, Department of Computer Science and Technology, Peking University, Advanced Institute of Big Data
Abstract:
Despite the tremendous success of deep neural networks (DNNs) across various fields, their susceptibility to potential backdoor attacks seriously threatens their application security, particularly in safety-critical or security-sensitive applications. Given this growing threat, there is a pressing need for research into purging backdoors from DNNs. However, prior efforts on erasing backdoor triggers not only failed to withstand increasingly powerful attacks but also resulted in reduced model performance. In this paper, we propose From Toxic to Trustworthy (FTT), an innovative approach to eliminate backdoor triggers while simultaneously enhancing model accuracy. Under the stringent and practical assumption of limited availability of clean data, we introduce a self-attention distillation (SAD) method to remove the backdoor by aligning the shallow and deep parts of the network. Furthermore, we devise a semi-supervised learning (SSL) method that leverages the ubiquitous and available poisoned data to further purify backdoors and improve accuracy. Extensive experiments on various attacks and models show that our FTT can reduce the attack success rate from 97% to 1% and improve the accuracy by 4% on average, demonstrating its effectiveness in mitigating backdoor attacks and improving model performance. Compared to state-of-the-art (SOTA) methods, our FTT can reduce the attack success rate by a factor of 2 and improve the accuracy by 5%, shedding light on backdoor cleansing.



Paperid:1871
Authors:Xinyu Zhang, Meng Kang, Shuai Lü
Jilin University, Jilin University, Jilin University
Abstract:
Recently, instance contrastive learning has achieved good results in unsupervised domain adaptation. It reduces the distances between positive samples and the anchor, increases the distances between negative samples and the anchor, and learns discriminative feature representations for target samples. However, most recent methods identify positive and negative samples based on whether the pseudo-labels of the samples and the pseudo-label of the anchor correspond to the same class. Due to the lack of target labels, many uncertain data are mistakenly labeled during the training process, and many samples with low training potential are also utilized. To address these problems, we propose Low Category Uncertainty and High Training Potential Instance Learning for Unsupervised Domain Adaptation (LUHP). We first propose a weight to measure the category uncertainty of each target sample. We can effectively filter the samples near the decision boundary through category uncertainty thresholds calculated from these weights. Then we propose a new loss to focus on samples with high training potential. Finally, for anchors with low category uncertainty, we propose a sample reuse strategy to make the model more robust. We demonstrate the effectiveness of LUHP on four datasets widely used in unsupervised domain adaptation.



Paperid:1872
Authors:Xuechen Zhang, Mingchen Li, Jiasi Chen, Christos Thrampoulidis, Samet Oymak
University of California Riverside, University of Michigan, Ann Arbor, University of Michigan, Ann Arbor, University of British Columbia, University of Michigan, Ann Arbor
Abstract:
Modern classification problems exhibit heterogeneities across individual classes: each class may have unique attributes, such as sample size, label quality, or predictability (easy vs difficult), and variable importance at test time. Without care, these heterogeneities impede the learning process, most notably when optimizing fairness objectives. Confirming this, under a Gaussian mixture setting, we show that the optimal SVM classifier for balanced accuracy needs to be adaptive to the class attributes. This motivates us to propose CAP: an effective and general method that generates a class-specific learning strategy (e.g., hyperparameters) based on the attributes of that class. This way, the optimization process better adapts to heterogeneities. CAP leads to substantial improvements over the naive approach of assigning separate hyperparameters to each class. We instantiate CAP for loss function design and post-hoc logit adjustment, with an emphasis on label-imbalanced problems. We show that CAP is competitive with prior art and that its flexibility unlocks clear benefits for fairness objectives beyond balanced accuracy. Finally, we evaluate CAP on problems with label noise as well as weighted test objectives to showcase how CAP can jointly adapt to different heterogeneities.
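As an example of the post-hoc logit adjustment that CAP instantiates, the standard recipe subtracts a temperature-scaled log class prior from the logits so that rare classes are not crowded out; the sketch below shows that baseline form, with tau standing in for the kind of class-specific hyperparameter CAP would generate (illustrative, not the paper's exact rule).

import numpy as np

def logit_adjust(logits, class_counts, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) so rare classes are boosted."""
    priors = class_counts / class_counts.sum()
    return logits - tau * np.log(priors)

logits = np.array([[2.0, 1.5, 0.1]])
class_counts = np.array([900, 90, 10])              # long-tailed training set
print(logit_adjust(logits, class_counts).argmax())  # adjusted prediction favors the rare class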



Paperid:1873
Authors:Yayu Zhang, Yuhua Qian, Guoshuai Ma, Keyin Zheng, Guoqing Liu, Qingfu Zhang
Shanxi University City University of Hong Kong, Shanxi University, North University of China, Shanxi University, Shanxi University City University of Hong Kong, City University of Hong Kong The City University of Hong Kong Shenzhen Research Institute
Abstract:
Multi-task learning deals with multiple related tasks simultaneously by sharing knowledge. In a typical deep multi-task learning model, all tasks use the same feature space and share the latent knowledge. If the tasks are weakly correlated or some features are negatively correlated, sharing all knowledge often leads to negative knowledge transfer among tasks. To overcome this issue, this paper proposes a Fisher sparse multi-task learning method. It can obtain a sparse sharing representation for each task, so that tasks share features on a sparse subspace. Our method can ensure that the knowledge transferred among tasks is beneficial. Specifically, we first propose a sparse deep multi-task learning model, and then introduce a Fisher sparse module into traditional deep multi-task learning to learn the sparse variables of each task. By alternately updating the neural network parameters and sparse variables, a sparse sharing representation can be learned for each task. In addition, in order to reduce the computational overhead, a heuristic method is used to estimate the Fisher information of the neural network parameters. Experimental results show that, compared with other methods, our proposed method improves the performance for all tasks and has high sparsity in multi-task learning.



Paperid:1874
Authors:Yinmin Zhang, Jie Liu, Chuming Li, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang
University of Sydney Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong Shanghai Artificial Intelligence Laboratory, University of Sydney Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory, Peking University, Shanghai Artificial Intelligence Laboratory, Shanghai Artificial Intelligence Laboratory
Abstract:
Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on the balance between the RL objective and pessimism, or on the utilization of offline and online samples. In this paper, from a novel perspective, we systematically study the challenges that remain in O2O RL and identify that the reason behind the slow performance improvement and the instability of online finetuning lies in the inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that the estimation bias and the inaccurate rank of Q-values cause a misleading signal for the policy update, making standard offline RL algorithms, such as CQL and TD3-BC, ineffective in online finetuning. Based on this observation, we address the problem of Q-value estimation with two techniques: (1) perturbed value updates and (2) an increased frequency of Q-value updates. The first technique smooths out biased Q-value estimates with sharp peaks, preventing early-stage policy exploitation of sub-optimal actions. The second alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoCo and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues and consistently improves performance over state-of-the-art methods by up to 83.1%.



Paperid:1875
Authors:Yixuan Zhang, Boyu Li, Zenan Ling, Feng Zhou
Hangzhou Dianzi University, University of Technology Sydney, Huazhong University of Science and Technology, Renmin University of China
Abstract:
Discrimination can occur when the underlying unbiased labels are overwritten by an agent with potential bias, resulting in biased datasets that unfairly harm specific groups and cause classifiers to inherit these biases. In this paper, we demonstrate that despite only having access to the biased labels, it is possible to eliminate bias by filtering the fairest instances within the framework of confident learning. In the context of confident learning, low self-confidence usually indicates potential label errors; however, this is not always the case. Instances, particularly those from underrepresented groups, might exhibit low confidence scores for reasons other than labeling errors. To address this limitation, our approach employs truncation of the confidence score and extends the confidence interval of the probabilistic threshold. Additionally, we incorporate the co-teaching paradigm to provide a more robust and reliable selection of fair instances and effectively mitigate the adverse effects of biased labels. Through extensive experimentation and evaluation on various datasets, we demonstrate the efficacy of our approach in promoting fairness and reducing the impact of label bias in machine learning models.



Paperid:1876
Authors:Yuhan Zhang, Xiaode Liu, Yuanpei Chen, Weihang Peng, Yufei Guo, Xuhui Huang, Zhe Ma
Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC, Intelligent Science and Technology Academy of CASIC, Intelligent Science & Technology Academy of CASIC
Abstract:
Spiking neural networks (SNNs) have recently attracted intensive attention as a promising energy-efficient alternative to conventional artificial neural networks (ANNs): they transmit information in the form of binary spikes rather than continuous activations, so that the multiplication of activation and weight can be replaced by addition to save energy. However, the binary spike representation sacrifices the expressive power of SNNs and leads to accuracy degradation compared with ANNs. Since improving feature representation is beneficial to training an accurate SNN model, this paper focuses on enhancing the feature representation of the SNN. To this end, we establish a similarity-sensitive contrastive learning framework, in which the SNN captures significantly more information from its ANN counterpart to improve representation via Mutual Information (MI) maximization with layer-wise sensitivity to similarity. Specifically, it enriches the SNN's feature representation by pulling the positive pairs of SNN and ANN feature representations of each layer from the same input samples closer together, while pushing the negative pairs from different samples further apart. Experimental results show that our method consistently outperforms the current state-of-the-art algorithms on both popular non-spiking static and neuromorphic datasets.
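A minimal sketch of the layer-wise contrastive objective described above, using a standard InfoNCE loss in which the SNN and ANN features of the same sample form the positive pair and other samples in the batch serve as negatives; the similarity-sensitive layer weighting of the paper is omitted and the temperature is an assumption.

import torch
import torch.nn.functional as F

def info_nce(snn_feat, ann_feat, temperature=0.1):
    """Positive pairs: same sample's SNN/ANN features; negatives: other samples in the batch."""
    snn = F.normalize(snn_feat, dim=-1)
    ann = F.normalize(ann_feat, dim=-1)
    logits = snn @ ann.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(snn.size(0))            # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))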



Paperid:1877
Authors:Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo
The Chinese University of Hong Kong, The Chinese University of Hong Kong, Tencent Inc., The Chinese University of Hong Kong, The Chinese University of Hong Kong, The University of Hong Kong
Abstract:
This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOPs, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and displays the ability to be applied to a broader range of situations.



Paperid:1878
Authors:Zhe Zhang, Xiaoyang Tan
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence
Abstract:
We revisit behavior regularization, a popular approach to mitigate the extrapolation error in offline reinforcement learning (RL), showing that current behavior regularization may suffer from unstable learning and hinder policy improvement. Motivated by this, a novel reward shaping-based behavior regularization method is proposed, where the log-probability ratio between the learned policy and the behavior policy is monitored during learning. We show that this is equivalent to an implicit but computationally lightweight trust region mechanism, which helps mitigate the influence of estimation errors of the value function and leads to more stable performance improvement. Empirical results on the popular D4RL benchmark verify the effectiveness of the presented method, with promising performance compared with some state-of-the-art offline RL algorithms.
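The reward-shaping view described above can be sketched in one line: the environment reward is penalized by the log-probability ratio between the learned policy and the behavior policy. The coefficient alpha and the tensor-based interface below are illustrative assumptions, not the paper's exact formulation.

import torch

def shaped_reward(reward, log_pi, log_pi_behavior, alpha=0.1):
    """Behavior regularization as reward shaping: penalize deviation from the behavior policy."""
    return reward - alpha * (log_pi - log_pi_behavior)

r = torch.tensor([1.0, 0.5])
# the second action is much less likely under the learned policy than under the behavior policy,
# so its shaped reward is pushed up less (the log-ratio penalty is what gets monitored in training)
print(shaped_reward(r, torch.tensor([-0.2, -3.0]), torch.tensor([-0.5, -0.4])))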



Paperid:1879
Authors:Zhen Zhang, Junfeng Yang, Limei Liu, Xuesong Xu, Guozhen Rong, Qilong Feng
Hunan University of Technology and Business Xiangjiang Laboratory, Hunan University of Technology and Business Xiangjiang Laboratory, Hunan University of Technology and Business Xiangjiang Laboratory, Hunan University of Technology and Business Xiangjiang Laboratory, Changsha University of Science and Technology, Central South University Xiangjiang Laboratory
Abstract:
The representative k-median problem generalizes the classical clustering formulations in that it partitions the data points into several disjoint demographic groups and poses a lower-bound constraint on the number of opened facilities from each group, such that all the groups are fairly represented by the opened facilities. Due to its simplicity, the local-search heuristic that optimizes an initial solution by iteratively swapping at most a constant number of closed facilities for the same number of opened ones (denoted as the O(1)-swap heuristic) has been frequently used for the representative k-median problem. Unfortunately, despite its good performance in experiments, whether the O(1)-swap heuristic has provable approximation guarantees when the number of groups is more than 2 has long remained an open question. As an answer to this question, we show that the O(1)-swap heuristic (1) is guaranteed to yield a constant-factor approximation solution if the number of groups is a constant, and (2) has an unbounded approximation ratio otherwise. Our main technical contribution is a new approach for theoretically analyzing local-search heuristics, which derives the approximation ratio of the O(1)-swap heuristic by linearly combining the increased clustering costs induced by a set of hierarchically organized swaps.



Paperid:1880
Authors:Di Zhao, Yun Sing Koh, Gillian Dobbie, Hongsheng Hu, Philippe Fournier-Viger
School of Computer Science, University of Auckland, School of Computer Science, University of Auckland, School of Computer Science, University of Auckland, CSIRO's Data61, College of Computer Science and Software Engineering, Shenzhen University
Abstract:
Deep learning methods often suffer performance degradation due to domain shift, where discrepancies exist between training and testing data distributions. Domain generalization mitigates this problem by leveraging information from multiple source domains to enhance model generalization capabilities for unseen domains. However, existing domain generalization methods typically present examples to the model in a random manner, overlooking the potential benefits of structured data presentation. To bridge this gap, we propose a novel learning strategy, Symmetric Self-Paced Learning (SSPL), for domain generalization. SSPL consists of a Symmetric Self-Paced training scheduler and a Gradient-based Difficulty Measure (GDM). Specifically, the proposed training scheduler initially focuses on easy examples, gradually shifting emphasis to harder examples as training progresses. GDM dynamically evaluates example difficulty through the gradient magnitude with respect to the example itself. Experiments across five popular benchmark datasets demonstrate the effectiveness of the proposed learning strategy.
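A hypothetical sketch of the gradient-based difficulty measure described above: the difficulty of each example is taken as the norm of the loss gradient with respect to the example itself, and examples can then be ordered from easy to hard for the scheduler. The model and data below are placeholders, not the paper's setup.

import torch
import torch.nn as nn

def gradient_difficulty(model, x, y):
    """Difficulty of each example = norm of the loss gradient w.r.t. the example itself."""
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y, reduction='sum')
    grad, = torch.autograd.grad(loss, x)
    return grad.flatten(1).norm(dim=1)                      # (batch,) difficulty scores

model = nn.Linear(10, 3)
diff = gradient_difficulty(model, torch.randn(32, 10), torch.randint(0, 3, (32,)))
easy_to_hard = diff.argsort()                               # curriculum order: easy examples first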



Paperid:1881
Authors:Han Zhao, Xu Yang, Cheng Deng, Junchi Yan
Xidian University, Xidian University, Xidian University, Shanghai Jiao Tong University
Abstract:
Spiking Graph Neural Networks are emerging tools for analyzing graph data with low energy consumption and a degree of biological fidelity. Existing methods directly integrate same-reactive spiking neurons into graph neural networks to process propagated graphs. However, such same-reactive neurons lack the biological functionality of the brain's dynamic-reactive neurons, limiting the model's expressiveness. Meanwhile, only limited long-range neighbor information can be extracted from the few-step propagated graph, restricting the discrimination of graph spiking embeddings. Inspired by the dynamic cognition in the brain, we propose a Dynamic Reactive Spiking Graph Neural Network that enhances the model's expressive ability with higher biological fidelity. Specifically, we design dynamic reactive spiking neurons to process spiking graph inputs, which have unique optimizable thresholds to spontaneously explore dynamic reactive states between neurons. Moreover, discriminative graph positional spikes are learned and integrated adaptively into the spiking outputs through our neurons, thereby exploring long-range neighbors more thoroughly. Finally, with the dynamic reactive mechanism and learnable positional integration, we obtain a powerful model with high biological fidelity and low energy consumption. Experiments on various domain-related datasets demonstrate the effectiveness of our model. Our code is available at https://github.com/hzhao98/DRSGNN.



Paperid:1882
Authors:Kai Zhao, Chang Xu, Bailu Si
Beijing Normal University, Beijing Normal University, Beijing Normal University Chinese Institute for Brain Research, Beijing
Abstract:
Visual abstract reasoning tasks present challenges for deep neural networks, exposing limitations in their capabilities. In this work, we present a neural network model that addresses the challenges posed by Raven's Progressive Matrices (RPM). Inspired by the two-stream hypothesis of visual processing, we introduce the Dual-stream Reasoning Network (DRNet), which utilizes two parallel branches to capture image features. On top of the two streams, a reasoning module first learns to merge the high-level features of the same image. Then, it employs a rule extractor to handle combinations involving the eight context images and each candidate image, extracting discrete abstract rules and utilizing a multilayer perceptron (MLP) to make predictions. Empirical results demonstrate that the proposed DRNet achieves state-of-the-art average performance across multiple RPM benchmarks. Furthermore, DRNet demonstrates robust generalization capabilities, even extending to various out-of-distribution scenarios. The dual streams within DRNet serve distinct functions by addressing local or spatial information. They are then integrated into the reasoning module, leveraging abstract rules to facilitate the execution of visual reasoning tasks. These findings indicate that the dual-stream architecture could play a crucial role in visual abstract reasoning.



Paperid:1883
Authors:Na Zhao, Gim Hee Lee
Singapore University of Technology and Design, National University of Singapore
Abstract:
Learning from open-world noisy data, where both closed-set and open-set noise co-exist in the dataset, is a realistic but underexplored setting. Only recently have several efforts been initiated to tackle this problem. However, these works assume the classes are balanced when dealing with open-world noisy data. This assumption often conflicts with the nature of real-world large-scale datasets, where the label distributions are generally long-tailed, i.e., class-imbalanced. In this paper, we study the problem of robust visual recognition with class-imbalanced open-world noisy data. We propose a probabilistic graphical model-based approach, iMRF, to achieve label noise correction that is robust to class imbalance via efficient iterative inference of a Markov Random Field (MRF) in each training mini-batch. Furthermore, we design an agreement-based thresholding strategy to adaptively collect clean samples from all classes, including corrected closed-set noisy samples, while rejecting open-set noisy samples. We also introduce a noise-aware balanced cross-entropy loss to explicitly eliminate the bias caused by class-imbalanced data. Extensive experiments on several benchmark datasets, including synthetic and real-world noisy datasets, demonstrate the superior performance and robustness of our method over existing methods. Our code is available at https://github.com/Na-Z/LIOND.



Paperid:1884
Authors:Pengfei Zhao, Haoren Zhu, Wilfred Siu Hung NG, Dik Lun Lee
Guangdong Provincial Key Laboratory of Interdisciplinary Research and Application for Data Science BNU-HKBU United International College, Hong Kong University of Science and Technology, Hong Kong University of Science and Technology, Hong Kong University of Science and Technology
Abstract:
Volatility, as a measure of uncertainty, plays a crucial role in numerous financial activities such as risk management. The Econometrics and Machine Learning communities have developed two distinct approaches for financial volatility forecasting: the stochastic approach and the neural network (NN) approach. Despite their individual strengths, these methodologies have conventionally evolved in separate research trajectories with little interaction between them. This study endeavors to bridge this gap by establishing an equivalence relationship between models of the GARCH family and their corresponding NN counterparts. With the equivalence relationship established, we introduce an innovative approach, named GARCH-NN, for constructing NN-based volatility models. It obtains the NN counterparts of GARCH models and integrates them as components into an established NN architecture, thereby seamlessly infusing the volatility stylized facts (SFs) inherent in the GARCH models into the neural network. We develop the GARCH-LSTM model to showcase the power of the GARCH-NN approach. Experimental results validate that amalgamating the NN counterparts of the GARCH family models into established NN models leads to enhanced outcomes compared to employing the stochastic and NN models in isolation.
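For reference, the GARCH(1,1) recursion that the equivalence above builds on can be written as a simple recurrent update of the conditional variance; the parameter values below are illustrative, and the GARCH-LSTM integration itself is not reproduced here.

import numpy as np

def garch11_variance(returns, omega=0.05, alpha=0.1, beta=0.85):
    """GARCH(1,1): sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}."""
    sigma2 = np.empty_like(returns)
    sigma2[0] = returns.var()                    # initialize with the unconditional variance
    for t in range(1, len(returns)):
        sigma2[t] = omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

r = np.random.default_rng(0).normal(scale=0.01, size=500)   # placeholder return series
sig2 = garch11_variance(r)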



Paperid:1885
Authors:Puning Zhao, Zhiguo Wan
Zhejiang Lab, Hangzhou, Zhejiang, China, Zhejiang Lab, Hangzhou, Zhejiang, China
Abstract:
This paper studies robust nonparametric regression, in which an adversarial attacker can modify the values of up to q samples from a training dataset of size N. Our initial solution is an M-estimator based on Huber loss minimization. Compared with simple kernel regression, i.e., the Nadaraya-Watson estimator, this method can significantly weaken the impact of malicious samples on the regression performance. We provide the convergence rate as well as the corresponding minimax lower bound. The results show that, with proper bandwidth selection, the supremum error is minimax optimal. The L2 error is optimal for relatively small q, but suboptimal for larger q. The reason is that this estimator is vulnerable when many attacked samples concentrate in a small region. To address this issue, we propose a correction method that projects the initial estimate onto the space of Lipschitz functions. The final estimate is nearly minimax optimal for arbitrary q, up to a logarithmic factor.
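A hedged sketch of the Huber-loss M-estimation idea described above: at a query point, a kernel-weighted local constant is fit by gradient steps on the Huber loss, whose clipped influence function bounds the effect of corrupted samples. The kernel, bandwidth, and optimization details are illustrative and not the paper's exact estimator.

import numpy as np

def huber_grad(r, delta=1.0):
    """Derivative of the Huber loss: linear for small residuals, clipped for large ones."""
    return np.clip(r, -delta, delta)

def huber_local_estimate(x_query, X, y, bandwidth=0.2, steps=200, lr=0.5):
    """Local constant fit at x_query via gradient descent on a kernel-weighted Huber loss."""
    w = np.exp(-0.5 * ((X - x_query) / bandwidth) ** 2)     # Gaussian kernel weights
    theta = np.median(y)                                    # robust initialization
    for _ in range(steps):
        theta += lr * np.sum(w * huber_grad(y - theta)) / np.sum(w)
    return theta

X = np.random.rand(300)
y = np.sin(2 * np.pi * X) + 0.1 * np.random.randn(300)
y[:10] = 50.0                                               # a few adversarially corrupted samples
print(huber_local_estimate(0.5, X, y))                      # estimate stays near the true curve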



Paperid:1886
Authors:Wenhui Zhao, Guangfei Li, Haizhou Yang, Quanxue Gao, Qianqian Wang
School of Telecommunication Engineering, Xidian University, Shaanxi 710071, China, School of Telecommunication Engineering, Xidian University, Shaanxi 710071, China, School of Telecommunication Engineering, Xidian University, Shaanxi 710071, China, School of Telecommunication Engineering, Xidian University, Shaanxi 710071, China, School of Telecommunication Engineering, Xidian University, Shaanxi 710071, China Key Laboratory of Measurement and Control of Complex Systems of Engineering (Southeast University), Ministry of Education.
Abstract:
Recently, anchor graph-based multi-view clustering has been proven to be highly efficient for large-scale data processing. However, most existing anchor graph-based clustering methods require post-processing to obtain clustering labels and are unable to effectively utilize the information within anchor graphs. To solve these problems, we propose an Embedded Feature Selection on Graph-Based Multi-View Clustering (EFSGMC) approach to improve clustering performance. Our method decomposes anchor graphs, taking advantage of their memory efficiency, to obtain clustering labels in a single step without the need for post-processing. Furthermore, we introduce the l2,p-norm for graph-based feature selection, which selects the most relevant data for efficient graph factorization. Lastly, we employ the tensor Schatten p-norm as a tensor rank approximation function to capture the complementary information between different views, ensuring similarity between cluster assignment matrices. Experimental results on five real-world datasets demonstrate that our proposed method outperforms state-of-the-art approaches.



Paperid:1887
Authors:Xilong Zhao, Siyuan Bian, Yaoyun Zhang, Yuliang Zhang, Qinying Gu, Xinbing Wang, Chenghu Zhou, Nanyang Ye
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Out-of-distribution (OOD) generalization has long been a challenging problem that remains largely unsolved. Gaussian processes (GP), as popular probabilistic model classes, especially in the small-data regime, are presumed to have strong OOD generalization abilities. Surprisingly, their OOD generalization abilities have been under-explored compared with other lines of GP research. In this paper, we identify that GP is not free from this problem and propose a domain invariant learning algorithm for Gaussian processes (DIL-GP) with a min-max optimization on the likelihood. DIL-GP discovers the heterogeneity in the data and enforces invariance across partitioned subsets of data. We further extend DIL-GP to improve Bayesian optimization's adaptability to changing environments. Numerical experiments demonstrate the superiority of DIL-GP for predictions on several synthetic and real-world datasets. We further demonstrate the effectiveness of the DIL-GP Bayesian optimization method on a PID parameter tuning experiment for a quadrotor. The full version and source code are available at: https://github.com/Billzxl/DIL-GP.



Paperid:1888
Authors:Yanxuan Zhao, Peng Zhang, Guopeng Sun, Zhigong Yang, Jianqiang Chen, Yueqing Wang
Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Computational Aerodynamics Institute, China Aerodynamics Research and Development Center
Abstract:
Engineering design methods aim to generate new designs that meet desired performance requirements. Past work has directly introduced conditional Generative Adversarial Networks (cGANs) into this field and achieved promising results in single-point design problems (one performance requirement under one working condition). However, these methods assume that the performance requirements are distributed in a categorical space, which is not reasonable in these scenarios. Although Continuous conditional GANs (CcGANs) introduce Vicinal Risk Minimization (VRM) to reduce the performance loss caused by this assumption, they still face the following challenges: 1) CcGANs cannot handle multi-point design problems (multiple performance requirements under multiple working conditions). 2) Their training process is time-consuming due to the high computational complexity of the vicinal loss. To address these issues, a Continuous conditional Diffusion Probabilistic Model (CcDPM) is proposed, which for the first time introduces the diffusion model into the engineering design area and VRM into the diffusion model. CcDPM adopts a novel sampling method called multi-point design sampling to deal with multi-point design problems. Moreover, a k-d tree is used in the training process of CcDPM to shorten the calculation time of the vicinal loss, speeding up the training process by 2-300 times in our experiments. Experiments on a synthetic problem and three real-world design problems demonstrate that CcDPM outperforms state-of-the-art GAN models.



Paperid:1889
Authors:Zhe Zhao, Pengkun Wang, Haibin Wen, Yudong Zhang, Zhengyang Zhou, Yang Wang
University of Science and Technology of China City University of Hong Kong, University of Science and Technology of China, Shaoguan University, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China
Abstract:
Graph neural networks (GNNs) have achieved state-of-the-art results on many graph representation learning tasks by exploiting statistical correlations. However, numerous observations have shown that such correlations may not reflect the true causal mechanisms underlying the data and thus may hamper the ability of the model to generalize beyond the observed distribution. To address this problem, we propose an Information-based Causal Learning (ICL) framework that combines information theory and causality to analyze and improve graph representation learning, transforming informational relevance into causal dependence. Specifically, we first introduce a multi-objective mutual information optimization objective, derived from information-theoretic analysis and causal learning principles, to simultaneously extract invariant and interpretable causal information and reduce reliance on non-causal information in correlations. To optimize this multi-objective objective, we design a causal disentanglement layer that effectively decouples the causal and non-causal information in the graph representations. Moreover, due to the intractability of mutual information estimation, we derive variational bounds that enable us to transform the above objective into a tractable loss function. To balance the multiple information objectives and avoid optimization conflicts, we leverage multi-objective gradient descent to achieve a stable and efficient transformation from informational correlation to causal dependency. Our approach provides important insights into modulating the information flow in GNNs to enhance their reliability and generalization. Extensive experiments demonstrate that our approach significantly improves the robustness and interpretability of GNNs across different distribution shifts. Visual analysis demonstrates how our method converts informative dependencies in representations into causal dependencies.



Paperid:1890
Authors:Shenghe Zheng, Hongzhi Wang, Tianyu Mu
Massive Data Computing Lab, Harbin Institute of Technology, Massive Data Computing Lab, Harbin Institute of Technology, Massive Data Computing Lab, Harbin Institute of Technology
Abstract:
Neural predictors have shown great potential in the evaluation process of neural architecture search (NAS). However, current predictor-based approaches overlook the fact that training a predictor requires a considerable number of trained neural networks as the labeled training set, which is costly to obtain. Therefore, the critical issue in utilizing predictors for NAS is to train a high-performance predictor using as few trained neural networks as possible. Although some methods attempt to address this problem through unsupervised learning, they often result in inaccurate predictions. We argue that the unsupervised tasks intended for common graph data are too challenging for neural networks, causing unsupervised training to be susceptible to performance crashes in NAS. To address this issue, we propose a CurricuLum-guided Contrastive Learning framework for neural Predictors (DCLP). Our method simplifies the contrastive task by designing a novel curriculum to enhance the stability of the unlabeled training data distribution during contrastive training. Specifically, we propose a scheduler that ranks the training data according to the contrastive difficulty of each sample and then feeds them to the contrastive learner in order. This approach concentrates the training data distribution and makes contrastive training more efficient. With our method, the contrastive learner incrementally learns feature representations via unsupervised data on a smooth learning curve, avoiding the performance crashes that may occur with excessively variable training data distributions. We experimentally demonstrate that DCLP achieves high accuracy and efficiency compared with existing predictors, and shows promising potential to discover superior architectures in various search spaces when combined with search strategies. Our code is available at: https://github.com/Zhengsh123/DCLP.



Paperid:1891
Authors:Churan Zhi, Junbao Zhuo, Shuhui Wang
Beijing Jiaotong University, Beijing, China, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Abstract:
In this paper, we address unsupervised domain adaptation under noisy environments, which is more challenging and practical than traditional domain adaptation. In this scenario, the model is prone to overfitting noisy labels, resulting in a more pronounced domain shift and a notable decline in overall model performance. Previous methods employed prototype-based approaches for domain adaptation on robust feature spaces. However, these approaches struggle to effectively classify classes with similar features under noisy environments. To address this issue, we propose a new method to detect and correct confusing class pairs. We first divide classes into easy and hard classes based on the small-loss criterion. We then leverage the top-2 predictions for each sample, after aligning the source and target domains, to find the confusing pair among the hard classes. We apply label correction to the noisy samples within the confusing pair. With the proposed label correction method, we can train our model with more accurate labels. Extensive experiments confirm the effectiveness of our method and demonstrate its favorable performance compared with existing state-of-the-art methods. Our code is publicly available at https://github.com/Hehxcf/CPC/.
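As an illustration of how a confusing class pair might be found from top-2 predictions, the sketch below counts co-occurring (top-1, top-2) class pairs among the hard classes and returns the most frequent one; the counting rule is a hypothetical simplification of the paper's procedure.

import numpy as np
from collections import Counter

def most_confused_pair(probs, hard_classes):
    """Count (top-1, top-2) pairs among hard classes and return the most frequent pair."""
    top2 = np.argsort(probs, axis=1)[:, -2:]                # indices of the two largest probs
    pairs = Counter()
    for second, first in top2:                              # argsort is ascending, so (2nd, 1st)
        if first in hard_classes and second in hard_classes:
            pairs[tuple(sorted((first, second)))] += 1
    return pairs.most_common(1)[0] if pairs else None

probs = np.random.dirichlet(np.ones(5), size=100)           # placeholder predicted probabilities
print(most_confused_pair(probs, hard_classes={1, 3, 4}))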



Paperid:1892
Authors:Mingjian Zhi, Yuanguo Bi, Wenchao Xu, Haozhao Wang, Tianao Xiang
Northeastern University, China, Northeastern University, China, The Hong Kong Polytechnic University, Hong Kong, China, Huazhong University of Science and Technology, China, Northeastern University, China
Abstract:
Personalized Federated Learning (pFL) can effectively exploit the non-IID data from distributed clients by customizing personalized models. Existing pFL methods either simply take the local model as a whole for aggregation or require significant training overhead to induce the inter-client personalized weights, so clients cannot efficiently exploit the mutually relevant knowledge from each other. In this paper, we propose a knowledge-aware parameter coaching scheme in which each client can swiftly and granularly refer to the parameters of other clients to guide its local training, whereby accurate personalized client models can be efficiently produced without contradictory knowledge. Specifically, a novel regularizer is designed to conduct layer-wise parameter coaching via a relation cube, which is constructed based on the knowledge represented by the layered parameters of all clients. Then, we develop an optimization method to update the relation cube and the parameters of each client. It is theoretically demonstrated that the convergence of the proposed method can be guaranteed under both convex and non-convex settings. Extensive experiments on various datasets show that the proposed method achieves better performance than the state-of-the-art baselines in terms of accuracy and convergence speed.



Paperid:1893
Authors:Dianyu Zhong, Yiqin Yang, Qianchuan Zhao
Tsinghua University, Tsinghua University, Tsinghua University
Abstract:
The large action space is one fundamental obstacle to deploying Reinforcement Learning methods in the real world. Numerous redundant actions cause the agent to make repeated or invalid attempts, even leading to task failure. Although current algorithms conduct some initial explorations of this issue, they either rely on rule-based systems or depend on expert demonstrations, which significantly limits their applicability in many real-world settings. In this work, we provide a theoretical analysis of which actions can be eliminated in policy optimization and propose a novel redundant action filtering mechanism. Unlike other works, our method constructs the similarity factor by estimating the distance between state distributions, which requires no prior knowledge. In addition, we combine a modified inverse model to avoid extensive computation in high-dimensional state spaces. We reveal the underlying structure of action spaces and propose a simple yet efficient redundant action filtering mechanism named No Prior Mask (NPM) based on the above techniques. We show the superior performance of our method by conducting extensive experiments on high-dimensional, pixel-input, and stochastic problems with various action redundancy tasks. Our code is publicly available at https://github.com/zhongdy15/npm.



Paperid:1894
Authors:Ruizhe Zhong, Junjie Ye, Zhentao Tang, Shixiong Kai, Mingxuan Yuan, Jianye Hao, Junchi Yan
Shanghai Jiao Tong University, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab Tianjin University, Shanghai Jiao Tong University
Abstract:
Pre-routing timing prediction has recently been studied for evaluating the quality of a candidate cell placement in chip design. It involves directly estimating the timing metrics at both the pin level (slack, slew) and the edge level (net delay, cell delay), without time-consuming routing. However, it often suffers from signal decay and error accumulation due to the long timing paths in large-scale industrial circuits. To address these challenges, we propose a two-stage approach. First, we propose global circuit training to pre-train a graph auto-encoder that learns the global graph embedding from the circuit netlist. Second, we use a novel node updating scheme for message passing on GCN, following the topological sorting sequence of the learned graph embedding and circuit graph. This scheme residually models the local time delay between two adjacent pins in the updating sequence and extracts the lookup table information inside each cell via a new attention mechanism. To handle large-scale circuits efficiently, we introduce an order-preserving partition scheme that reduces memory consumption while maintaining the topological dependencies. Experiments on 21 real-world circuits achieve a new SOTA R2 of 0.93 for slack prediction, which significantly surpasses the 0.59 of the previous SOTA method. Code will be available at: https://github.com/Thinklab-SJTU/EDA-AI.



Paperid:1895
Authors:Chaoyang Zhou, Zengmao Wang, Bo Du, Yong Luo
School of Computer Science, Wuhan University, School of Computer Science, Wuhan University National Engineering Research Center for Multimedia Software, Wuhan University Institute of Artificial Intelligence, Wuhan University Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University Hubei Luojia Laboratory, China, School of Computer Science, Wuhan University National Engineering Research Center for Multimedia Software, Wuhan University Institute of Artificial Intelligence, Wuhan University Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University Hubei Luojia Laboratory, China, School of Computer Science, Wuhan University National Engineering Research Center for Multimedia Software, Wuhan University Institute of Artificial Intelligence, Wuhan University Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University Hubei Luojia Laboratory, China
Abstract:
Multi-source domain adaptation (MSDA) aims to transfer knowledge from multiple source domains to an unlabeled target domain. In this paper, we propose a cycle self-refinement domain adaptation method, which progressively learns the dominant transferable knowledge in each source domain in a cyclic manner. Specifically, several source-specific networks and a domain-ensemble network are adopted in the proposed method. The source-specific networks provide the dominant transferable knowledge in each source domain for instance-level ensembling of the predictions on the target-domain samples. Then the samples with high-confidence ensemble predictions are used to refine the domain-ensemble network. Meanwhile, to guide each source-specific network to learn more dominant transferable knowledge, we force the features of the target domain from the domain-ensemble network and the features of each source domain from the corresponding source-specific network to be aligned with their predictions from the corresponding networks. Thus the adaptation ability of the source-specific networks and the domain-ensemble network can be improved progressively. Extensive experiments on Office-31, Office-Home and DomainNet show that the proposed method outperforms the state-of-the-art methods on most tasks.



Paperid:1896
Authors:Huilin Zhou, Hao Zhang, Huiqi Deng, Dongrui Liu, Wen Shen, Shih-Han Chan, Quanshi Zhang
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University University of California San Diego, Shanghai Jiao Tong University
Abstract:
This paper explains the generalization power of a deep neural network (DNN) from the perspective of interactions. Although there is no universally accepted definition of the concepts encoded by a DNN, the sparsity of interactions in a DNN has been proved, i.e., the output score of a DNN can be well explained by a small number of interactions between input variables. In this way, to some extent, we can consider such interactions as interactive concepts encoded by the DNN. Therefore, in this paper, we derive an analytic explanation of the inconsistency of concepts of different complexities. This may shed new light on using the generalization power of concepts to explain the generalization power of the entire DNN. Besides, we discover that a DNN with stronger generalization power usually learns simple concepts more quickly and encodes fewer complex concepts. We also discover the detouring dynamics of learning complex concepts, which explains both the high learning difficulty and the low generalization power of complex concepts. The code will be released when the paper is accepted.



Paperid:1897
Authors:Qianrui Zhou, Hua Xu, Hao Li, Hanlei Zhang, Xiaohan Zhang, Yifan Wang, Kai Gao
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China, School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China
Abstract:
Multimodal intent recognition aims to leverage diverse modalities such as expressions, body movements and tone of speech to comprehend a user's intent, constituting a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, the majority of existing methods ignore potential correlations among different modalities and have limitations in effectively learning semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address the above challenges. To establish an optimal multimodal semantic environment for the text modality, we develop a modality-aware prompting module (MAP), which effectively aligns and fuses features from the text, video and audio modalities with similarity-based modality alignment and a cross-modality attention mechanism. Based on the modality-aware prompt and ground-truth labels, the proposed token-level contrastive learning framework (TCL) constructs augmented samples and employs the NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal textual semantic insights derived from intent labels to guide the learning processes of the other modalities in return. Extensive experiments show that our method achieves remarkable improvements compared to state-of-the-art methods. Additionally, ablation analyses demonstrate the superiority of the modality-aware prompt over the handcrafted prompt, which holds substantial significance for multimodal prompt learning. The codes are released at https://github.com/thuiar/TCL-MAP.
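For reference, the NT-Xent loss mentioned above can be sketched as follows for two views of a batch of embeddings: each embedding's augmented view is its positive and all other embeddings are negatives. The paper applies this loss on the label token; the generic form and temperature below are assumptions.

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent: for each embedding, its augmented view is the positive, all others are negatives."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)                       # (2B, d)
    sim = z @ z.t() / temperature                                      # pairwise similarities
    sim.fill_diagonal_(float('-inf'))                                  # exclude self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])  # index of each positive
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(16, 64), torch.randn(16, 64))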



Paperid:1898
Authors:Qihua Zhou, Jingcai Guo, Song Guo, Ruibin Li, Jie Zhang, Bingjie Wang, Zhenda Xu
The Hong Kong Polytechnic University, Hong Kong, The Hong Kong Polytechnic University, Hong Kong The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, China, The Hong Kong University of Science and Technology, Hong Kong, The Hong Kong Polytechnic University, Hong Kong, The Hong Kong Polytechnic University, Hong Kong, The Hong Kong Polytechnic University, Hong Kong, Hong Kong Polytechnic University, Hong Kong
Abstract:
The explosive growth of video traffic on today's Internet promotes the rise of Neural-enhanced Video Streaming (NeVS), which effectively improves the rate-distortion trade-off by employing a cheap neural super-resolution model for quality enhancement on the receiver side. Overlooked by existing work, we reveal that the NeVS pipeline may suffer from a practical threat, where the crucial codec component (i.e., encoder for compression and decoder for restoration) can trigger adversarial attacks in a man-in-the-middle manner to significantly destroy video recovery performance and ultimately cause the malfunction of downstream video perception tasks. In this paper, we make the first attempt to inspect the vulnerability of NeVS and discover a novel adversarial attack, called codec hijacking, where the injected invisible perturbation conspires with the malicious encoding matrix by reorganizing the spatial-temporal bit allocation within the bitstream size budget. Such a zero-day vulnerability makes our attack hard to defend against because there is no visual distortion on the recovered videos until the attack happens. More seriously, this attack can be extended to diverse enhancement models, thus exposing a wide range of video perception tasks to the threat. Evaluation on a state-of-the-art video codec benchmark illustrates that our attack degrades the recovery performance of NeVS significantly more than previous attack methods. The damaged video quality ultimately leads to obvious malfunction of downstream tasks with over a 75% success rate. We hope to draw public attention to codec hijacking and its defence.



Paperid:1899
Authors:Renzhe Zhou, Chen-Xiao Gao, Zongzhang Zhang, Yang Yu
National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China
Abstract:
Generalization and sample efficiency have been longstanding issues in reinforcement learning, and thus the field of Offline Meta-Reinforcement Learning (OMRL) has gained increasing attention due to its potential for solving a wide range of problems with static and limited offline data. Existing OMRL methods often assume sufficient training tasks and data coverage to apply contrastive learning to extract task representations. However, such assumptions do not hold in several real-world applications and thus undermine the generalization ability of the representations. In this paper, we consider OMRL with two types of data limitations, limited training tasks and limited behavior diversity, and propose a novel algorithm called GENTLE for learning generalizable task representations in the face of data limitations. GENTLE employs a Task Auto-Encoder (TAE), an encoder-decoder architecture that extracts the characteristics of the tasks. Unlike existing methods, the TAE is optimized solely by reconstructing state transitions and rewards, which captures the generative structure of the task models and produces generalizable representations when training tasks are limited. To alleviate the effect of limited behavior diversity, we consistently construct pseudo-transitions to align the data distribution used to train the TAE with the data distribution encountered during testing. Empirically, GENTLE significantly outperforms existing OMRL methods on both in-distribution tasks and out-of-distribution tasks across both the given-context protocol and the one-shot protocol.



Paperid:1900
Authors:Xiaochen Zhou, Xudong Wang
Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Training data in federated learning (FL) frameworks can have label noise, since they must be stored and annotated on clients' devices. If trained over such corrupted data, the models learn the wrong knowledge of label noise, which severely degrades their performance. Although several FL schemes are designed to combat label noise, they suffer performance degradation when the clients' devices only have limited local training samples. To this end, a new scheme called federated label-noise learning (FedLNL) is developed in this paper. The key problem of FedLNL is how to estimate a noise transition matrix (NTM) accurately in the case of limited local training samples. If a gradient-based update method is used to update the local NTM on each client's device, it can generate excessively large gradients for the local NTM, causing a high estimation error of the local NTM. To tackle this issue, an alternating update method for the local NTM and the local classifier is designed in FedLNL, where the local NTM is updated by a Bayesian inference-based update method. Such an alternating update method renders the loss functions of existing NTM-based schemes inapplicable to FedLNL. To enable federated optimization of FedLNL, a new regularizer on the classifier parameters, called the local diversity product regularizer, is designed for the loss function of FedLNL. The results show that FedLNL improves the test accuracy of a trained model by up to 25.98%, compared with state-of-the-art FL schemes that tackle label-noise issues.
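
For context, NTM-based training typically maps the classifier's clean-label posterior through the transition matrix before computing the loss. The sketch below shows a standard forward loss-correction step under that assumption; names and shapes are illustrative, and it is not FedLNL's alternating Bayesian update.

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits, noisy_labels, ntm):
    """Forward correction with a noise transition matrix (generic sketch).

    logits      : (N, C) classifier outputs
    noisy_labels: (N,)   observed (possibly noisy) labels
    ntm         : (C, C) with ntm[i, j] = P(observed = j | clean = i)
    """
    clean_probs = F.softmax(logits, dim=-1)   # estimated P(clean label | x)
    noisy_probs = clean_probs @ ntm           # implied P(observed label | x)
    return F.nll_loss(torch.log(noisy_probs + 1e-12), noisy_labels)
```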



Paperid:1901
Authors:Anjie Zhu, Peng-Fei Zhang, Ruihong Qiu, Zetao Zheng, Zi Huang, Jie Shao
University of Electronic Science and Technology of China, China, The University of Queensland, Australia, The University of Queensland, Australia, University of Electronic Science and Technology of China, China, The University of Queensland, Australia, University of Electronic Science and Technology of China, China
Abstract:
Intrinsic motivation lies at the heart of exploration in reinforcement learning, which is primarily driven by the agent's inherent satisfaction rather than external feedback from the environment. However, in recent, more challenging procedurally-generated environments with high stochasticity and uninformative extrinsic rewards, we identify two significant issues in applying intrinsic motivation. (1) State representation collapse: in existing methods, the representations learned within intrinsic motivation are likely to neglect the distinction among different states and be distracted by task-irrelevant information introduced by the stochasticity. (2) Insufficient interrelation among dynamics: the unsuccessful guidance provided by the uninformative extrinsic reward makes the dynamics learning in intrinsic motivation less effective. In light of the above observations, a novel Behavioral metric with Cyclic Dynamics (BCD) is proposed, which considers both cumulative and immediate effects and facilitates the abstraction and exploration of the agent. For the behavioral metric, the successor feature is utilized to reveal the expected future rewards and alleviate the heavy reliance of previous methods on extrinsic rewards. Moreover, latent variable and vector quantization techniques are employed to enable an accurate measurement of the transition function in a discrete and interpretable manner. In addition, cyclic dynamics is established to capture the interrelations between state and action, thereby providing a thorough awareness of environmental dynamics. Extensive experiments conducted on procedurally-generated environments demonstrate the state-of-the-art performance of our proposed BCD.



Paperid:1902
Authors:Jianping Zhu, Xin Guo, Yang Chen, Yao Yang, Wenbo Li, Bo Jin, Fei Wu
Dalian University of Technology, Dalian University of Technology, ZHEJIANG UNIVERSITY, Zhejiang Lab, Zhejiang laboratory, Dalian University of Technology, Zhejiang University, China
Abstract:
Long sequence prediction has broad and significant application value in fields such as finance, wind power, and weather. However, the complex long-term dependencies of long sequence data and potential domain-shift problems limit the effectiveness of traditional models in practical scenarios. To this end, we propose an Adaptive Meta-Learning Probabilistic Inference Framework (AMPIF) based on sequence decomposition, which can effectively enhance the long sequence prediction ability of various base models. Specifically, we first decouple complex sequences into seasonal and trend components through a frequency-domain decomposition module. Then, we design an adaptive meta-learning task construction strategy, which divides the seasonal and trend components into different tasks through a clustering-matching approach. Finally, we design a dual-stream amortized network (ST-DAN) to capture shared information between seasonal-trend tasks and use the support set to generate task-specific parameters for rapid generalization learning on the query set. We conducted extensive experiments on six datasets, including wind power and finance scenarios, and the results show that our method significantly outperforms baseline methods in prediction accuracy, interpretability, and algorithm stability, and can effectively enhance the long sequence prediction capabilities of base models. The source code is publicly available at https://github.com/Zhu-JP/AMPIF.
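
As a rough illustration of the seasonal-trend split described above, the sketch below separates a 1-D series into a low-frequency trend and a seasonal remainder with an FFT low-pass filter; the cutoff value and function names are assumptions, and the authors' frequency-domain module may differ in detail.

```python
import numpy as np

def decompose(series, cutoff=0.05):
    """Split a 1-D series into a slow-varying trend and a seasonal remainder
    by zeroing high-frequency components (illustrative only)."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    freqs = np.fft.rfftfreq(n)            # normalized frequencies in [0, 0.5]
    spectrum = np.fft.rfft(series)
    low = spectrum.copy()
    low[freqs > cutoff] = 0.0             # keep only low-frequency components
    trend = np.fft.irfft(low, n)
    seasonal = series - trend
    return trend, seasonal
```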



Paperid:1903
Authors:Kun Zhu, Chunhui Zhao
College of Control Science and Engineering, Zhejiang University, Hangzhou, China, College of Control Science and Engineering, Zhejiang University, Hangzhou, China
Abstract:
Causal discovery under the Granger causality framework has attracted widespread attention in time series analysis tasks. Nevertheless, most previous methods are unaware of the underlying causality-disappearing problem, that is, certain weak causalities receive little attention and may be lost during the modeling process, thus leading to biased causal conclusions. Therefore, we propose to introduce joint causal influences (i.e., causal influences from the union of multiple variables) as additional causal indication information to help identify weak causalities. Further, to break the limitation of existing methods that implicitly and coarsely model joint causal influences, we propose a novel hidden variable-driven causal hypergraph neural network to meticulously explore the locality and diversity of joint causal influences, and realize their explicit and fine-grained modeling. Specifically, we introduce hidden variables to construct a causal hypergraph for explicitly characterizing various fine-grained joint causal influences. Then, we customize a dual causal information transfer mechanism (encompassing a multi-level causal path and an information aggregation path) to realize the free diffusion and meticulous aggregation of joint causal influences and facilitate their adaptive learning. Finally, we design a multi-view collaborative optimization constraint to guarantee the characterization diversity of the causal hypergraph and capture remarkable forecasting relationships (i.e., causalities). Experiments are conducted to demonstrate the superiority of the proposed model.



Paperid:1904
Authors:Minqin Zhu, Anpeng Wu, Haoxuan Li, Ruoxuan Xiong, Bo Li, Xiaoqing Yang, Xuan Qin, Peng Zhen, Jiecheng Guo, Fei Wu, Kun Kuang
Department of Computer Science and Technology, Zhejiang University, Department of Computer Science and Technology, Zhejiang University Mohamed bin Zayed University of Artificial Intelligence, Center for Data Science, Peking University, Department of Quantitative Theory and Methods, Emory University, School of Economics and Management, Tsinghua University, Didi Chuxing, Didi Chuxing, Didi Chuxing, Didi Chuxing, Department of Computer Science and Technology, Zhejiang University, Department of Computer Science and Technology, Zhejiang University
Abstract:
Estimating individuals' potential responses to varying treatment doses is crucial for decision-making in areas such as precision medicine and management science. Most recent studies predict counterfactual outcomes by learning a covariate representation that is independent of the treatment variable. However, such independence constraints neglect much of the covariate information that is useful for counterfactual prediction, especially when the treatment variables are continuous. To tackle this issue, in this paper we first theoretically demonstrate the importance of balancing and prognostic representations for unbiased estimation of heterogeneous dose-response curves; that is, the learned representations are constrained to satisfy conditional independence between the covariates and both the treatment variables and the potential responses. Based on this, we propose a novel Contrastive balancing Representation learning Network using a partial distance measure, called CRNet, for estimating heterogeneous dose-response curves without losing the continuity of treatments. Extensive experiments conducted on synthetic and real-world datasets demonstrate that our proposal significantly outperforms previous methods.



Paperid:1905
Authors:Pengfei Zhu, Qian Wang, Yu Wang, Jialu Li, Qinghua Hu
Tianjin University, Tianjin University, Tianjin University, Tianjin University, Tianjin University
Abstract:
Attributed graph clustering is an unsupervised task that partitions nodes into different groups. Self-supervised learning (SSL) shows great potential in handling this task, and some recent studies simultaneously learn multiple SSL tasks to further boost performance. Currently, different SSL tasks are assigned the same set of weights for all graph nodes. However, we observe that some graph nodes whose neighbors are in different groups require significantly different emphases on SSL tasks. In this paper, we propose to dynamically learn the weights of SSL tasks for different nodes and fuse the embeddings learned from different SSL tasks to boost performance. We design an innovative graph clustering approach, namely Dynamically Fusing Self-Supervised Learning (DyFSS). Specifically, DyFSS fuses features extracted from diverse SSL tasks using distinct weights derived from a gating network. To effectively learn the gating network, we design a dual-level self-supervised strategy that incorporates pseudo labels and the graph structure. Extensive experiments on five datasets show that DyFSS outperforms the state-of-the-art multi-task SSL methods by up to 8.66% on the accuracy metric. The code of DyFSS is available at: https://github.com/q086/DyFSS.
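
To make the gating idea concrete, here is a minimal sketch of per-node weighted fusion of embeddings from multiple SSL tasks, assuming PyTorch and the tensor shapes noted in the comments; it mirrors the description above rather than the released DyFSS code.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-node weighted fusion of embeddings produced by K self-supervised
    tasks, with weights from a small gating network."""
    def __init__(self, in_dim, num_tasks):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(in_dim, num_tasks), nn.Softmax(dim=-1))

    def forward(self, node_feats, task_embs):
        # node_feats: (N, in_dim); task_embs: (N, K, emb_dim)
        weights = self.gate(node_feats)                          # (N, K) per-node task weights
        fused = (weights.unsqueeze(-1) * task_embs).sum(dim=1)   # (N, emb_dim)
        return fused, weights
```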



Paperid:1906
Authors:Sheng Zhu, Chun Shen, Shuai Lü, Junhong Wu, Daolong An
Jilin University, Jilin University, Jilin University, Jilin University, Jilin University
Abstract:
CEM-TD3 is a combination scheme using the simple cross-entropy method (CEM) and Twin Delayed Deep Deterministic policy gradient (TD3), and it achieves a satisfactory trade-off between performance and sample efficiency. However, we find that CEM-TD3 cannot fully address the low efficiency of policy search caused by CEM, and the policy gradient learning introduced by TD3 will weaken the diversity of individuals in the population. In this paper, we propose Double Buffers CEM-TD3 (DBCEM-TD3) that optimizes both CEM and TD3. For CEM, DBCEM-TD3 maintains an actor buffer to store the population required for evolution. In each iteration, it only needs to generate a small number of actors to replace the poor actors in the policy buffer to achieve more efficient evolution. The fitness of individuals in the actor buffer decreases exponentially with time, which can avoid premature convergence of the mean actor. For TD3, DBCEM-TD3 maintains a critic buffer with the same number of critics as the number of actors generated in each iteration, and each critic is trained independently by sampling from the shared replay buffer. In each iteration, each newly generated actor uses different critics to guide learning. This ensures more diverse behaviors among the learned actors, enabling richer experiences to be collected during the evaluation phase. We conduct experimental evaluations on five continuous control tasks provided by OpenAI Gym. DBCEM-TD3 outperforms CEM-TD3, TD3, and other classic off-policy reinforcement learning algorithms in terms of performance and sample efficiency.



Paperid:1907
Authors:Tianchen Zhu, Yue Qiu, Haoyi Zhou, Jianxin Li
School of Computer Science and Engineering, Beihang University, School of Computer Science and Engineering, Beihang University, Zhongguancun Laboratory, Beijing, China School of Software, Beihang University, School of Computer Science and Engineering, Beihang University Zhongguancun Laboratory, Beijing, China
Abstract:
Designing accurate reward functions for reinforcement learning (RL) has long been challenging. Preference-based RL (PbRL) offers a promising approach by using human preferences to train agents, eliminating the need for manual reward design. While successful in single-agent tasks, extending PbRL to complex multi-agent scenarios is nontrivial. Existing PbRL methods lack the capacity to comprehensively capture both temporal and cooperative aspects, leading to inadequate reward functions. This work introduces an advanced multi-agent preference learning framework that effectively addresses these limitations. Based on a cascading Transformer architecture, our approach captures both temporal and cooperative dependencies, alleviating issues related to reward uniformity and intricate interactions among agents. Experimental results demonstrate substantial performance improvements in multi-agent cooperative tasks, and the reconstructed reward function closely resembles expert-defined reward functions. The source code is available at https://github.com/catezi/MAPT.



Paperid:1908
Authors:Yifan Zhu, Lijia Yu, Xiao-Shan Gao
Academy of Mathematics and Systems Science, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Software, Chinese Academy of Sciences University of Chinese Academy of Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences University of Chinese Academy of Sciences
Abstract:
Privacy preservation has become increasingly critical with the emergence of social media. Unlearnable examples have been proposed to avoid leaking personal information on the Internet by degrading the generalization abilities of deep learning models. However, our study reveals that unlearnable examples are easily detectable. We provide theoretical results on the linear separability of certain unlearnable poisoned datasets and simple network-based detection methods that can identify all existing unlearnable examples, as demonstrated by extensive experiments. The detectability of unlearnable examples with simple networks motivates us to design a novel defense method. We propose using stronger data augmentations coupled with adversarial noises generated by simple networks to degrade the detectability and thus provide an effective defense against unlearnable examples at a lower cost. Adversarial training with large budgets is a widely-used defense method against unlearnable examples. We establish quantitative criteria between the poison and adversarial budgets, which determine the existence of robust unlearnable examples or the failure of the adversarial defense.



Paperid:1909
Authors:Yonghua Zhu, Lei Feng, Zhenyun Deng, Yang Chen, Robert Amor, Michael Witbrock
NAOInstitute, University of Auckland, NZ School of Computer Science, University of Auckland, NZ, School of Computer Science and Engineering, Nanyang Technological University, Singapore, Department of Computer Science, University of Cambridge, UK, NAOInstitute, University of Auckland, NZ School of Computer Science, University of Auckland, NZ, School of Computer Science, University of Auckland, NZ, NAOInstitute, University of Auckland, NZ School of Computer Science, University of Auckland, NZ
Abstract:
Current research on node classification focuses on dealing with either graph noise or label noise, but few studies consider both. In this paper, we propose a new robust node classification method to simultaneously deal with graph noise and label noise. To do this, we design a graph contrastive loss to conduct local graph learning and employ self-attention to conduct global graph learning. Together, they enable us to improve the expressiveness of node representations by using comprehensive information among nodes. We also utilize pseudo graphs and pseudo labels to deal with graph noise and label noise, respectively. Furthermore, we numerically validate the superiority of our method in terms of robust node classification compared with all comparison methods.



Paperid:1910
Authors:Zhiyu Zhu, Huaming Chen, Jiayu Zhang, Xinyi Wang, Zhibo Jin, Minhui Xue, Dongxiao Zhu, Kim-Kwang Raymond Choo
The University of Sydney, The University of Sydney, SuZhouYierqi, University of Malaya, The University of Sydney, CSIRO's Data61, Wayne State University, University of Texas at San Antonio
Abstract:
To better understand the output of deep neural networks (DNN), attribution-based methods have been an important approach for model interpretability, which assign a score for each input dimension to indicate its importance towards the model outcome. Notably, the attribution methods use the axioms of sensitivity and implementation invariance to ensure the validity and reliability of attribution results. Yet, the existing attribution methods present challenges for effective interpretation and efficient computation. In this work, we introduce MFABA, an attribution algorithm that adheres to axioms, as a novel method for interpreting DNN. Additionally, we provide the theoretical proof and in-depth analysis for the MFABA algorithm, and conduct a large scale experiment. The results demonstrate its superiority by achieving over 101.5142 times faster speed than the state-of-the-art attribution algorithms. The effectiveness of MFABA is thoroughly evaluated through statistical analysis in comparison to other methods, and the full implementation package is open-source at: https://github.com/LMBTough/MFABA.



Paperid:1911
Authors:Huiping Zhuang, Run He, Kai Tong, Ziqian Zeng, Cen Chen, Zhiping Lin
South China University of Technology, South China University of Technology, South China University of Technology, South China University of Technology, South China University of Technology Pazhou Laboratory, China, Nanyang Technological University
Abstract:
Class-incremental learning (CIL) under an exemplar-free constraint has presented a significant challenge. Existing methods adhering to this constraint are prone to catastrophic forgetting, far more so than replay-based techniques that retain access to past samples. In this paper, to solve the exemplar-free CIL problem, we propose a Dual-Stream Analytic Learning (DS-AL) approach. The DS-AL contains a main stream offering an analytical (i.e., closed-form) linear solution, and a compensation stream improving the inherent under-fitting limitation due to adopting linear mapping. The main stream redefines the CIL problem as a Concatenated Recursive Least Squares (C-RLS) task, allowing an equivalence between CIL and its joint-learning counterpart. The compensation stream is governed by a Dual-Activation Compensation (DAC) module. This module re-activates the embedding with an activation function different from that of the main stream, and seeks fitting compensation by projecting the embedding to the null space of the main stream's linear mapping. Empirical results demonstrate that the DS-AL, despite being an exemplar-free technique, delivers performance comparable with or better than that of replay-based methods across various datasets, including CIFAR-100, ImageNet-100 and ImageNet-Full. Additionally, the equivalence property of the C-RLS allows the DS-AL to execute CIL in a phase-invariant manner. This is evidenced by a never-before-seen 500-phase CIL ImageNet task, which performs at a level identical to a 5-phase one. Our codes are available at https://github.com/ZHUANGHP/Analytic-continual-learning.
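
The closed-form flavor of analytic learning can be illustrated with a generic block recursive least-squares update of a linear classifier head: new class data is absorbed without revisiting old samples. The sketch below uses assumed names and shapes and is not the authors' C-RLS formulation.

```python
import numpy as np

def rls_update(W, R, X_new, Y_new):
    """One block recursive least-squares update (generic sketch).

    W     : (d, c) current linear classifier weights
    R     : (d, d) running inverse autocorrelation matrix of past embeddings
    X_new : (n, d) embeddings of the newly arrived samples
    Y_new : (n, c) their one-hot targets (new columns can be zero-padded in)
    """
    n = X_new.shape[0]
    # Woodbury identity: fold the new data block into R
    K = R @ X_new.T @ np.linalg.inv(np.eye(n) + X_new @ R @ X_new.T)
    R = R - K @ X_new @ R
    # correct the weights toward the new targets in closed form
    W = W + R @ X_new.T @ (Y_new - X_new @ W)
    return W, R
```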



Paperid:1912
Authors:Zhengyang Zhuge, Jiaxing Wang, Yong Li, Yongjun Bao, Peisong Wang, Jian Cheng
Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, JD.com, JD.com, JD.com, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences AiRiA, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences AiRiA
Abstract:
Nowadays, sample selection is drawing increasing attention. By extracting and training only on the most informative subset, sample selection can effectively reduce the training cost. Although sample selection is effective in conventional supervised learning, applying it to Masked Image Modeling (MIM) still poses challenges due to the gap between sample-level selection and patch-level pre-training. In this paper, we inspect sample selection in MIM pre-training and find that basic selection suffers from performance degradation. We attribute this degradation primarily to two factors: the random mask strategy and the simple averaging function. We then propose Patch-Aware Sample Selection (PASS), including a low-cost Dynamic Trained Mask Predictor (DTMP) and a Weighted Selection Score (WSS). DTMP consistently masks the informative patches in samples, ensuring a relatively accurate representation of the selection score. WSS enhances the selection score using patch-level disparity. Extensive experiments show the effectiveness of PASS in selecting the most informative subset and accelerating pre-training. PASS exhibits superior performance across various datasets, MIM methods, and downstream tasks. In particular, PASS improves MAE by 0.7% on ImageNet-1K while utilizing only 37% of the data budget and achieves a ~1.7x speedup.



Paperid:1913
Authors:Chen-Chen Zong, Ye-Wen Wang, Ming-Kun Xie, Sheng-Jun Huang
College of Computer Science and Technology/Artificial Intelligence, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China, College of Computer Science and Technology/Artificial Intelligence, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China, College of Computer Science and Technology/Artificial Intelligence, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China, College of Computer Science and Technology/Artificial Intelligence, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China
Abstract:
Learning with noisy labels can significantly hinder the generalization performance of deep neural networks (DNNs). Existing approaches address this issue through loss correction or example selection methods. However, these methods often rely on the model's predictions obtained from the softmax function, which can be overconfident and unreliable. In this study, we identify the translation invariance of the softmax function as the underlying cause of this problem and propose the Dirichlet-based Prediction Calibration (DPC) method as a solution. Our method introduces a calibrated softmax function that breaks the translation invariance by incorporating a suitable constant in the exponent term, enabling more reliable model predictions. To ensure stable model training, we leverage a Dirichlet distribution to assign probabilities to predicted labels and introduce a novel evidential deep learning (EDL) loss. The proposed loss function encourages positive and sufficiently large logits for the given label, while penalizing negative and small logits for other labels, leading to more distinct logits and facilitating better example selection based on a large-margin criterion. Through extensive experiments on diverse benchmark datasets, we demonstrate that DPC achieves state-of-the-art performance. The code is available at https://github.com/chenchenzong/DPC.
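
One simple way to break translation invariance, consistent with the description above though not necessarily the paper's exact formula, is to add a constant to each exponentiated logit before normalizing; a short sketch:

```python
import torch

def calibrated_softmax(logits, c=1.0):
    """Softmax variant that is NOT invariant to shifting all logits:
    exponentiate, add a constant c, then normalize (illustrative form only)."""
    ev = torch.exp(logits) + c
    return ev / ev.sum(dim=-1, keepdim=True)

z = torch.tensor([[2.0, 1.0, 0.5]])
print(calibrated_softmax(z))           # differs from the standard softmax of z
print(calibrated_softmax(z + 10.0))    # shifting all logits now changes the output
```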



Paperid:1914
Authors:Xin Zou, Weiwei Liu
Wuhan University, Wuhan University
Abstract:
Out-of-distribution (OOD) generalization has attracted increasing research attention in recent years, due to its promising experimental results in real-world applications. In this paper, we study the confidence set prediction problem in the OOD generalization setting. Split conformal prediction (SCP) is an efficient framework for handling the confidence set prediction problem. However, the validity of SCP requires the examples to be exchangeable, which is violated in the OOD setting. Empirically, we show that trivially applying SCP results in a failure to maintain the marginal coverage when the unseen target domain is different from the source domain. To address this issue, we develop a method for forming confident prediction sets in the OOD setting and theoretically prove the validity of our method. Finally, we conduct experiments on simulated data to empirically verify the correctness of our theory and the validity of our proposed method.
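
For reference, the standard split conformal procedure that the paper builds on computes a finite-sample-corrected quantile of calibration non-conformity scores and keeps every label whose score falls below it. A generic sketch follows; the score definitions and variable names are assumptions, and the paper's OOD-adapted procedure differs.

```python
import numpy as np

def split_conformal_set(cal_scores, test_scores, alpha=0.1):
    """Standard split conformal prediction set (generic sketch).

    cal_scores : (n,) non-conformity scores of held-out calibration examples,
                 e.g. 1 - softmax probability of the true label.
    test_scores: (K,) non-conformity score of the test input for each label.
    Returns the indices of labels kept in the prediction set.
    """
    cal_scores = np.asarray(cal_scores)
    test_scores = np.asarray(test_scores)
    n = len(cal_scores)
    # finite-sample-corrected quantile level: ceil((n+1)(1-alpha)) / n
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(cal_scores, q_level, method="higher")
    return np.where(test_scores <= q_hat)[0]
```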



Paperid:1915
Authors:Xinying Zou, Samir M. Perlaza, Iñaki Esnaola, Eitan Altman
INRIA, Centre Inria d'Université Côte d'Azur, INRIA, Centre Inria d'Université Côte d'Azur Dept. of Electrical and Computer Engineering, Princeton University, Princeton N.J. 08544, USA GAATI Laboratory, Université de la Polynésie Française, Faaa 98702, French Polynesia, Dept. of Automatic Control and Systems Engineering, University of Sheffield, Sheffield S1 3JD, UK Dept. of Electrical and Computer Engineering, Princeton University, Princeton NJ 08544, USA, INRIA, Centre Inria d'Université Côte d'Azur Laboratoire d’Informatique d’Avignon, Université d’Avignon, France
Abstract:
In this paper, the worst-case probability measure over the data is introduced as a tool for characterizing the generalization capabilities of machine learning algorithms. More specifically, the worst-case probability measure is a Gibbs probability measure and the unique solution to the maximization of the expected loss under a relative entropy constraint with respect to a reference probability measure. Fundamental generalization metrics, such as the sensitivity of the expected loss, the sensitivity of the empirical risk, and the generalization gap are shown to have closed-form expressions involving the worst-case data-generating probability measure. Existing results for the Gibbs algorithm, such as characterizing the generalization gap as a sum of mutual information and lautum information, up to a constant factor, are recovered. A novel parallel is established between the worst-case data-generating probability measure and the Gibbs algorithm. Specifically, the Gibbs probability measure is identified as a fundamental commonality of the model space and the data space for machine learning algorithms.
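
In symbols, the construction described above can be sketched as follows; the notation here is illustrative rather than the paper's.

```latex
% Worst-case data-generating measure: maximize expected loss within a
% relative-entropy ball around a reference measure P_0.
\[
  P^{\star} \in \arg\max_{P}\; \mathbb{E}_{Z \sim P}\big[\ell(\theta, Z)\big]
  \quad \text{s.t.} \quad D\!\left(P \,\|\, P_0\right) \le c ,
\]
% Its solution is a Gibbs (exponentially tilted) measure,
\[
  \frac{\mathrm{d}P^{\star}}{\mathrm{d}P_0}(z) \;\propto\; \exp\!\big(\beta\,\ell(\theta, z)\big),
\]
% where the "inverse temperature" beta is determined by the radius c.
```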



Paperid:1916
Authors:Pedro Zuidberg Dos Martires
Örebro University
Abstract:
Probabilistic circuits (PCs) have gained prominence in recent years as a versatile framework for discussing probabilistic models that support tractable queries and are yet expressive enough to model complex probability distributions. Nevertheless, tractability comes at a cost: PCs are less expressive than neural networks. In this paper we introduce probabilistic neural circuits (PNCs), which strike a balance between PCs and neural nets in terms of tractability and expressive power. Theoretically, we show that PNCs can be interpreted as deep mixtures of Bayesian networks. Experimentally, we demonstrate that PNCs constitute powerful function approximators.



Paperid:1917
Authors:Zain Alabedeen Ali, Konstantin Yakovlev
Moscow Institute of Physics and Technology, Moscow, Russia, Federal Research Center for Computer Science and Control of Russian Academy of Sciences, Moscow, Russia AIRI, Moscow, Russia
Abstract:
We consider an Anonymous Multi-Agent Path-Finding (AMAPF) problem where the set of agents is confined to a graph, a set of goal vertices is given, and each of these vertices has to be reached by some agent. The problem is to find an assignment of the goals to the agents as well as the collision-free paths, and we are interested in finding the solution with the optimal makespan. A well-established approach to solving this problem is to reduce it to a special type of graph search problem, i.e., to the problem of finding a maximum flow on an auxiliary graph induced by the input one. The size of the former graph may be very large and the search on it may become a bottleneck. To this end, we suggest a specific search algorithm that leverages the idea of exploring the search space not through separate search states but rather through bulks of them simultaneously. That is, we implicitly compress, store, and expand bulks of search states as single states, which results in a substantial reduction in runtime and memory. Empirically, the resulting AMAPF solver demonstrates superior performance compared to the state-of-the-art competitor and is able to solve all publicly available MAPF instances from the well-known MovingAI benchmark in less than 30 seconds.



Paperid:1918
Authors:Yanwen Ba, Xuan Liu, Xinning Chen, Hao Wang, Yang Xu, Kenli Li, Shigeng Zhang
Hunan University, Hunan University, Hunan University, Hunan University, Hunan University, Hunan University, Central South University
Abstract:
While decentralized training is attractive in multi-agent reinforcement learning (MARL) for its excellent scalability and robustness, its inherent coordination challenges in collaborative tasks result in numerous interactions for agents to learn good policies. To alleviate this problem, action advising methods make experienced agents share their knowledge about what to do, while less experienced agents strictly follow the received advice. However, this method of sharing and utilizing knowledge may hinder the team's exploration of better states, as agents can be unduly influenced by suboptimal or even adverse advice, especially in the early stages of learning. Inspired by the fact that humans can learn not only from the success but also from the failure of others, this paper proposes a novel knowledge sharing framework called Cautiously-Optimistic kNowledge Sharing (CONS). CONS enables each agent to share both positive and negative knowledge and cautiously assimilate knowledge from others, thereby enhancing the efficiency of early-stage exploration and the agents' robustness to adverse advice. Moreover, considering the continuous improvement of policies, agents value negative knowledge more in the early stages of learning and shift their focus to positive knowledge in the later stages. Our framework can be easily integrated into existing Q-learning based methods without introducing additional training costs. We evaluate CONS in several challenging multi-agent tasks and find it excels in environments where optimal behavioral patterns are difficult to discover, surpassing the baselines in terms of convergence rate and final performance.



Paperid:1919
Authors:Raphaël Berthon, Joost-Pieter Katoen, Munyque Mittelmann, Aniello Murano
RWTH Aachen University, RWTH Aachen University, University of Naples Federico II, University of Naples Federico II
Abstract:
Strategies synthesized using formal methods can be complex and often require infinite memory, which does not correspond to the expected behavior when trying to model Multi-Agent Systems (MAS). To capture such behaviors, natural strategies are a recently proposed framework striking a balance between the ability of agents to strategize with memory and the complexity of the model-checking problem, but until now they have been restricted to fully deterministic settings. For the first time, we consider the probabilistic temporal logics PATL and PATL* under natural strategies (NatPATL and NatPATL*). As our main result, we show that, in stochastic MAS, NatPATL model-checking is NP-complete when the active coalition is restricted to deterministic strategies. We also give a 2NEXPTIME complexity result for NatPATL* with the same restriction. In the unrestricted case, we give an EXPSPACE complexity result for NatPATL and a 3EXPSPACE result for NatPATL*.



Paperid:1920
Authors:Raven Beutner, Bernd Finkbeiner
CISPA Helmholtz Center for Information Security, CISPA Helmholtz Center for Information Security
Abstract:
Alternating-time temporal logic (ATL*) is a well-established framework for formal reasoning about multi-agent systems. However, while ATL* can reason about the strategic ability of agents (e.g., some coalition A can ensure that a goal is reached eventually), we cannot compare multiple strategic interactions, nor can we require multiple agents to follow the same strategy. For example, we cannot state that coalition A can reach a goal sooner (or more often) than some other coalition A'. In this paper, we propose HyperATL*_S, an extension of ATL* in which we can (1) compare the outcome of multiple strategic interactions w.r.t. a hyperproperty, i.e., a property that refers to multiple paths at the same time, and (2) enforce that some agents share the same strategy. We show that HyperATL*_S is a rich specification language that captures important AI-related properties that were out of reach of existing logics. We prove that model checking of HyperATL*_S on concurrent game structures is decidable. We implement our model-checking algorithm in a tool we call HyMASMC and evaluate it on a range of benchmarks.



Paperid:1921
Authors:Jingdi Chen, Tian Lan, Carlee Joe-Wong
The George Washington University, The George Washington University, Carnegie Mellon University
Abstract:
Communication is crucial for solving cooperative Multi-Agent Reinforcement Learning tasks in partially observable Markov Decision Processes. Existing works often rely on black-box methods to encode local information/features into messages shared with other agents, leading to the generation of continuous messages with high communication overhead and poor interpretability. Prior attempts at discrete communication methods generate one-hot vectors trained as part of agents' actions and use the Gumbel softmax operation for calculating message gradients, which are all heuristic designs that do not provide any quantitative guarantees on the expected return. This paper establishes an upper bound on the return gap between an ideal policy with full observability and an optimal partially observable policy with discrete communication. This result enables us to recast multi-agent communication as a novel online clustering problem over the local observations at each agent, with messages as cluster labels and the upper bound on the return gap as the clustering loss. To minimize the return gap, we propose the Return-Gap-Minimization Communication (RGMComm) algorithm, a surprisingly simple design of discrete message generation functions, integrated with reinforcement learning through a novel Regularized Information Maximization loss function that uses cosine distance as the clustering metric. Evaluations show that RGMComm significantly outperforms state-of-the-art multi-agent communication baselines and can achieve nearly optimal returns with few-bit messages that are naturally interpretable.



Paperid:1922
Authors:Sirui Chen, Zhaowei Zhang, Yaodong Yang, Yali Du
Renmin University of China, Institute for Artificial Intelligence, Peking University National Key Laboratory of General Artificial Intelligence, BIGAI, Institute for Artificial Intelligence, Peking University, King's College London
Abstract:
Centralized Training with Decentralized Execution (CTDE) has been proven to be an effective paradigm in cooperative multi-agent reinforcement learning (MARL). One of the major challenges is credit assignment, which aims to credit agents by their contributions. Existing methods, however, lack the ability to model the complicated temporal relations of the delayed global reward and suffer from inefficiencies. To tackle this, we introduce Spatial-Temporal Attention with Shapley (STAS), a novel method that learns credit assignment in both the temporal and spatial dimensions. It first decomposes the global return back to each time step, then utilizes the Shapley value to redistribute the individual payoff from the decomposed global reward. To mitigate the computational complexity of the Shapley value, we introduce an approximation of the marginal contribution and utilize Monte Carlo sampling to estimate it. We evaluate our method on an Alice & Bob example and MPE environments across different scenarios. Our results demonstrate that our method effectively assigns spatial-temporal credit, outperforming all state-of-the-art baselines.
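
The Monte Carlo idea mentioned above is the standard permutation-sampling estimator of Shapley values. The generic sketch below (coalition value function and names assumed, independent of STAS's learned decomposition) averages marginal contributions over random agent orderings.

```python
import numpy as np

def mc_shapley(value_fn, n_agents, n_samples=1000, rng=None):
    """Monte Carlo estimate of per-agent Shapley values.

    value_fn: callable mapping a frozenset of agent indices to a coalition value.
    """
    rng = rng or np.random.default_rng()
    phi = np.zeros(n_agents)
    for _ in range(n_samples):
        perm = rng.permutation(n_agents)
        coalition = set()
        prev_value = value_fn(frozenset(coalition))
        for agent in perm:
            coalition.add(int(agent))
            new_value = value_fn(frozenset(coalition))
            phi[agent] += new_value - prev_value    # marginal contribution
            prev_value = new_value
    return phi / n_samples
```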



Paperid:1923
Authors:Shifei Ding, Wei Du, Ling Ding, Lili Guo, Jian Zhang
School of Computer Science and Technology, China University of Mining and Technology Mine Digitization Engineering Research Center of Ministry of Education of the People’s Republic of China, School of Computer Science and Technology, China University of Mining and Technology, College of Intelligence and Computing, Tianjin University, School of Computer Science and Technology, China University of Mining and Technology Mine Digitization Engineering Research Center of Ministry of Education of the People’s Republic of China, School of Computer Science and Technology, China University of Mining and Technology Mine Digitization Engineering Research Center of Ministry of Education of the People’s Republic of China
Abstract:
Efficient communication learning among agents has been shown to be crucial for cooperative multi-agent reinforcement learning (MARL), as it can promote the action coordination of agents and ultimately improve performance. Graph neural networks (GNNs) provide a general paradigm for communication learning, which considers agents and communication channels as nodes and edges in a graph, with action selection corresponding to node labeling. Under such a paradigm, an agent aggregates information from neighbor agents, which can reduce uncertainty in local decision-making and induce implicit action coordination. However, this communication paradigm is vulnerable to adversarial attacks and noise, and how to learn robust and efficient communication under perturbations has largely not been studied. To this end, this paper introduces a novel Multi-Agent communication mechanism via Graph Information bottleneck (MAGI), which can optimally balance the robustness and expressiveness of the message representation learned by agents. This communication mechanism aims at learning the minimal sufficient message representation for an agent by maximizing the mutual information (MI) between the message representation and the selected action, while simultaneously constraining the MI between the message representation and the agent feature. Empirical results demonstrate that MAGI is more robust and efficient than state-of-the-art GNN-based MARL methods.



Paperid:1924
Authors:Wei Du, Shifei Ding, Lili Guo, Jian Zhang, Ling Ding
School of Computer Science and Technology, China University of Mining and Technology, School of Computer Science and Technology, China University of Mining and Technology Mine Digitization Engineering Research Center of Ministry of Education of the People’s Republic of China, School of Computer Science and Technology, China University of Mining and Technology Mine Digitization Engineering Research Center of Ministry of Education of the People’s Republic of China, School of Computer Science and Technology, China University of Mining and Technology Mine Digitization Engineering Research Center of Ministry of Education of the People’s Republic of China, College of Intelligence and Computing, Tianjin University
Abstract:
Information sharing through communication is essential for tackling complex multi-agent reinforcement learning tasks. Many existing multi-agent communication protocols can be viewed as instances of message passing graph neural networks (GNNs). However, due to the significantly limited expressive ability of the standard GNN method, the agent feature representations remain similar and indistinguishable even though the agents have different neighborhood structures. This further results in the homogenization of agent behaviors and reduces the capability to solve tasks effectively. In this paper, we propose a multi-agent communication protocol via identity-aware learning (IDEAL), which explicitly enhances the distinguishability of agent feature representations to break the diversity bottleneck. Specifically, IDEAL extends existing multi-agent communication protocols by inductively considering the agents' identities during the message passing process. To obtain expressive feature representations for a given agent, IDEAL first extracts the ego network centered around that agent and then performs multiple rounds of heterogeneous message passing, where different parameter sets are applied to the central agent and the other surrounding agents within the ego network. IDEAL fosters expressive communication between agents and generates distinguishable feature representations, which promotes action diversity and individuality emergence. Experimental results on various benchmarks demonstrate IDEAL can be flexibly integrated into various multi-agent communication methods and enhances the corresponding performance.



Paperid:1925
Authors:Xiao Du, Yutong Ye, Pengyu Zhang, Yaning Yang, Mingsong Chen, Ting Wang
Software Engineering Institute, East China Normal University, Software Engineering Institute, East China Normal University, Software Engineering Institute, East China Normal University, Software Engineering Institute, East China Normal University, Software Engineering Institute, East China Normal University, Software Engineering Institute, East China Normal University
Abstract:
Learning to collaborate has witnessed significant progress in multi-agent reinforcement learning (MARL). However, promoting coordination among agents and enhancing exploration capabilities remain challenges. In multi-agent environments, interactions between agents are limited in specific situations. Effective collaboration between agents thus requires a nuanced understanding of when and how agents' actions influence others. To this end, in this paper, we propose a novel MARL algorithm named Situation-Dependent Causal Influence-Based Cooperative Multi-agent Reinforcement Learning (SCIC), which incorporates a novel intrinsic reward mechanism based on a new cooperation criterion measured by situation-dependent causal influence among agents. Our approach aims to detect inter-agent causal influences in specific situations based on this criterion using causal intervention and conditional mutual information. This effectively assists agents in exploring states that can positively impact other agents, thus promoting cooperation between agents. The resulting update links coordinated exploration and intrinsic reward distribution, which enhances overall collaboration and performance. Experimental results on various MARL benchmarks demonstrate the superiority of our method compared to state-of-the-art approaches.



Paperid:1926
Authors:Yicheng Feng, Boshi An, Zongqing Lu
School of Computer Science, Peking University, School of Computer Science, Peking University, School of Computer Science, Peking University
Abstract:
The study of emergent communication has been dedicated to interactive artificial intelligence. While existing work focuses on communication about single objects or complex image scenes, we argue that communicating relationships between multiple objects is important in more realistic tasks, but understudied. In this paper, we try to fill this gap and focus on emergent communication about positional relationships between two objects. We train agents in the referential game where observations contain two objects, and find that generalization is the major problem when the positional relationship is involved. The key factor affecting the generalization ability of the emergent language is the input variation between Speaker and Listener, which is realized by a random image generator in our work. Further, we find that the learned language can generalize well in a new multi-step MDP task where the positional relationship describes the goal, and performs better than raw-pixel images as well as pre-trained image features, verifying the strong generalization ability of discrete sequences. We also show that language transfer from the referential game performs better in the new task than learning language directly in this task, implying the potential benefits of pre-training in referential games. All in all, our experiments demonstrate the viability and merit of having agents learn to communicate positional relationships between multiple objects through emergent communication.



Paperid:1927
Authors:Foivos Fioravantes, Dušan Knop, Jan Matyáš Křištan, Nikolaos Melissinos, Michal Opler
Czech Technical University in Prague, Czech Technical University in Prague, Czech Technical University in Prague, Czech Technical University in Prague, Czech Technical University in Prague
Abstract:
In the Multi-Agent Path Finding (MAPF for short) problem, we focus on efficiently finding non-colliding paths for a set of k agents on a given graph G, where each agent seeks a path from its source vertex to a target. An important measure of the quality of the solution is the length of the proposed schedule l, that is, the length of a longest path (including the waiting time). In this work, we propose a systematic study under the parameterized complexity framework. The hardness results we provide align with many heuristics used for this problem, whose running time could potentially be improved based on our Fixed-Parameter Tractability (FPT) results. We show that MAPF is W[1]-hard with respect to k (even if k is combined with the maximum degree of the input graph). The problem remains NP-hard in planar graphs even if the maximum degree and the makespan l are fixed constants. On the positive side, we show an FPT algorithm for k+l. As we continue, the structure of G comes into play. We give an FPT algorithm for parameter k plus the diameter of the graph G. The MAPF problem is W[1]-hard for clique-width of G plus l while it is FPT for treewidth of G plus l.



Paperid:1928
Authors:Tobias Friedrich, Andreas Göbel, Nicolas Klodt, Martin S. Krejca, Marcus Pappik
Hasso Plattner Institute, Hasso Plattner Institute, Hasso Plattner Institute, LIX, CNRS, Ecole Polytechnique, IPP, Hasso Plattner Institute
Abstract:
Information diffusion models on networks are at the forefront of AI research. The dynamics of such models typically follow stochastic models from epidemiology, used to model not only infections but also various phenomena, including the behavior of computer viruses and viral marketing campaigns. A core question in this setting is how to efficiently detect the most influential vertices in the host graph such that the infection survives the longest. In processes that incorporate reinfection of the vertices, such as the SIS process, theoretical studies identify parameter thresholds where the survival time of the process rapidly transitions from logarithmic to super-polynomial. These results contradict the intuition that the starting configuration is relevant, since the process will always either die out fast or survive almost indefinitely. A shortcoming of these results is that models incorporating short-term immunity (or creative advertisement fatigue) have not been subjected to such a theoretical analysis so far. We reduce this gap in the literature by studying the SIRS process, a more realistic model, which besides re-infection additionally incorporates short-term immunity. On complex network models, we identify parameter regimes for which the process survives exponentially long, and we get a tight threshold for random graphs. Underlying these results is our main technical contribution, showing a threshold behavior for the survival time of the SIRS process on graphs with large expander subgraphs, such as social network models.



Paperid:1929
Authors:Yuma Fujimoto, Kaito Ariu, Kenshi Abe
SOKENDAI The University of Tokyo CyberAgent, CyberAgent KTH, CyberAgent The University of Electro-Communications
Abstract:
Learning in games considers how multiple agents maximize their own rewards through repeated games. Memory, the ability of an agent to change its action depending on the history of actions in previous games, is often introduced into learning to explore more clever strategies and discuss the decision-making of real agents like humans. However, such games with memory are hard to analyze because they exhibit complex phenomena like chaotic dynamics or divergence from Nash equilibrium. In particular, how asymmetry in memory capacities between agents affects learning in games is still unclear. In response, this study formulates a gradient ascent algorithm in games with asymmetric memory capacities. To obtain theoretical insights into the learning dynamics, we first consider a simple case of zero-sum games. We observe complex behavior, where the learning dynamics draw a heteroclinic connection from unstable fixed points to stable ones. Despite this complexity, we analyze the learning dynamics and prove local convergence to these stable fixed points, i.e., the Nash equilibria. We identify the mechanism driving this convergence: an agent with a longer memory learns to exploit the other, which in turn endows the other's utility function with strict concavity. We further numerically observe such convergence under various initial strategies, action numbers, and memory lengths. This study reveals a novel phenomenon due to memory asymmetry, providing fundamental strides in learning in games and new insights into computing equilibria.



Paperid:1930
Authors:Maris F. L. Galesloot, Thiago D. Simão, Sebastian Junges, Nils Jansen
Radboud University Nijmegen, The Netherlands, Eindhoven University of Technology, The Netherlands, Radboud University Nijmegen, The Netherlands, Radboud University Nijmegen, The Netherlands Ruhr-University Bochum, Germany
Abstract:
In centralized multi-agent systems, often modeled as multi-agent partially observable Markov decision processes (MPOMDPs), the action and observation spaces grow exponentially with the number of agents, making the value and belief estimation of single-agent online planning ineffective. Prior work partially tackles value estimation by exploiting the inherent structure of multi-agent settings via so-called coordination graphs. Additionally, belief estimation methods have been improved by incorporating the likelihood of observations into the approximation. However, the challenges of value estimation and belief estimation have only been tackled individually, which prevents existing methods from scaling to settings with many agents. Therefore, we address these challenges simultaneously. First, we introduce weighted particle filtering to a sample-based online planner for MPOMDPs. Second, we present a scalable approximation of the belief. Third, we introduce an approach that exploits the typical locality of agent interactions in novel online planning algorithms for MPOMDPs operating on a so-called sparse particle filter tree. Our experimental evaluation against several state-of-the-art baselines shows that our methods (1) are competitive in settings with only a few agents and (2) improve over the baselines in the presence of many agents.
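
To ground the first contribution, a weighted particle filter updates a belief by propagating sampled states through the transition model and reweighting each particle by the likelihood of the received observation. The sketch below is a generic single update step with assumed callables; the paper embeds this idea in a sparse particle filter tree rather than a flat filter.

```python
import numpy as np

def particle_filter_step(particles, weights, action, observation,
                         transition_fn, obs_likelihood, rng=None):
    """One weighted particle-filter belief update (generic sketch).

    transition_fn(state, action, rng) -> next_state   (samples the dynamics)
    obs_likelihood(observation, next_state, action) -> float
    """
    rng = rng or np.random.default_rng()
    new_particles = [transition_fn(s, action, rng) for s in particles]
    new_weights = np.array([w * obs_likelihood(observation, s, action)
                            for w, s in zip(weights, new_particles)])
    total = new_weights.sum()
    if total == 0.0:                      # degenerate case: fall back to uniform
        new_weights = np.full(len(new_particles), 1.0 / len(new_particles))
    else:
        new_weights = new_weights / total
    return new_particles, new_weights
```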



Paperid:1931
Authors:Luca Geatti, Marco Montali, Andrey Rivkin
University of Udine, Free University of Bozen-Bolzano, Italy, Technical University of Denmark
Abstract:
Given a specification of Linear-time Temporal Logic interpreted over finite traces (LTLf), the reactive synthesis problem asks to find a finitely-representable, terminating controller that reacts to the uncontrollable actions of an environment in order to enforce a desired system specification. In this paper we study, for the first time, the foundations of reactive synthesis for DECLARE, a well-established declarative, pattern-based business process modelling language grounded in LTLf. We provide a threefold contribution. First, we define a reactive synthesis problem for DECLARE. Second, we show how an arbitrary DECLARE specification can be polynomially encoded into an equivalent pure-past one in LTLf, and exploit this to define an EXPTIME algorithm for DECLARE synthesis. Third, we derive a symbolic version of this algorithm, by introducing a novel translation of pure-past temporal formulas into symbolic deterministic finite automata.



Paperid:1932
Authors:Minbiao Han, Michael Albert, Haifeng Xu
Department of Computer Science, The University of Chicago, Darden Business School, University of Virginia, Department of Computer Science, The University of Chicago
Abstract:
We study a ubiquitous learning challenge in online principal-agent problems, in which the principal learns the agent's private information from the agent's revealed preferences in historical interactions. This paradigm includes important special cases such as pricing and contract design, which have been widely studied in recent literature. However, existing work considers the case where the principal can only choose a single strategy at every round to interact with the agent and then observe the agent's revealed preference through their actions. In this paper, we extend this line of study to allow the principal to offer a menu of strategies to the agent and learn additionally from observing the agent's selection from the menu. We provide a thorough investigation of several online principal-agent problem settings and characterize their sample complexities, accompanied by the corresponding algorithms we have developed. We instantiate this paradigm to several important design problems — including Stackelberg (security) games, contract design, and information design. Finally, we also explore the connection between our findings and existing results about online learning in Stackelberg games, and we offer a solution that can overcome a key hard instance of previous work.



Paperid:1933
Authors:Aamal Hussain, Francesco Belardinelli
Imperial College London, Imperial College London
Abstract:
The behaviour of multi-agent learning in competitive network games is often studied within the context of zero-sum games, in which convergence guarantees may be obtained. However, outside of this class the behaviour of learning is known to display complex behaviours and convergence cannot always be guaranteed. Nonetheless, in order to develop a complete picture of the behaviour of multi-agent learning in competitive settings, the zero-sum assumption must be lifted. Motivated by this, we study the Q-Learning dynamics, a popular model of exploration and exploitation in multi-agent learning, in competitive network games. We determine how the degree of competition, exploration rate and network connectivity impact the convergence of Q-Learning. To study generic competitive games, we parameterise network games in terms of correlations between agent payoffs and study the average behaviour of the Q-Learning dynamics across all games drawn from a choice of this parameter. This statistical approach establishes choices of parameters for which Q-Learning dynamics converge to a stable fixed point. In contrast to previous works, we find that the stability of Q-Learning is explicitly dependent only on the network connectivity rather than the total number of agents. Our experiments validate these findings and show that, under certain network structures, the total number of agents can be increased without increasing the likelihood of unstable or chaotic behaviours.
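The kind of dynamics analysed here can be simulated with a few lines of code. The sketch below is a generic discrete-time smoothed (Boltzmann) Q-Learning update in a two-player matrix game; the payoff matrices, temperature, and learning rate are arbitrary illustrative values, not the paper's parameterisation.

```python
# Illustrative smoothed Q-Learning dynamics in a two-player matrix game.
import numpy as np

def softmax(q, temperature):
    z = np.exp(q / temperature - np.max(q / temperature))
    return z / z.sum()

def q_learning_dynamics(A, B, temperature=0.5, lr=0.05, steps=5000, seed=0):
    """A: payoff matrix of agent 1, B: payoff matrix of agent 2 (rows = own actions)."""
    rng = np.random.default_rng(seed)
    q1, q2 = rng.normal(size=A.shape[0]), rng.normal(size=B.shape[0])
    for _ in range(steps):
        x, y = softmax(q1, temperature), softmax(q2, temperature)
        # Expected payoff of each own action against the opponent's mixed strategy.
        q1 += lr * (A @ y - q1)
        q2 += lr * (B @ x - q2)
    return softmax(q1, temperature), softmax(q2, temperature)

# Example: a 2x2 partially competitive game (payoffs chosen arbitrarily).
A = np.array([[1.0, -0.5], [-0.5, 0.5]])
B = np.array([[0.5, -0.5], [-0.5, 1.0]])
print(q_learning_dynamics(A, B))
```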



Paperid:1934
Authors:Haobin Jiang, Ziluo Ding, Zongqing Lu
Peking University, Peking University Beijing Academy of Artificial Intelligence, Peking University
Abstract:
Exploration in decentralized cooperative multi-agent reinforcement learning faces two challenges. One is that the novelty of global states is unavailable, while the novelty of local observations is biased. The other is how agents can explore in a coordinated way. To address these challenges, we propose MACE, a simple yet effective multi-agent coordinated exploration method. By communicating only local novelty, agents can take into account other agents' local novelty to approximate the global novelty. Further, we introduce weighted mutual information to measure the influence of one agent's action on other agents' accumulated novelty. We convert it into an intrinsic reward in hindsight to encourage agents to exert more influence on other agents' exploration and boost coordinated exploration. Empirically, we show that MACE achieves superior performance in three multi-agent environments with sparse rewards.



Paperid:1935
Authors:Chao Li, Yupeng Zhang, Jianqi Wang, Yujing Hu, Shaokang Dong, Wenbin Li, Tangjie Lv, Changjie Fan, Yang Gao
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China, Alibaba DAMO Academy, Hangzhou, China, Meituan, Beijing, China, NetEase Fuxi AI Lab, Hangzhou, China, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China, NetEase Fuxi AI Lab, Hangzhou, China, NetEase Fuxi AI Lab, Hangzhou, China, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Abstract:
In cooperative multi-agent reinforcement learning, decentralized agents hold the promise of overcoming the combinatorial explosion of joint action space and enabling greater scalability. However, they are susceptible to a game-theoretic pathology called relative overgeneralization (RO) that shadows the optimal joint action. Although recent value-decomposition algorithms guide decentralized agents by learning a factored global action value function, the representational limitation and the inaccurate sampling of optimal joint actions during the learning process make this problem persist. To address this limitation, this paper proposes a novel algorithm called Optimistic Value Instructors (OVI). The main idea behind OVI is to introduce multiple optimistic instructors into the value-decomposition paradigm, which are capable of suggesting potentially optimal joint actions and rectifying the factored global action value function to recover these optimal actions. Specifically, the instructors maintain optimistic value estimations of per-agent local actions and thus eliminate the negative effects caused by other agents' exploratory or sub-optimal non-cooperation, enabling accurate identification and suggestion of optimal joint actions. Based on the instructors' suggestions, the paper further presents two instructive constraints to rectify the factored global action value function to recover these optimal joint actions, thus overcoming the RO problem. Experimental evaluation of OVI on various cooperative multi-agent tasks demonstrates its superior performance against multiple baselines, highlighting its effectiveness.



Paperid:1936
Authors:Huiqun Li, Hanhan Zhou, Yifei Zou, Dongxiao Yu, Tian Lan
Shandong University, The George Washington University, Shandong University, Shandong University, The George Washington University
Abstract:
Value function factorization has achieved great success in multi-agent reinforcement learning by optimizing joint action-value functions through the maximization of factorized per-agent utilities. To ensure the Individual-Global-Maximum property, existing works often focus on value factorization using monotonic functions, which are known to result in restricted representation expressiveness. In this paper, we analyze the limitations of monotonic factorization and present ConcaveQ, a novel non-monotonic value function factorization approach that goes beyond monotonic mixing functions and employs neural network representations of concave mixing functions. Leveraging the concave property in factorization, an iterative action selection scheme is developed to obtain optimal joint actions during training. It is used to update agents’ local policy networks, enabling fully decentralized execution. The effectiveness of the proposed ConcaveQ is validated across scenarios involving a multi-agent predator-prey environment and StarCraft II micromanagement tasks. Empirical results exhibit significant improvement of ConcaveQ over state-of-the-art multi-agent reinforcement learning approaches.
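One simple way to see what a "concave mixing function" can look like as a neural network is sketched below; it is a generic parameterization (affine layer, concave non-decreasing activation, non-negative outer weights), not the ConcaveQ architecture itself, and the layer sizes are illustrative assumptions.

```python
# A mixing network that is provably concave in the per-agent utilities:
# affine layer -> concave non-decreasing activation -> non-negative weighted sum + affine skip.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcaveMixer(nn.Module):
    def __init__(self, n_agents, hidden=32):
        super().__init__()
        self.lin1 = nn.Linear(n_agents, hidden)      # affine: both convex and concave
        self.w2 = nn.Parameter(torch.rand(hidden))   # constrained non-negative below
        self.skip = nn.Linear(n_agents, 1)           # affine skip term preserves concavity
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, q_locals):                     # q_locals: (batch, n_agents)
        h = -F.softplus(-self.lin1(q_locals))        # concave, non-decreasing activation
        out = h @ torch.abs(self.w2) + self.skip(q_locals).squeeze(-1) + self.b2
        return out                                   # (batch,) joint action value

mixer = ConcaveMixer(n_agents=3)
print(mixer(torch.randn(4, 3)).shape)  # torch.Size([4])
```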



Paperid:1937
Authors:Pengdeng Li, Runsheng Yu, Xinrun Wang, Bo An
School of Computer Science and Engineering, Nanyang Technological University, Singapore, Hong Kong University of Science and Technology, Hong Kong, China, School of Computer Science and Engineering, Nanyang Technological University, Singapore, School of Computer Science and Engineering, Nanyang Technological University, Singapore
Abstract:
Many real-world scenarios, including fleet management and ad auctions, can be modeled as Stackelberg mean-field games (SMFGs) where a leader aims to incentivize a large number of homogeneous self-interested followers to maximize her utility. Existing works focus on cases with a small number of heterogeneous followers, e.g., 5-10, and suffer from scalability issues when the number of followers increases. There are three major challenges in solving large-scale SMFGs: i) classical methods based on solving differential equations fail as they require exact dynamics parameters, ii) learning by interacting with environment is data-inefficient, and iii) complex interaction between the leader and followers makes the learning performance unstable. We address these challenges through transition-informed reinforcement learning. Our main contributions are threefold: i) we first propose an RL framework, the Stackelberg mean-field update, to learn the leader's policy without priors of the environment, ii) to improve the data efficiency and accelerate the learning process, we then propose the Transition-Informed Reinforcement Learning (TIRL) by leveraging the instantiated empirical Fokker-Planck equation, and iii) we develop a regularized TIRL by employing various regularizers to alleviate the sensitivity of the learning performance to the initialization of the leader's policy. Extensive experiments on fleet management and food gathering demonstrate that our approach can scale up to 100,000 followers and significantly outperform existing baselines.



Paperid:1938
Authors:Zhenwei Lin, Jingfan Xia, Qi Deng, Luo Luo
Shanghai University of Finance and Economics, Shanghai University of Finance and Economics, Shanghai University of Finance and Economics, Fudan University
Abstract:
We consider decentralized gradient-free optimization for minimizing Lipschitz continuous functions that satisfy neither smoothness nor convexity assumptions. We propose two novel gradient-free algorithms, the Decentralized Gradient-Free Method (DGFM) and its variant, the Decentralized Gradient-Free Method+ (DGFM+). Based on the techniques of randomized smoothing and gradient tracking, DGFM requires the computation of the zeroth-order oracle of a single sample in each iteration, making it less demanding in terms of computational resources for individual computing nodes. Theoretically, DGFM achieves a complexity of O(d^(3/2)δ^(-1)ε^(-4)) for obtaining a (δ,ε)-Goldstein stationary point. DGFM+, an advanced version of DGFM, incorporates variance reduction to further improve the convergence behavior. It samples a mini-batch at each iteration and periodically draws a larger batch of data, which improves the complexity to O(d^(3/2)δ^(-1)ε^(-3)). Moreover, experimental results underscore the empirical advantages of our proposed algorithms when applied to real-world datasets.
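To make the two building blocks concrete, here is a hedged sketch combining a two-point randomized-smoothing gradient estimate with gradient tracking over a doubly-stochastic mixing matrix. It is a simplified illustration of those generic techniques, not the DGFM algorithm or its step-size schedule; the local objectives and mixing matrix in the toy example are assumptions.

```python
# Decentralized zeroth-order optimization: randomized smoothing + gradient tracking.
import numpy as np

def zo_grad(f, x, delta, rng):
    """Two-point randomized-smoothing gradient estimate of f at x."""
    u = rng.normal(size=x.shape)
    u /= np.linalg.norm(u)
    return (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * len(x) * u

def decentralized_zo_gradient_tracking(fs, W, x0, steps=200, lr=0.05, delta=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    n = len(fs)                              # number of nodes
    x = np.tile(x0, (n, 1))                  # local iterates, one row per node
    g = np.array([zo_grad(fs[i], x[i], delta, rng) for i in range(n)])
    y = g.copy()                             # gradient trackers
    for _ in range(steps):
        x = W @ x - lr * y                   # consensus step + local descent
        g_new = np.array([zo_grad(fs[i], x[i], delta, rng) for i in range(n)])
        y = W @ y + g_new - g                # track the average gradient estimate
        g = g_new
    return x.mean(axis=0)

# Toy example: three nodes with quadratic local objectives and a simple mixing matrix.
fs = [lambda z, c=c: np.sum((z - c) ** 2) for c in (0.0, 1.0, 2.0)]
W = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
print(decentralized_zo_gradient_tracking(fs, W, x0=np.zeros(2)))
```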



Paperid:1939
Authors:Zeyang Liu, Lipeng Wan, Xinrui Yang, Zhuoran Chen, Xingyu Chen, Xuguang Lan
Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often face challenges in obtaining specific joint action sequences to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state that can influence each other's transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at the critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. Particularly, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models.



Paperid:1940
Authors:Xingzhou Lou, Junge Zhang, Timothy J. Norman, Kaiqi Huang, Yali Du
School of Artificial Intelligence, University of Chinese Academy of Sciences Institute of Automation, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences Institute of Automation, Chinese Academy of Sciences, University of Southampton, School of Artificial Intelligence, University of Chinese Academy of Sciences Institute of Automation, Chinese Academy of Sciences, King's College London
Abstract:
Multi-Agent Policy Gradient (MAPG) has made significant progress in recent years. However, centralized critics in state-of-the-art MAPG methods still face the centralized-decentralized mismatch (CDM) issue, which means sub-optimal actions by some agents will affect other agents' policy learning. While using individual critics for policy updates can avoid this issue, they severely limit cooperation among agents. To address this issue, we propose an agent topology framework, which decides whether other agents should be considered in the policy gradient and achieves a compromise between facilitating cooperation and alleviating the CDM issue. The agent topology allows agents to use coalition utility as the learning objective instead of the global utility from centralized critics or the local utility from individual critics. To constitute the agent topology, various models are studied. We propose Topology-based multi-Agent Policy gradiEnt (TAPE) for both stochastic and deterministic MAPG methods. We prove the policy improvement theorem for stochastic TAPE and give a theoretical explanation for the improved cooperation among agents. Experimental results on several benchmarks show the agent topology is able to facilitate agent cooperation and alleviate the CDM issue, respectively, improving the performance of TAPE. Finally, multiple ablation studies and a heuristic graph search algorithm are devised to show the efficacy of the agent topology.



Paperid:1941
Authors:Xiangrui Meng, Ying Tan
School of Intelligence Science and Technology, Peking University Key Laboratory of Machine Perception (MOE), Peking University, School of Intelligence Science and Technology, Peking University Key Laboratory of Machine Perception (MOE), Peking University Institute for Artificial Intelligence, Peking University National Key Laboratory of General Artificial Intelligence
Abstract:
Communication plays a crucial role in information sharing within the field of multi-agent reinforcement learning (MARL). However, how to transmit information that meets individual needs remains a long-standing challenge. Some existing works focus on using a common channel for information transfer, which limits the capability for local communication. Meanwhile, other works attempt to establish peer-to-peer communication topologies but suffer from quadratic complexity. In this paper, we propose Personalized Multi-Agent Communication (PMAC), which enables the formation of peer-to-peer communication topologies, personalized message sending, and personalized message receiving. All these modules in PMAC are performed using only multilayer perceptrons (MLPs) with linear computational complexity. Empirically, we show the strength of personalized communication in a variety of cooperative scenarios. Our approach exhibits competitive performance compared to existing methods while maintaining notable computational efficiency.



Paperid:1942
Authors:Thomy Phan, Taoan Huang, Bistra Dilkina, Sven Koenig
University of Southern California, University of Southern California, University of Southern California, University of Southern California
Abstract:
Anytime multi-agent path finding (MAPF) is a promising approach to scalable path optimization in large-scale multi-agent systems. State-of-the-art anytime MAPF is based on Large Neighborhood Search (LNS), where a fast initial solution is iteratively optimized by destroying and repairing a fixed number of parts, i.e., the neighborhood of the solution, using randomized destroy heuristics and prioritized planning. Despite their recent success in various MAPF instances, current LNS-based approaches lack exploration and flexibility due to greedy optimization with a fixed neighborhood size, which can lead to low-quality solutions in general. So far, these limitations have been addressed with extensive prior effort in tuning or offline machine learning beyond actual planning. In this paper, we focus on online learning in LNS and propose Bandit-based Adaptive LArge Neighborhood search Combined with Exploration (BALANCE). BALANCE uses a bi-level multi-armed bandit scheme to adapt the selection of destroy heuristics and neighborhood sizes on the fly during search. We evaluate BALANCE on multiple maps from the MAPF benchmark set and empirically demonstrate performance improvements of at least 50% compared to state-of-the-art anytime MAPF in large-scale scenarios. We find that Thompson Sampling performs particularly well compared to alternative multi-armed bandit algorithms.
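The bandit machinery involved is lightweight; the sketch below shows a Beta-Bernoulli Thompson-sampling bandit used in a bi-level way to pick a destroy heuristic and then a neighborhood size. It is purely illustrative: the arm sets, the Bernoulli "did the repair improve the solution" reward, and the stand-in improvement signal are assumptions, not BALANCE's actual configuration.

```python
# Thompson sampling over destroy heuristics and neighborhood sizes (illustrative).
import numpy as np

class ThompsonBandit:
    def __init__(self, arms, seed=0):
        self.arms = list(arms)
        self.alpha = np.ones(len(self.arms))   # Beta posterior: successes + 1
        self.beta = np.ones(len(self.arms))    # Beta posterior: failures + 1
        self.rng = np.random.default_rng(seed)

    def select(self):
        samples = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm_idx, improved):
        self.alpha[arm_idx] += 1.0 if improved else 0.0
        self.beta[arm_idx] += 0.0 if improved else 1.0

# Bi-level selection: outer bandit over destroy heuristics, inner bandits over sizes.
heuristics = ["random", "agent-based", "map-based"]
sizes = [4, 8, 16, 32]
outer = ThompsonBandit(heuristics)
inner = {h: ThompsonBandit(sizes, seed=i) for i, h in enumerate(heuristics)}

for it in range(100):                      # LNS iterations (destroy/repair omitted)
    h_idx = outer.select()
    s_idx = inner[heuristics[h_idx]].select()
    improved = np.random.default_rng(it).random() < 0.3   # stand-in for a cost decrease
    outer.update(h_idx, improved)
    inner[heuristics[h_idx]].update(s_idx, improved)
```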



Paperid:1943
Authors:Muhammad Rahman, Jiaxun Cui, Peter Stone
The University of Texas at Austin, The University of Texas at Austin, The University of Texas at Austin Sony AI
Abstract:
Robustly cooperating with unseen agents and human partners presents significant challenges due to the diverse cooperative conventions these partners may adopt. Existing Ad Hoc Teamwork (AHT) methods address this challenge by training an agent with a population of diverse teammate policies obtained through maximizing specific diversity metrics. However, prior heuristic-based diversity metrics do not always maximize the agent's robustness in all cooperative problems. In this work, we first propose that maximizing an AHT agent's robustness requires it to emulate policies in the minimum coverage set (MCS), the set of best-response policies to any partner policies in the environment. We then introduce the L-BRDiv algorithm that generates a set of teammate policies that, when used for AHT training, encourage agents to emulate policies from the MCS. L-BRDiv works by solving a constrained optimization problem to jointly train teammate policies for AHT training and approximating AHT agent policies that are members of the MCS. We empirically demonstrate that L-BRDiv produces more robust AHT agents than state-of-the-art methods in a broader range of two-player cooperative problems without the need for extensive hyperparameter tuning for its objectives. Our study shows that L-BRDiv outperforms the baseline methods by prioritizing discovering distinct members of the MCS instead of repeatedly finding redundant policies.



Paperid:1944
Authors:Alexey Skrynnik, Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov
AIRI, Moscow, Russia Federal Research Center for Computer Science and Control of Russian Academy of Sciences, Moscow, Russia, AIRI, Moscow, Russia, AIRI, Moscow, Russia Federal Research Center for Computer Science and Control of Russian Academy of Sciences, Moscow, Russia, AIRI, Moscow, Russia Federal Research Center for Computer Science and Control of Russian Academy of Sciences, Moscow, Russia
Abstract:
The Multi-Agent Pathfinding (MAPF) problem involves finding a set of conflict-free paths for a group of agents confined to a graph. In typical MAPF scenarios, the graph and the agents' starting and ending vertices are known beforehand, allowing the use of centralized planning algorithms. However, in this study, we focus on the decentralized MAPF setting, where the agents may observe the other agents only locally and are restricted in communications with each other. Specifically, we investigate the lifelong variant of MAPF, where new goals are continually assigned to the agents upon completion of previous ones. Drawing inspiration from the successful AlphaZero approach, we propose a decentralized multi-agent Monte Carlo Tree Search (MCTS) method for MAPF tasks. Our approach utilizes the agent's observations to recreate the intrinsic Markov decision process, which is then used for planning with a version of neural MCTS tailored for multi-agent tasks. The experimental results show that our approach outperforms state-of-the-art learnable MAPF solvers. The source code is available at https://github.com/AIRI-Institute/mats-lp.



Paperid:1945
Authors:Alexey Skrynnik, Anton Andreychuk, Maria Nesterova, Konstantin Yakovlev, Aleksandr Panov
AIRI, Moscow, Russia Federal Research Center for Computer Science and Control of Russian Academy of Sciences, Moscow, Russia, AIRI, Moscow, Russia, Federal Research Center for Computer Science and Control of Russian Academy of Sciences, Moscow, Russia MIPT, Dolgoprudny, Russia, Federal Research Center for Computer Science and Control of Russian Academy of Sciences, Moscow, Russia AIRI, Moscow, Russia, AIRI, Moscow, Russia MIPT, Dolgoprudny, Russia
Abstract:
The Multi-Agent Pathfinding (MAPF) problem generally asks to find a set of conflict-free paths for a set of agents confined to a graph and is typically solved in a centralized fashion. Conversely, in this work, we investigate the decentralized MAPF setting, where the central controller that possesses all the information on the agents' locations and goals is absent and the agents have to sequentially decide the actions on their own without having access to the full state of the environment. We focus on the practically important lifelong variant of MAPF, which involves continuously assigning new goals to the agents upon arrival at the previous ones. To address this complex problem, we propose a method that integrates two complementary approaches: planning with heuristic search and reinforcement learning through policy optimization. Planning is utilized to construct and re-plan individual paths. We enhance our planning algorithm with a dedicated technique tailored to avoid congestion and increase the throughput of the system. We employ reinforcement learning to discover the collision avoidance policies that effectively guide the agents along the paths. The policy is implemented as a neural network and is effectively trained without any reward-shaping or external guidance. We evaluate our method on a wide range of setups, comparing it to the state-of-the-art solvers. The results show that our method consistently outperforms the learnable competitors, showing higher throughput and better ability to generalize to the maps that were unseen at the training stage. Moreover, our solver outperforms a rule-based one in terms of throughput and is an order of magnitude faster than a state-of-the-art search-based solver. The code is available at https://github.com/AIRI-Institute/learn-to-follow.



Paperid:1946
Authors:Wanfang Su, Lixing Chen, Yang Bai, Xi Lin, Gaolei Li, Zhe Qu, Pan Zhou
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China, School of Computer Science and Engineering, Central South University, Changsha, China, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
Abstract:
Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources. This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of the collaborative view (i.e., the post-collaboration feature) and its underlying relationship to individual views (i.e., pre-collaboration features), which were treated as an opaque procedure by most existing works. We propose a novel framework named CMiMC (Contrastive Mutual Information Maximization for Collaborative Perception) for intermediate collaboration. The core philosophy of CMiMC is to preserve discriminative information of individual views in the collaborative view by maximizing mutual information between pre- and post-collaboration features while enhancing the efficacy of collaborative views by minimizing the loss function of downstream tasks. In particular, we define multi-view mutual information (MVMI) for intermediate collaboration that evaluates correlations between collaborative views and individual views on both global and local scales. We establish CMiMNet based on multi-view contrastive learning to realize estimation and maximization of MVMI, which assists the training of a collaborative encoder for voxel-level feature fusion. We evaluate CMiMC on V2X-Sim 1.0, and it improves the SOTA average precision by 3.08% and 4.44% at 0.5 and 0.7 IoU (Intersection-over-Union) thresholds, respectively. In addition, CMiMC can reduce communication volume to 1/32 while achieving performance comparable to SOTA. Code and Appendix are released at https://github.com/77SWF/CMiMC.
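Contrastive mutual-information maximization of this kind typically reduces to an InfoNCE-style objective between paired features. The snippet below is a generic sketch of that training signal between pre- and post-collaboration features, not the CMiMNet estimator or its global/local weighting.

```python
# InfoNCE-style contrastive lower bound on MI between paired feature sets (illustrative).
import torch
import torch.nn.functional as F

def info_nce(pre_feats, post_feats, temperature=0.1):
    """pre_feats, post_feats: (N, D) aligned pairs; the other rows act as negatives."""
    pre = F.normalize(pre_feats, dim=-1)
    post = F.normalize(post_feats, dim=-1)
    logits = pre @ post.t() / temperature          # (N, N) similarity matrix
    labels = torch.arange(pre.size(0), device=pre.device)
    # Symmetrized: each pre-feature must identify its own post-feature and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
print(loss.item())
```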



Paperid:1947
Authors:Yifan Su, Rishi Veerapaneni, Jiaoyang Li
Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University
Abstract:
The Multi-Agent Path Finding (MAPF) problem involves planning collision-free paths for multiple agents in a shared environment. The majority of MAPF solvers rely on the assumption that an agent can arrive at a specific location at a specific timestep. However, real-world execution uncertainties can cause agents to deviate from this assumption, leading to collisions and deadlocks. Prior research solves this problem by having agents follow a Temporal Plan Graph (TPG), enforcing a consistent passing order at every location as defined in the MAPF plan. However, we show that TPGs are overly strict because, in some circumstances, satisfying the passing order requires agents to wait unnecessarily, leading to longer execution time. To overcome this issue, we introduce a new graphical representation called a Bidirectional Temporal Plan Graph (BTPG), which allows switching passing orders during execution to avoid unnecessary waiting time. We design two anytime algorithms for constructing a BTPG: BTPG-naïve and BTPG-optimized. Experimental results show that following BTPGs consistently outperforms following TPGs, reducing unnecessary waits by 8-20%.



Paperid:1948
Authors:Jingtao Tang, Hang Ma
Simon Fraser University, Simon Fraser University
Abstract:
We study graph-based Multi-Robot Coverage Path Planning (MCPP) that aims to compute coverage paths for multiple robots to cover all vertices of a given 2D grid terrain graph G. Existing graph-based MCPP algorithms first compute a tree cover on G---a forest of multiple trees that cover all vertices---and then employ the Spanning Tree Coverage (STC) paradigm to generate coverage paths on the decomposed graph D of the terrain graph G by circumnavigating the edges of the computed trees, aiming to optimize the makespan (i.e., the maximum coverage path cost among all robots). In this paper, we take a different approach by exploring how to systematically search for good coverage paths directly on D. We introduce a new algorithmic framework, called LS-MCPP, which leverages a local search to operate directly on D. We propose a novel standalone paradigm, Extended-STC (ESTC), that extends STC to achieve complete coverage for MCPP on any decomposed graph, even those resulting from incomplete terrain graphs. Furthermore, we demonstrate how to integrate ESTC with three novel types of neighborhood operators into our framework to effectively guide its search process. Our extensive experiments demonstrate the effectiveness of LS-MCPP, consistently improving the initial solution returned by two state-of-the-art baseline algorithms that compute suboptimal tree covers on G, with a notable reduction in makespan by up to 35.7% and 30.3%, respectively. Moreover, LS-MCPP consistently matches or surpasses the results of optimal tree cover computation, achieving these outcomes with orders of magnitude faster runtime, thereby showcasing its significant benefits for large-scale real-world coverage tasks.



Paperid:1949
Authors:Lebin Yu, Yunbo Qiu, Quanming Yao, Yuan Shen, Xudong Zhang, Jian Wang
Tsinghua university, Tsinghua university, Tsinghua university, Tsinghua university, Tsinghua university, Tsinghua university
Abstract:
Communication in multi-agent reinforcement learning (MARL) has recently been proven to effectively promote cooperation among agents. Since communication in real-world scenarios is vulnerable to noises and adversarial attacks, it is crucial to develop robust communicative MARL techniques. However, existing research in this domain has predominantly focused on passive defense strategies, where agents receive all messages equally, making it hard to balance performance and robustness. We propose an active defense strategy, where agents automatically reduce the impact of potentially harmful messages on the final decision. There are two challenges in implementing this strategy: defining unreliable messages and properly adjusting their impact on the final decision. To address them, we design an Active Defense Multi-Agent Communication framework (ADMAC), which estimates the reliability of received messages and adjusts their impact on the final decision accordingly with the help of a decomposable decision structure. The superiority of ADMAC over existing methods is validated by experiments in three communication-critical tasks under four types of attacks.
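A toy sketch of the "active defense" idea follows: a learned reliability score gates each incoming message before it contributes to the decision, so suspicious messages fade out of the aggregation. This is a generic illustration of reliability-weighted message fusion, not the ADMAC architecture or its decomposable decision structure; dimensions are assumed.

```python
# Reliability-gated message aggregation (illustrative sketch).
import torch
import torch.nn as nn

class ReliabilityGatedAggregator(nn.Module):
    def __init__(self, obs_dim, msg_dim, hidden=64, n_actions=5):
        super().__init__()
        self.reliability = nn.Sequential(            # scores each (obs, message) pair
            nn.Linear(obs_dim + msg_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.decision = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, obs, messages):                # obs: (B, O); messages: (B, M, D)
        obs_rep = obs.unsqueeze(1).expand(-1, messages.size(1), -1)
        pairs = torch.cat([obs_rep, messages], dim=-1)
        weights = torch.sigmoid(self.reliability(pairs))      # (B, M, 1) in [0, 1]
        fused = (weights * messages).sum(dim=1)               # unreliable messages fade out
        return self.decision(torch.cat([obs, fused], dim=-1)) # action logits

agg = ReliabilityGatedAggregator(obs_dim=16, msg_dim=8)
print(agg(torch.randn(2, 16), torch.randn(2, 3, 8)).shape)    # torch.Size([2, 5])
```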



Paperid:1950
Authors:Xin Yu, Rongye Shi, Pu Feng, Yongkai Tian, Simin Li, Shuhao Liao, Wenjun Wu
Beihang University, Beihang University, Beihang University, Beihang University, Beihang University, Beihang University, Beihang University
Abstract:
Incorporating symmetry as an inductive bias into multi-agent reinforcement learning (MARL) has led to improvements in generalization, data efficiency, and physical consistency. While prior research has succeeded in using a perfect symmetry prior, the realm of partial symmetry in the multi-agent domain remains unexplored. To fill this gap, we introduce the partially symmetric Markov game, a new subclass of the Markov game. We then theoretically show that the performance error introduced by utilizing symmetry in MARL is bounded, implying that the symmetry prior can still be useful in MARL even in partial symmetry situations. Motivated by this insight, we propose the Partial Symmetry Exploitation (PSE) framework that is able to adaptively incorporate symmetry prior in MARL under different symmetry-breaking conditions. Specifically, by adaptively adjusting the exploitation of symmetry, our framework is able to achieve superior sample efficiency and overall performance of MARL algorithms. Extensive experiments are conducted to demonstrate the superior performance of the proposed framework over baselines. Finally, we implement the proposed framework in a real-world multi-robot testbed to show its superiority.



Paperid:1951
Authors:Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, Xiaojun Chang, Junge Zhang, Feng Yin, Yitao Liang, Yaodong Yang
SSE, The Chinese University of Hong Kong, Shenzhen Institute for Artificial Intelligence, Peking University, Institute of Automation, Chinese Academy of Sciences, ReLER, AAII, University of Technology Sydney, Institute for Artificial Intelligence, Peking University National Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI), Institute for Artificial Intelligence, Peking University, Institute for Artificial Intelligence, Peking University, Institute for Artificial Intelligence, Peking University, Institute for Artificial Intelligence, Peking University National Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI), Institute for Artificial Intelligence, Peking University, National Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI), ReLER, AAII, University of Technology Sydney, Institute of Automation, Chinese Academy of Sciences, SSE, The Chinese University of Hong Kong, Shenzhen, Institute for Artificial Intelligence, Peking University, Institute for Artificial Intelligence, Peking University
Abstract:
Building agents with adaptive behavior in cooperative tasks stands as a paramount goal in the realm of multi-agent systems. Current approaches to developing cooperative agents rely primarily on learning-based methods, whose policy generalization depends heavily on the diversity of teammates they interact with during the training phase. Such reliance, however, constrains the agents' capacity for strategic adaptation when cooperating with unfamiliar teammates, which becomes a significant challenge in zero-shot coordination scenarios. To address this challenge, we propose ProAgent, a novel framework that harnesses large language models (LLMs) to create proactive agents capable of dynamically adapting their behavior to enhance cooperation with teammates. ProAgent can analyze the present state, and infer the intentions of teammates from observations. It then updates its beliefs in alignment with the teammates' subsequent actual behaviors. Moreover, ProAgent exhibits a high degree of modularity and interpretability, making it easily integrated into various coordination scenarios. Experimental evaluations conducted within the Overcooked-AI environment unveil the remarkable performance superiority of ProAgent, outperforming five methods based on self-play and population-based training when cooperating with AI agents. Furthermore, when partnered with human proxy models, its performance exhibits an average improvement exceeding 10% compared to the current state-of-the-art method. For more information about our project, please visit https://pku-proagent.github.io.



Paperid:1952
Authors:Junkai Zhang, Yifan Zhang, Xi Sheryl Zhang, Yifan Zang, Jian Cheng
Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Nanjing Nanjing Artificial Intelligence Research of AI, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Nanjing Nanjing Artificial Intelligence Research of AI, Institute of Automation, Chinese Academy of Sciences School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Nanjing Nanjing Artificial Intelligence Research of AI
Abstract:
Efficient collaboration in the centralized training with decentralized execution (CTDE) paradigm remains a challenge in cooperative multi-agent systems. We identify divergent action tendencies among agents as a significant obstacle to CTDE's training efficiency, requiring a large number of training samples to achieve a unified consensus on agents' policies. This divergence stems from the lack of adequate team consensus-related guidance signals during credit assignment in CTDE. To address this, we propose Intrinsic Action Tendency Consistency, a novel approach for cooperative multi-agent reinforcement learning. It integrates intrinsic rewards, obtained through an action model, into a reward-additive CTDE (RA-CTDE) framework. We formulate an action model that enables surrounding agents to predict the central agent's action tendency. Leveraging these predictions, we compute a cooperative intrinsic reward that encourages agents to align their actions with their neighbors' predictions. We establish the equivalence between RA-CTDE and CTDE through theoretical analyses, demonstrating that CTDE's training process can be achieved using N individual targets. Building on this insight, we introduce a novel method to combine intrinsic rewards and RA-CTDE. Extensive experiments on challenging tasks in SMAC, MPE, and GRF benchmarks showcase the improved performance of our method.



Paperid:1953
Authors:Enshuai Zhou, Yifan Hao, Rui Zhang, Yuxuan Guo, Zidong Du, Xishan Zhang, Xinkai Song, Chao Wang, Xuehai Zhou, Jiaming Guo, Qi Yi, Shaohui Peng, Di Huang, Ruizhi Chen, Qi Guo, Yunji Chen
University of Science and Technology of China State Key Lab of Processors, Institute of Computing Technology, CAS Cambricon Technologies, State Key Lab of Processors, Institute of Computing Technology, CAS, State Key Lab of Processors, Institute of Computing Technology, CAS, University of Science and Technology of China State Key Lab of Processors, Institute of Computing Technology, CAS Cambricon Technologies, State Key Lab of Processors, Institute of Computing Technology, CAS Shanghai Innovation Center for Processor Technologies, State Key Lab of Processors, Institute of Computing Technology, CAS Cambricon Technologies, State Key Lab of Processors, Institute of Computing Technology, CAS, University of Science and Technology of China, University of Science and Technology of China, State Key Lab of Processors, Institute of Computing Technology, CAS, University of Science and Technology of China State Key Lab of Processors, Institute of Computing Technology, CAS Cambricon Technologies, Intelligent Software Research Center, Institute of Software, CAS, State Key Lab of Processors, Institute of Computing Technology, CAS, Intelligent Software Research Center, Institute of Software, CAS, State Key Lab of Processors, Institute of Computing Technology, CAS, State Key Lab of Processors, Institute of Computing Technology, CAS University of Chinese Academy of Sciences
Abstract:
Research on emergent communication has recently gained significant traction as a promising avenue for the linguistic community to unravel human language's origins and explore artificial intelligence's generalization capabilities. Current research has predominantly concentrated on recognizing qualitative patterns of object attributes (e.g., shape and color) and has paid little attention to the quantitative relationships among object quantities, which are part of numerical concepts. The ability to generalize numerical concepts, i.e., counting and calculations with unseen quantities, is essential, as it mirrors humans' foundational abstract reasoning abilities. In this work, we introduce the NumGame, leveraging the referential game framework, forcing agents to communicate and generalize the numerical concepts effectively. Inspired by the human learning process of numbers, we present a two-stage training approach that sequentially fosters a rudimentary numerical sense followed by the ability of arithmetic calculation, ultimately aiding agents in generating semantically stable and unambiguous language for numerical concepts. The experimental results indicate impressive generalization capabilities to unseen quantities and the regularity of the language that emerges from communication.



Paperid:1954
Authors:Chenyang Zhu, Wen Si, Jinyu Zhu, Zhihao Jiang
Changzhou University, Changzhou University, Changzhou University, ShanghaiTech University
Abstract:
The increasing demands for system complexity and robustness have prompted the integration of temporal logic into Multi-Agent Reinforcement Learning (MARL) to address tasks with non-Markovian properties. However, incorporating non-Markovian properties introduces additional computational complexities, as agents are required to integrate historical data into their decision-making process. Also, optimizing strategies within a multi-agent environment presents significant challenges due to the exponential growth of the state space with the number of agents. In this study, we introduce an innovative hierarchical MARL framework that synthesizes temporal equilibrium strategies through parity games and subsequently encodes them as individual reward machines for MARL coordination. More specifically, we reduce the strategy synthesis problem into an emptiness problem concerning parity games with optimized states and transitions. Following this synthesis step, the temporal equilibrium strategy is decomposed into individual reward machines for decentralized MARL. Theoretical proofs are provided to verify the consistency of the Nash equilibrium between the parallel composition of decomposed strategies and the original strategy. Empirical evidence confirms the efficacy of the proposed synthesis technique, showcasing its ability to reduce state space compared to the state-of-the-art tool. Furthermore, our study highlights the superior performance of the distributed MARL paradigm over centralized approaches when deploying decomposed strategies.



Paperid:1955
Authors:Rui Zou, Sannyuya Liu, Yawei Luo, Yaqi Liu, Jintian Feng, Mengqi Wei, Jianwen Sun
Central China Normal University, Central China Normal University, Zhejiang University, Zhongnan University of Economics and Law, Central China Normal University, Central China Normal University, Central China Normal University
Abstract:
In the evolving artificial intelligence domain, hybrid human-machine systems have emerged as a transformative research area. While many studies have concentrated on individual human-machine interactions, there is a lack of focus on multi-human and multi-machine dynamics. This paper delves into these nuances by introducing a novel statistical framework that discerns integration accuracy in terms of precision and diversity. Empirical studies reveal that performance surges consistently with scale, in either human or machine settings. However, hybrid systems present complexities. Their performance is intricately tied to the human-to-machine ratio. Interestingly, as the scale expands, integration performance growth isn't limitless. It reaches a threshold influenced by model diversity. This introduces a pivotal `knee point', signifying the optimal balance between performance and scale. This knowledge is vital for resource allocation in practical applications. Grounded in rigorous evaluations using public datasets, our findings emphasize the framework's robustness in refining integrated systems.



Paperid:1956
Authors:Chaoyi Ai, Kewei Tu
School of Information Science and Technology, ShanghaiTech University Shanghai Engineering Research Center of Intelligent Vision and Imaging, School of Information Science and Technology, ShanghaiTech University Shanghai Engineering Research Center of Intelligent Vision and Imaging
Abstract:
This paper presents an approach to frame semantic role labeling (FSRL), a task in natural language processing that identifies semantic roles within a text following the theory of frame semantics. Unlike previous approaches, which do not adequately model correlations and interactions amongst arguments, we propose arbitrary-order conditional random fields (CRFs) that are capable of modeling full interaction amongst an arbitrary number of arguments of a given predicate. To achieve tractable representation and inference, we apply canonical polyadic decomposition to the arbitrary-order factor in our proposed CRF and utilize mean-field variational inference for approximate inference. We further unfold our iterative inference procedure into a recurrent neural network that is connected to our neural encoder and scorer, enabling end-to-end training and inference. Finally, we also improve our model with several techniques such as span-based scoring and decoding. Our experiments show that our approach achieves state-of-the-art performance in FSRL.
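To see why a canonical polyadic (CP) decomposition makes an arbitrary-order factor tractable, the sketch below represents the factor score over n arguments with per-argument matrices U_i of shape (labels, rank), so a mean-field message to one argument costs O(n · labels · rank) instead of labels^n. This is a generic illustration of CP-decomposed mean-field updates; the dimensions, update schedule, and unary terms are assumptions, not the paper's model.

```python
# Mean-field inference with a CP-decomposed high-order factor (illustrative).
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def mean_field_message(U, q, i):
    """Expected factor score for each label of argument i under marginals q[j], j != i."""
    rank = U[0].shape[1]
    prod_others = np.ones(rank)
    for j, (Uj, qj) in enumerate(zip(U, q)):
        if j != i:
            prod_others *= qj @ Uj          # (rank,) expected contribution per component
    return U[i] @ prod_others               # (labels,) message to argument i

def mean_field_inference(U, unary, iters=10):
    """Iteratively refine marginals q under the CP-decomposed high-order factor."""
    q = [softmax(u) for u in unary]
    for _ in range(iters):
        for i in range(len(U)):
            q[i] = softmax(unary[i] + mean_field_message(U, q, i))
    return q

# Example: a third-order factor over 3 arguments, 5 labels each, rank 8.
rng = np.random.default_rng(0)
U = [rng.normal(size=(5, 8)) for _ in range(3)]
unary = [rng.normal(size=5) for _ in range(3)]
print([np.round(m, 2) for m in mean_field_inference(U, unary)])
```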



Paperid:1957
Authors:Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip JB Jackson
University of Surrey, University of Surrey, University of Surrey, University of Surrey, University of Surrey
Abstract:
Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks following their wide adoption in the computer vision domain. Despite the difference in information distribution between audio spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored for the audio domain. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled Time-Frequency Audio Transformer) that facilitates interactions across time, frequency, spatial, and channel dimensions. The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new benchmarks for state-of-the-art (SOTA) performance. Notably, on the challenging AudioSet 2M classification task, our approach demonstrates a substantial improvement of 4.4% when the model is trained from scratch and 3.2% when the model is initialised from ImageNet-1K pretrained weights. In addition, we present comprehensive ablation studies to investigate the impact and efficacy of our proposed approach. The codebase and pretrained weights are available on https://github.com/ta012/DTFAT.git



Paperid:1958
Authors:Kenichiro Ando, Satoshi Sekine, Mamoru Komachi
RIKEN AIP, RIKEN AIP, Hitotsubashi University
Abstract:
Wikipedia can be edited by anyone and thus contains sentences of varying quality. Therefore, Wikipedia includes some poor-quality edits, which are often marked up by other editors. While editors' reviews enhance the credibility of Wikipedia, it is hard to check all edited text. Assisting in this process is very important, but a large and comprehensive dataset for studying it does not currently exist. Here, we propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia. Each sentence is extracted from the entire revision history of English Wikipedia, and the target quality labels were carefully investigated and selected. WikiSQE has about 3.4 M sentences with 153 quality labels. In the experiment with automatic classification using competitive machine learning models, sentences that had problems with citation, syntax/semantics, or propositions were found to be more difficult to detect. In addition, by performing human annotation, we found that the model we developed performed better than the crowdsourced workers. WikiSQE is expected to be a valuable resource for other tasks in NLP.



Paperid:1959
Authors:Pragyan Banerjee, Abhinav Java, Surgan Jandial, Simra Shahid, Shaz Furniturewala, Balaji Krishnamurthy, Sumit Bhatia
Indian Institute of Technology Guwahati, MDSR Labs, Adobe, MDSR Labs, Adobe, MDSR Labs, Adobe, Birla Institute of Technology and Science, Pilani, MDSR Labs, Adobe, MDSR Labs, Adobe
Abstract:
Fairness in Language Models (LMs) remains a longstanding challenge, given the inherent biases in training data that can be perpetuated by models and affect downstream tasks. Recent methods employ expensive retraining or attempt debiasing during inference by constraining model outputs to contrast from a reference set of biased templates/exemplars. Regardless, they do not address the primary goal of fairness: maintaining equitability across different demographic groups. In this work, we posit that generating unbiased output for one demographic under a given context follows from being aware of the outputs for other demographics under the same context. To this end, we propose Counterfactually Aware Fair InferencE (CAFIE), a framework that dynamically compares the model’s understanding of diverse demographics to generate more equitable sentences. We conduct an extensive empirical evaluation using base LMs of varying sizes and across three diverse datasets and find that CAFIE outperforms strong baselines. CAFIE produces fairer text and strikes the best balance between fairness and language modeling capability.
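A very rough sketch of the counterfactual-aware decoding idea follows: the next-token distribution for the original context is blended with distributions obtained after swapping the demographic term, so no single group's continuation dominates. The simple averaging rule, the interpolation weight, and the example prompts are assumptions for illustration, not CAFIE's actual update.

```python
# Counterfactual-aware next-token blending (illustrative, not CAFIE's exact rule).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_probs(text):
    with torch.no_grad():
        logits = model(**tok(text, return_tensors="pt")).logits[0, -1]
    return torch.softmax(logits, dim=-1)

context = "The nurse said that she"
counterfactuals = ["The nurse said that he", "The nurse said that they"]

p_orig = next_token_probs(context)
p_cf = torch.stack([next_token_probs(c) for c in counterfactuals]).mean(dim=0)
alpha = 0.5                                   # interpolation strength (assumed)
p_fair = alpha * p_orig + (1 - alpha) * p_cf  # blended, demographic-aware distribution
print(tok.decode(int(p_fair.argmax())))
```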



Paperid:1960
Authors:Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler
ETH Zurich, ETH Zurich, ETH Zurich, ETH Zurich, Warsaw University of Technology, ETH Zurich, Cledar, Cledar, Cledar, Cledar, ETH Zurich
Abstract:
We introduce Graph of Thoughts (GoT): a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT). The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information ("LLM thoughts") are vertices, and edges correspond to dependencies between these vertices. This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops. We illustrate that GoT offers advantages over the state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT, while simultaneously reducing costs by >31%. We ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings LLM reasoning closer to human thinking or brain mechanisms such as recurrence, both of which form complex networks.
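A toy sketch of the underlying data structure follows: thoughts are vertices, edges record dependencies, and a new thought may aggregate several predecessors, which is the operation that chain- or tree-structured prompting cannot express. The `generate` callback standing in for the LLM call, and the operation names, are illustrative assumptions rather than the framework's API.

```python
# Toy graph-of-thoughts structure with branch and aggregate operations (illustrative).
from dataclasses import dataclass, field

@dataclass
class Thought:
    content: str
    parents: list = field(default_factory=list)

class GraphOfThoughts:
    def __init__(self, generate):
        self.generate = generate          # callable: prompt -> new thought text
        self.thoughts = []

    def add(self, content, parents=()):
        t = Thought(content, list(parents))
        self.thoughts.append(t)
        return t

    def expand(self, thought, k=2):
        """Branch: derive k refinements of a single thought."""
        return [self.add(self.generate(f"Refine: {thought.content}"), [thought])
                for _ in range(k)]

    def aggregate(self, thoughts):
        """Merge several thoughts into one, forming a vertex with multiple parents."""
        joined = " | ".join(t.content for t in thoughts)
        return self.add(self.generate(f"Combine: {joined}"), thoughts)

# Tiny demo with a stand-in generator in place of an LLM call.
got = GraphOfThoughts(generate=lambda prompt: prompt.upper())
root = got.add("sort the list [3, 1, 2]")
branches = got.expand(root, k=2)
merged = got.aggregate(branches)
print(merged.content, "| parents:", len(merged.parents))
```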



Paperid:1961
Authors:Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail Burtsev
MIPT, AIRI MIPT, AIRI, LIMS
Abstract:
A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pretrained transformer models to extend input context length while linearly scaling compute. Our approach demonstrates the capability to store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy. Experiments with language modeling tasks show perplexity improvement as the number of processed input segments increases. These results underscore the effectiveness of our method, which has significant potential to enhance long-term dependency handling in natural language understanding and generation tasks, as well as enable large-scale context processing for memory-intensive applications.
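The recurrence pattern behind this kind of memory augmentation can be sketched compactly: the long input is split into segments, a small set of memory embeddings is prepended to each segment, and the encoder's output at those memory positions is carried to the next segment. The module below is a simplified, assumed configuration illustrating that loop, not the paper's exact model or training setup.

```python
# Simplified segment-level recurrence with memory tokens (illustrative).
import torch
import torch.nn as nn

class RecurrentMemoryEncoder(nn.Module):
    def __init__(self, d_model=64, n_mem=4, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.init_memory = nn.Parameter(torch.randn(n_mem, d_model))
        self.n_mem = n_mem

    def forward(self, segments):            # segments: list of (B, L, d_model) tensors
        batch = segments[0].size(0)
        memory = self.init_memory.unsqueeze(0).expand(batch, -1, -1)
        outputs = []
        for seg in segments:
            x = torch.cat([memory, seg], dim=1)       # memory tokens + segment tokens
            h = self.encoder(x)
            memory = h[:, : self.n_mem]               # updated memory flows to the next segment
            outputs.append(h[:, self.n_mem :])
        return torch.cat(outputs, dim=1)

enc = RecurrentMemoryEncoder()
segs = [torch.randn(2, 16, 64) for _ in range(3)]     # a long input split into 3 segments
print(enc(segs).shape)                                 # torch.Size([2, 48, 64])
```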



Paperid:1962
Authors:Yan Cai, Linlin Wang, Ye Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, Liang He
East China Normal University, East China Normal University Shanghai Artificial Intelligence Laboratory, East China Normal University, Hasso Plattner Institute University of Potsdam, Shanghai Jiao Tong University Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University Shanghai Artificial Intelligence Laboratory, East China Normal University
Abstract:
The emergence of various medical large language models (LLMs) in the medical domain has highlighted the need for unified evaluation standards, as manual evaluation of LLMs proves to be time-consuming and labor-intensive. To address this issue, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain, comprising 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. In particular, this benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. MedBench replicates the educational progression and clinical practice experiences of doctors in Mainland China, thereby establishing itself as a credible benchmark for assessing the mastery of knowledge and reasoning abilities in medical language learning models. We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. These findings elucidate both the capabilities and limitations of LLMs within the context of MedBench, with the ultimate goal of aiding the medical research community.



Paperid:1963
Authors:Yuang Cai, Yuyu Yuan
Beijing University of Posts and Telecommunications Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing University of Posts and Telecommunications Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education
Abstract:
Cross-Lingual Summarization (CLS) involves generating a summary for a given document in another language. Most of the existing approaches adopt multi-task training and knowledge distillation, which increases the training cost and improves the performance on CLS tasks intuitively but unexplainably. In this work, we propose the Cross-Attention Reinforcement (CAR) module and incorporate the module into the transformer backbone to formulate the CAR-Transformer. The CAR module formulates a pseudo summarization policy parameterized by the cross-attention weights reinforced by the ground-truth monolingual summary without introducing extra model parameters. Our approach demonstrates more consistent improvement across CLS tasks compared to traditional multi-task training methods and outperforms the fine-tuned vanilla mBART by 3.67 and the best-performing multi-task training approach by 1.48 in ROUGE-L F1 score on the WikiLingua Korean-to-English CLS task.



Paperid:1964
Authors:Yuyang Chai, Zhuang Li, Jiahui Liu, Lei Chen, Fei Li, Donghong Ji, Chong Teng
Wuhan University, Monash University, Wuhan University, Wuhan University, Wuhan University, Wuhan University, Wuhan University
Abstract:
Despite significant advancements in multi-label text classification, the ability of existing models to generalize to novel and seldom-encountered complex concepts, which are compositions of elementary ones, remains underexplored. This research addresses this gap. By creating unique data splits across three benchmarks, we assess the compositional generalization ability of existing multi-label text classification models. Our results show that these models often fail to generalize to compositional concepts encountered infrequently during training, leading to inferior performance on tests with these new combinations. To address this, we introduce a data augmentation method that leverages two innovative text generation models designed to enhance the classification models' capacity for compositional generalization. Our experiments show that this data augmentation approach significantly improves the compositional generalization capabilities of classification models on our benchmarks, with both generation models surpassing other text generation baselines. Our code is available at https://github.com/yychai74/LD-VAE.



Paperid:1965
Authors:Mingshan Chang, Min Yang, Qingshan Jiang, Ruifeng Xu
Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Harbin Institute of Technology (Shenzhen)
Abstract:
Despite having achieved notable success in aspect-based sentiment analysis (ABSA), deep neural networks are susceptible to spurious correlations between input features and output labels, leading to poor robustness. In this paper, we propose a novel Counterfactual-Enhanced Information Bottleneck framework (called CEIB) to reduce spurious correlations for ABSA. CEIB extends the information bottleneck (IB) principle to a factual-counterfactual balancing setting by integrating augmented counterfactual data, with the goal of learning a robust ABSA model. Concretely, we first devise a multi-pattern prompting method, which utilizes the large language model (LLM) to generate high-quality counterfactual samples from the original samples. Then, we employ the information bottleneck principle and separate the mutual information into factual and counterfactual parts. In this way, we can learn effective and robust representations for the ABSA task by balancing the predictive information of these two parts. Extensive experiments on five benchmark ABSA datasets show that our CEIB approach achieves superior prediction performance and robustness over the state-of-the-art baselines. Code and data to reproduce the results in this paper are available at: https://github.com/shesshan/CEIB.



Paperid:1966
Authors:Delong Chen, Jianfeng Liu, Wenliang Dai, Baoyuan Wang
Xiaobing.AI Hong Kong University of Science and Technology, Xiaobing.AI, Hong Kong University of Science and Technology, Xiaobing.AI
Abstract:
Recent research has demonstrated that the multi-task fine-tuning of multi-modal Large Language Models (LLMs) using an assortment of annotated downstream vision-language datasets significantly enhances their performance. Yet, during this process, a side effect, which we term the "multi-modal alignment tax", surfaces. This side effect negatively impacts the model's ability to format responses appropriately - for instance, its "politeness" - due to the overly succinct and unformatted nature of raw annotations, resulting in reduced human preference. In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing, "polite" format. Polite Flamingo is trained to reconstruct high-quality responses from their automatically distorted counterparts and is subsequently applied to a vast array of vision-language datasets for response rewriting. After rigorous filtering, we generate the PF-1M dataset and further validate its value by fine-tuning a multi-modal LLM with it. Combined with novel methodologies including U-shaped multi-stage tuning and multi-turn augmentation, the resulting model, Clever Flamingo, demonstrates its advantages in both multi-modal understanding and response politeness according to automated and human evaluations. Code and dataset are available at https://github.com/ChenDelong1999/polite-flamingo



Paperid:1967
Authors:Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
Abstract:
Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.



Paperid:1968
Authors:Qian Chen, Taolin Zhang, Dongyang Li, Xiaofeng He
School of Computer Science and Technology, East China Normal University, Shanghai, China, Alibaba Group, School of Computer Science and Technology, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China NPPA Key Laboratory of Publishing Integration Development, ECNUP, Shanghai, China
Abstract:
The minimal feature removal problem in the post-hoc explanation area aims to identify the minimal feature set (MFS). Prior studies that use a greedy algorithm to calculate the minimal feature set ignore feature interactions and rely on a monotonicity assumption that cannot be satisfied in general scenarios. In order to address the above limitations, we propose a Cooperative Integrated Dynamic Refining method (CIDR) to efficiently discover minimal feature sets. Specifically, we design Cooperative Integrated Gradients (CIG) to detect interactions between features. By incorporating CIG and characteristics of the minimal feature set, we transform the minimal feature removal problem into a knapsack problem. Additionally, we devise an auxiliary Minimal Feature Refinement algorithm to determine the minimal feature set from numerous candidate sets. To the best of our knowledge, our work is the first to address the minimal feature removal problem in the field of natural language processing. Extensive experiments demonstrate that CIDR is capable of tracing representative minimal feature sets with improved interpretability across various models and datasets.
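As a rough illustration of the knapsack framing (not CIDR's actual formulation), the sketch below treats each feature's attribution as a discretized value and finds the fewest features whose summed attribution reaches a target; the function name, discretization scale, and threshold are hypothetical.

```python
def min_features_for_threshold(scores, threshold, scale=100):
    """Schematic knapsack-style DP: pick the fewest features whose summed
    attribution reaches `threshold`. Scores are discretized so sums can be
    used as DP states, and sums are capped at the target (illustrative only)."""
    values = [max(0, round(s * scale)) for s in scores]
    target = round(threshold * scale)
    # dp maps an achievable (capped) attribution sum -> smallest index set reaching it
    dp = {0: frozenset()}
    for i, v in enumerate(values):
        for total, chosen in list(dp.items()):
            new_total = min(total + v, target)
            candidate = chosen | {i}
            if new_total not in dp or len(candidate) < len(dp[new_total]):
                dp[new_total] = candidate
    return dp.get(target)  # None if the threshold is unreachable
```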



Paperid:1969
Authors:Ruirui Chen, Chengwei Qin, Weifeng Jiang, Dongkyu Choi
Agency for Science, Technology and Research (A*STAR), Nanyang Technological University, Nanyang Technological University, Agency for Science, Technology and Research
Abstract:
Event extraction is an important task in natural language processing that focuses on mining event-related information from unstructured text. Despite considerable advancements, it is still challenging to achieve satisfactory performance in this task, and issues like data scarcity and imbalance obstruct progress. In this paper, we introduce an innovative approach where we employ Large Language Models (LLMs) as expert annotators for event extraction. We strategically include sample data from the training dataset in the prompt as a reference, ensuring alignment between the data distribution of LLM-generated samples and that of the benchmark dataset. This enables us to craft an augmented dataset that complements existing benchmarks, alleviating the challenges of data imbalance and scarcity and thereby enhancing the performance of fine-tuned models. We conducted extensive experiments to validate the efficacy of our proposed method, and we believe that this approach holds great potential for propelling the development and application of more advanced and reliable event extraction systems in real-world scenarios.



Paperid:1970
Authors:Wei Chen, Yuxuan Liu, Zhao Zhang, Fuzhen Zhuang, Jiang Zhong
Institute of Artificial Intelligence, Beihang University, Beijing 100191, China, College of Computer Science, Chongqing University, Chongqing 400044, China, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, Institute of Artificial Intelligence, Beihang University, Beijing 100191, China Zhongguancun Laboratory, Beijing, China, College of Computer Science, Chongqing University, Chongqing 400044, China
Abstract:
Aspect prediction (AP) and sentiment prediction (SP) are representative applications in fine-grained sentiment analysis. They can be considered sequential tasks, where AP identifies mentioned aspects in a sentence, and SP infers fine-grained sentiments for these aspects. Recent models perform the aspect-sentiment prediction in a joint manner, but heavily rely on the feature interactions of aspect and sentiment. One drawback is that they ignore that the correlation strength between aspect features and sentiment features varies across different sentences, and employing a fixed feature interaction strategy may limit effective knowledge transfer across tasks. To tackle this issue, in this paper, we propose an Adaptive Inter-task Feature Interaction framework, AIFI, for joint aspect-sentiment prediction. Specifically, we introduce a novel contrast-based alignment method based on contrastive learning. Our approach considers the AP-specific and SP-specific representations of a given sentence as a positive pair, while the representation of another random sentence serves as a negative example. Moreover, we propose an inter-task feature correlation network to predict the contrast strength, which is determined by the temperature coefficient in the InfoNCE loss. This dynamic correlation adjustment enhances the model's ability to capture proper feature interactions more efficiently. Experimental results on three datasets validate the effectiveness of our approach.
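For readers who want the mechanics of the contrast-strength idea, here is a minimal sketch of an InfoNCE alignment loss whose temperature is predicted per example by a small correlation network; the module name, architecture, and temperature range are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveInfoNCE(nn.Module):
    """Align AP-specific and SP-specific sentence representations with an
    InfoNCE loss whose temperature is predicted per example (illustrative)."""
    def __init__(self, dim, t_min=0.05, t_max=1.0):
        super().__init__()
        self.corr_net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.t_min, self.t_max = t_min, t_max

    def forward(self, h_ap, h_sp):
        # h_ap, h_sp: (batch, dim) representations of the same batch of sentences
        h_ap, h_sp = F.normalize(h_ap, dim=-1), F.normalize(h_sp, dim=-1)
        # predicted contrast strength mapped into a temperature range
        t = self.t_min + (self.t_max - self.t_min) * torch.sigmoid(
            self.corr_net(torch.cat([h_ap, h_sp], dim=-1))).squeeze(-1)  # (batch,)
        logits = (h_ap @ h_sp.t()) / t.unsqueeze(1)  # other sentences act as negatives
        targets = torch.arange(h_ap.size(0), device=h_ap.device)
        return F.cross_entropy(logits, targets)
```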



Paperid:1971
Authors:Xinhao Chen, Chong Yang, Changzhi Sun, Man Lan, Aimin Zhou
School of Computer Science and Technology, East China Normal University, Shanghai, China AntGroup, Shanghai, China, AntGroup, Shanghai, China, Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China Shanghai Institute of AI for Education, East China Normal University, Shanghai, China, School of Computer Science and Technology, East China Normal University, Shanghai, China
Abstract:
We study the problem of extracting emotions and the causes behind these emotions in conversations. Existing methods either tackle them separately or jointly model them at the coarse-grained level of emotions (fewer emotion categories) and causes (utterance-level causes). In this work, we aim to jointly extract more fine-grained emotions and causes. We construct a fine-grained dataset, FG-RECCON, which includes 16 fine-grained emotion categories and span-level causes. To further improve the fine-grained extraction performance, we propose to utilize causal discourse knowledge via knowledge distillation. Specifically, the teacher model learns to predict causal connective words between utterances, and then guides the student model in identifying both the fine-grained emotion labels and causal spans. Experimental results demonstrate that our distillation method achieves state-of-the-art performance on both the RECCON and FG-RECCON datasets.



Paperid:1972
Authors:Xinjie Chen, Kai Fan, Wei Luo, Linlin Zhang, Libo Zhao, Xinggao Liu, Zhongqiang Huang
Zhejiang University, Alibaba DAMO Academy, Alibaba DAMO Academy, Zhejiang University, South China University of Technology, Zhejiang University, Alibaba DAMO Academy
Abstract:
To achieve high-quality translation with low latency, a Simultaneous Speech Translation (SimulST) system relies on a policy module to decide whether to translate immediately or wait for additional streaming input, along with a translation model capable of effectively handling partial speech input. Prior research has tackled these components separately, either using "wait-k" policies based on fixed-length segments or detected word boundaries, or dynamic policies based on different strategies (e.g., meaningful units), while employing offline models for prefix-to-prefix translation. In this paper, we propose Divergence-Guided Simultaneous Speech Translation (DiG-SST), a tightly integrated approach focusing on both translation quality and latency for streaming input. Specifically, we introduce a simple yet effective prefix-based strategy for training translation models with partial speech input, and develop an adaptive policy that makes read/write decisions for the translation model based on the expected divergence in translation distributions resulting from future input. Our experiments on multiple translation directions of the MuST-C benchmark demonstrate that our approach achieves a better trade-off between translation quality and latency compared to existing methods.
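To make the read/write idea concrete, here is a toy decision rule that keeps reading while the next-token distribution is expected to shift substantially once more speech arrives; how that expected distribution is estimated (a learned predictor in the paper) is left as an assumption here.

```python
import torch
import torch.nn.functional as F

def read_or_write(p_now, p_expected, threshold=0.05):
    """Toy adaptive policy: READ if the divergence between the current
    next-token distribution and the one expected after more input is large,
    otherwise WRITE. `p_expected` stands in for a learned estimate."""
    kl = F.kl_div(p_now.log(), p_expected, reduction="sum")  # KL(p_expected || p_now)
    return "READ" if kl.item() > threshold else "WRITE"

# e.g. read_or_write(torch.softmax(logits_partial, -1), torch.softmax(logits_est, -1))
# where logits_partial / logits_est are hypothetical decoder outputs.
```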



Paperid:1973
Authors:Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao
University of Science and Technology of China, University of Science and Technology of China, MOE Key Laboratory of Trustworthy Distributed Computing and Service, Beijing University of Posts and Telecommunications, State Key Laboratory of Communication Content Cognition, People’s Daily Online, Beijing, China, University of Science and Technology of China
Abstract:
While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions as well as investigate the resulting behavior of LLMs. To fill this gap, we propose a new benchmark, CoDI-Eval, to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraints-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also deliberate the candidate task taxonomy with even finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further developments. Different from existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time. We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints and showing that there is still a significant gap between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval.



Paperid:1974
Authors:Yuheng Chen, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences University of Chinese Academy of Sciences
Abstract:
Pre-trained language models (PLMs) contain vast amounts of factual knowledge, but how the knowledge is stored in the parameters remains unclear. This paper delves into the complex task of understanding how factual knowledge is stored in multilingual PLMs, and introduces the Architecture-adapted Multilingual Integrated Gradients method, which successfully localizes knowledge neurons more precisely compared to current methods, and is more universal across various architectures and languages. Moreover, we conduct an in-depth exploration of knowledge neurons, leading to the following two important discoveries: (1) The discovery of Language-Independent Knowledge Neurons, which store factual knowledge in a form that transcends language. We design cross-lingual knowledge editing experiments, demonstrating that the PLMs can accomplish this task based on language-independent neurons; (2) The discovery of Degenerate Knowledge Neurons, a novel type of neuron showing that different knowledge neurons can store the same fact. Its property of functional overlap endows the PLMs with a robust mastery of factual knowledge. We design fact-checking experiments, proving that the degenerate knowledge neurons can help the PLMs to detect wrong facts. Experiments corroborate these findings, shedding light on the mechanisms of factual knowledge storage in multilingual PLMs and contributing valuable insights to the field. The code is available at https://github.com/heng840/AMIG.



Paperid:1975
Authors:Yuyan Chen, Yichen Yuan, Panjun Liu, Dayiheng Liu, Qinghao Guan, Mengfei Guo, Haiming Peng, Bang Liu, Zhixu Li, Yanghua Xiao
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Institute of Automation, Chinese Academy of Sciences, School of Computer Science, Beijing Institute of Technology, Alibaba DAMO Academy, University of Zurich, Beijing Jiaotong University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, RALI \& Mila, Université de Montréal, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University Fudan-Aishu Cognitive Intelligence Joint Research Center, Shanghai, China
Abstract:
Humor is a crucial part of human communication. Understanding humor and generating humorous responses in dialogue can provide natural and empathic human-computer interactions. However, most existing pre-trained language models (PLMs) perform unsatisfactorily in humor generation. On one hand, the serious shortage of humor corpora and datasets poses challenges for constructing models that can understand and generate humorous expressions. On the other hand, humor generation relies on rich knowledge and commonsense, which is often tacit and unspoken. In this paper, we construct the largest Chinese Explainable Humor Response Dataset to date with chain-of-humor and humor mind map annotations, which can be used to comprehensively evaluate as well as improve the humorous response ability of PLMs. We further design humor-related auxiliary tasks to enhance PLMs' humorous response performance. Extensive evaluations demonstrate that our proposed dataset and auxiliary tasks effectively help PLMs to generate humorous responses, laying the groundwork for future humor research.



Paperid:1976
Authors:Siyuan Cheng, Ningyu Zhang, Bozhong Tian, Xi Chen, Qingbin Liu, Huajun Chen
Zhejiang University Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph Platform and Content Group, Tencent, Zhejiang University Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph, Zhejiang University Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph, Platform and Content Group, Tencent, Platform and Content Group, Tencent, Zhejiang University Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph Donghai Laboratory
Abstract:
Recent decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, making them difficult to modify without re-training after deployment. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. This task is designed to facilitate rapid, data-efficient updates to KG embeddings without compromising the performance of other aspects. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines, demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hypernetwork to edit/add facts. Our comprehensive experimental results reveal that KGEditor excels in updating specific facts without impacting the overall performance, even when faced with limited training resources. Code and datasets will be available at https://github.com/AnonymousForPapers/DeltaKG.



Paperid:1977
Authors:Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yaowei Li, Xianwei Zhuang, Yuexian Zou
Peking University, Peking University, Peking University, Peking University, Peking University, Peking University
Abstract:
Multi-Intent spoken language understanding (SLU) can handle complicated utterances expressing multiple intents, which has attracted increasing attention from researchers. Although existing models have achieved promising performance, most of them still suffer from two leading problems: (1) each intent has its specific scope and the semantic information outside the scope might potentially hinder accurate predictions, i.e. the scope barrier; (2) only the guidance from intent to slot is modeled but the guidance from slot to intent is often neglected, i.e. unidirectional guidance. In this paper, we propose a novel Multi-Intent SLU framework termed HAOT, which utilizes hierarchical attention to divide the scopes of each intent and applies optimal transport to achieve mutual guidance between slot and intent. Experiments demonstrate that our model achieves state-of-the-art performance on two public Multi-Intent SLU datasets, obtaining a 3.4-point improvement in overall accuracy on the MixATIS dataset compared to the previous best models.



Paperid:1978
Authors:Yi Cheng, Wenge Liu, Jian Wang, Chak Tou Leong, Yi Ouyang, Wenjie Li, Xian Wu, Yefeng Zheng
The Hong Kong Polytechnic University, Baidu Inc., Beijing, China, The Hong Kong Polytechnic University, The Hong Kong Polytechnic University, Jarvis Research Center, Tencent YouTu Lab, The Hong Kong Polytechnic University, Jarvis Research Center, Tencent YouTu Lab, Jarvis Research Center, Tencent YouTu Lab
Abstract:
In recent years, there has been a growing interest in exploring dialogues with more complex goals, such as negotiation, persuasion, and emotional support, which go beyond traditional service-focused dialogue systems. Apart from the requirement for much more sophisticated strategic reasoning and communication skills, a significant challenge of these tasks lies in the difficulty of objectively measuring the achievement of their goals in a quantifiable way, making it difficult for existing research to directly optimize the dialogue procedure towards them. In our work, we emphasize the multifaceted nature of complex dialogue goals and argue that it is more feasible to accomplish them by comprehensively considering and jointly promoting their different aspects. To this end, we propose a novel dialogue framework, Cooper, which coordinates multiple specialized agents, each dedicated to a specific dialogue goal aspect separately, to approach the complex objective. Through this divide-and-conquer manner, we make complex dialogue goals more approachable and elicit greater intelligence via the collaboration of individual agents. Experiments on persuasion and emotional support dialogues demonstrate the superiority of our method over a set of competitive baselines. Our codes are available at https://github.com/YiCheng98/Cooper.



Paperid:1979
Authors:Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee
Korea University, Seoul, Korea University, Seoul, Korea University, Seoul
Abstract:
Diffusion-based generative models have recently exhibited powerful generative performance. However, as many attributes exist in the data distribution and owing to several limitations of sharing the model parameters across all levels of the generation process, it remains challenging to control specific styles for each attribute. To address the above problem, we introduce decoupled denoising diffusion models (DDDMs) with disentangled representations, which can enable effective style transfers for each attribute in generative models. In particular, we apply DDDMs to voice conversion (VC) tasks, tackling the intricate challenge of disentangling and individually transferring each speech attribute, such as linguistic information, intonation, and timbre. First, we use a self-supervised representation to disentangle the speech representation. Subsequently, the DDDMs are applied to resynthesize the speech from the disentangled representations for style transfer with respect to each attribute. Moreover, we also propose the prior mixup for robust voice style transfer, which uses the converted representation of the mixed style as a prior distribution for the diffusion models. The experimental results reveal that our method outperforms publicly available VC models. Furthermore, we show that our method provides robust generative performance even when using a smaller model size. Audio samples are available at https://hayeong0.github.io/DDDM-VC-demo/.



Paperid:1980
Authors:Timothy Chu, Zhao Song, Chiwun Yang
Google, Adobe Research, Sun Yat-sen University
Abstract:
Large language models (LLMs) and generative AI have played a transformative role in computer research and applications. Controversy has arisen as to whether these models output copyrighted data, which can occur if the data the models are trained on is copyrighted. LLMs are built on the transformer neural network architecture, which in turn relies on a mathematical computation called Attention that uses the softmax function. In this paper, we observe that large language model training and optimization can be seen as a softmax regression problem. We then establish a method of efficiently performing softmax regression, in a way that prevents the regression function from generating copyrighted data. This establishes a theoretical method of training large language models in a way that avoids generating copyrighted data.
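To ground the framing, the snippet below solves a plain softmax regression problem (find x so that softmax(Ax) matches a target distribution b) by gradient descent; the paper's copyright-avoiding variant adds constraints that are not reproduced here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_regression(A, b, steps=500, lr=0.1):
    """Minimize 0.5 * ||softmax(A x) - b||^2 by gradient descent (illustrative)."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        p = softmax(A @ x)
        J = np.diag(p) - np.outer(p, p)   # Jacobian of softmax w.r.t. the logits
        x -= lr * (A.T @ (J @ (p - b)))   # chain rule through A x
    return x
```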



Paperid:1981
Authors:Maxime Darrin, Guillaume Staerman, Eduardo Dadalto Camara Gomes, Jackie C. K. Cheung, Pablo Piantanida, Pierre Colombo
International Laboratory on Learning Systems MILA - Quebec AI Institute McGill University Université Paris-Saclay, Université Paris-Saclay CNRS INRIA, CEA, Paris, Université Paris-Saclay Laboratoire signaux et systèmes CNRS CentraleSupelec, McGill University MILA - Quebec AI Institute Canada CIFAR AI Chair, Mila, International Laboratory on Learning Systems MILA - Quebec AI Institute Université Paris-Saclay CNRS, Université Paris-Saclay CentraleSupelec Equal, Paris MICS
Abstract:
Out-of-distribution (OOD) detection is a rapidly growing field due to new robustness and security requirements driven by an increased number of AI-based systems. Existing OOD textual detectors often rely on anomaly scores (e.g., Mahalanobis distance) computed on the embedding output of the last layer of the encoder. In this work, we observe that OOD detection performance varies greatly depending on the task and layer output. More importantly, we show that the usual choice (the last layer) is rarely the best one for OOD detection and that far better results can be achieved, provided that an oracle selects the best layer. We propose a data-driven, unsupervised method to leverage this observation to combine layer-wise anomaly scores. In addition, we extend classical textual OOD benchmarks by including classification tasks with a more significant number of classes (up to 150), which reflects more realistic settings. On this augmented benchmark, we show that the proposed post-aggregation methods achieve robust and consistent results comparable to using the best layer according to an oracle while removing manual feature selection altogether.
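A stripped-down version of the layer-wise scoring is sketched below: one Mahalanobis score per encoder layer, combined with a set of weights. The paper's contribution is learning that aggregation without supervision, which the simple weighted sum here does not reproduce.

```python
import numpy as np

def mahalanobis_score(x, mean, cov_inv):
    d = x - mean
    return float(d @ cov_inv @ d)

def layerwise_ood_score(layer_embeddings, layer_stats, weights=None):
    """One anomaly score per layer, aggregated with (possibly data-driven)
    weights. `layer_stats` holds (mean, inverse covariance) fitted on
    in-distribution embeddings for each layer (illustrative sketch)."""
    scores = [mahalanobis_score(x, m, c) for x, (m, c) in zip(layer_embeddings, layer_stats)]
    if weights is None:
        weights = np.ones(len(scores)) / len(scores)
    return float(np.dot(weights, scores))
```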



Paperid:1982
Authors:Zahra Delbari, Nafise Sadat Moosavi, Mohammad Taher Pilehvar
Tehran Institute for Advanced Studies, Department of Computer Science, University of Sheffield, Cardiff University
Abstract:
With the alarming rise of hate speech in online communities, the demand for effective NLP models to identify instances of offensive language has reached a critical point. However, the development of such models heavily relies on the availability of annotated datasets, which are scarce, particularly for less-studied languages. To bridge this gap for the Persian language, we present a novel dataset specifically tailored to multi-label hate speech detection. Our dataset, called Phate, consists of an extensive collection of over seven thousand manually-annotated Persian tweets, offering a rich resource for training and evaluating hate speech detection models for this language. Notably, each annotation in our dataset specifies the targeted group of hate speech and includes a span of the tweet which elucidates the rationale behind the assigned label. The incorporation of this information expands the potential applications of our dataset, facilitating the detection of targeted online harm or allowing the benchmark to serve research on the interpretability of hate speech detection models. The dataset, annotation guideline, and all associated codes are accessible at https://github.com/Zahra-D/Phate.



Paperid:1983
Authors:Qiuyu Ding, Hailong Cao, Tiejun Zhao
Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology
Abstract:
Most Bilingual Lexicon Induction (BLI) methods retrieve word translation pairs by finding the closest target word for a given source word based on cross-lingual word embeddings (WEs). However, we find that solely retrieving translations from the source-to-target perspective leads to some false positive translation pairs, which significantly harm the precision of BLI. To address this problem, we propose a novel and effective method to improve translation pair retrieval in cross-lingual WEs. Specifically, we consider both source-side and target-side perspectives throughout the retrieval process to alleviate false positive word pairings that emanate from a single perspective. On a benchmark dataset of BLI, our proposed method achieves competitive performance compared to existing state-of-the-art (SOTA) methods. It demonstrates effectiveness and robustness across six experimental languages, including similar language pairs and distant language pairs, under both supervised and unsupervised settings.
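One simple way to realize the "both perspectives" idea is a mutual-nearest-neighbour filter, sketched below; the paper's exact retrieval criterion may differ, and the k parameter is an assumption.

```python
import numpy as np

def bidirectional_bli(src_emb, tgt_emb, k=1):
    """Keep pair (i, j) only if j is the nearest target word of source i AND
    i is among the k nearest source words of target j (illustrative filter)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                           # cosine similarity matrix (S x T)
    fwd = sim.argmax(axis=1)                    # source -> best target
    bwd_topk = np.argsort(-sim, axis=0)[:k, :]  # target -> top-k sources
    return [(i, int(j)) for i, j in enumerate(fwd) if i in bwd_topk[:, j]]
```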



Paperid:1984
Authors:Zeyuan Ding, Zhihao Yang, Ling Luo, Yuanyuan Sun, Hongfei Lin
School of Computer Science and Technology, Dalian University of Technology, China, School of Computer Science and Technology, Dalian University of Technology, China, School of Computer Science and Technology, Dalian University of Technology, China, School of Computer Science and Technology, Dalian University of Technology, China, School of Computer Science and Technology, Dalian University of Technology, China
Abstract:
Retrieving appropriate records from the external knowledge base to generate informative responses is the core capability of end-to-end task-oriented dialogue systems (EToDs). Most of the existing methods additionally train the retrieval model or use the memory network to retrieve the knowledge base, which decouples the knowledge retrieval task from the response generation task, making it difficult to jointly optimize and failing to capture the internal relationship between the two tasks. In this paper, we propose a simple and unified generative model for task-oriented dialogue systems, which recasts the EToDs task as a single sequence generation task and uses maximum likelihood training to train the two tasks in a unified manner. To prevent the generation of non-existent records, we design the prefix trie to constrain the model generation, which ensures consistency between the generated records and the existing records in the knowledge base. Experimental results on three public benchmark datasets demonstrate that our method achieves robust performance on generating system responses and outperforms the baseline systems. To facilitate future research in this area, the code is available at https://github.com/dzy1011/Uni-ToD.
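The prefix-trie constraint is easy to picture with a small sketch: build a trie over the tokenized KB records and, at each decoding step, allow only the tokens that keep the generated span inside some record. Token ids and the hook into the decoder are assumptions here.

```python
class PrefixTrie:
    """Minimal trie over tokenized KB records, used to mask decoding so that
    only existing records can be generated (illustrative sketch)."""
    def __init__(self, record_token_sequences):
        self.root = {}
        for seq in record_token_sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []   # prefix does not match any KB record
            node = node[tok]
        return list(node.keys())

# During generation, mask all logits except those in trie.allowed_next(tokens_so_far).
```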



Paperid:1985
Authors:Zixiang Ding, Guoqing Jiang, Shuai Zhang, Lin Guo, Wei Lin
Meituan, Meituan, Meituan, Meituan, Independent Researcher
Abstract:
We observe two phenomena with respect to quantity and capacity: 1) more teachers are not always better for multi-teacher knowledge distillation, and 2) a stronger teacher is not always better for single-teacher knowledge distillation. To trade off the quantity and capacity of the teacher ensemble, in this paper, we propose a new distillation paradigm named Dynamic Knowledge Distillation (DynaKD) that learns an adaptive categorical distribution to stochastically employ a teacher from the teacher ensemble in each step, transferring knowledge from the teacher ensemble into the student. DynaKD has three advantages: 1) it can preserve the diversity of each teacher via a one-to-one distillation manner instead of several-for-one, 2) it can make the best of a powerful teacher via the multi-level assistant teachers in the ensemble, and 3) it can dynamically determine the importance of each teacher for various tasks. To verify the effectiveness of the proposed approach, we conduct extensive experiments for BERT compression on the GLUE benchmark. Experimental results show that the proposed approach achieves state-of-the-art scores compared to previous compression approaches on five out of seven downstream tasks, including pushing MRPC F1 and accuracy to 92.2 (a 1.4-point absolute improvement) and RTE accuracy to 76.2 (a 2.8-point absolute improvement). Moreover, we also conduct extensive experiments for image classification on CIFAR-100. Similarly, DynaKD achieves state-of-the-art performance.
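The step-wise teacher sampling can be sketched as follows: a learnable categorical distribution over the ensemble, from which one teacher is drawn per training step for a standard temperature-scaled KD loss. How the distribution itself is adapted (the sampling below is not differentiable as written) and the temperature value are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticTeacherKD(nn.Module):
    """Sample one teacher per step from a categorical distribution and distil
    from it with a temperature-scaled KL loss (illustrative sketch)."""
    def __init__(self, num_teachers, tau=2.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_teachers))  # categorical parameters
        self.tau = tau

    def forward(self, student_logits, teacher_logits_list):
        idx = int(torch.distributions.Categorical(logits=self.logits).sample())
        t_logits = teacher_logits_list[idx]
        return F.kl_div(
            F.log_softmax(student_logits / self.tau, dim=-1),
            F.softmax(t_logits / self.tau, dim=-1),
            reduction="batchmean",
        ) * self.tau ** 2
```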



Paperid:1986
Authors:Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu
MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shenzhen Research Institute of Big Data, AISpeech Ltd, MoE Key Lab of Artificial Intelligence, AI Institute; X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University
Abstract:
The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional mel-spectrogram acoustic features in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models are restricted to generating speech only in a left-to-right direction, making them unsuitable for speech editing where both preceding and following contexts are provided. Furthermore, these models rely on acoustic tokens, which have audio quality limitations imposed by the performance of audio codec models. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav. CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, enabling it to incorporate the semantic context and maintain seamless concatenation with the surrounding context. Following that, CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into waveforms, taking into consideration the acoustic context. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms of speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing. Audio samples are available at https://cpdu.github.io/unicats.



Paperid:1987
Authors:Hang Du, Guoshun Nan, Sicheng Zhang, Binzhu Xie, Junrui Xu, Hehe Fan, Qimei Cui, Xiaofeng Tao, Xudong Jiang
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Zhejiang University, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Nanyang Technological University
Abstract:
Multimodal Sarcasm Understanding (MSU) has a wide range of applications in the news field such as public opinion analysis and forgery detection. However, existing MSU benchmarks and approaches usually focus on sentence-level MSU. In document-level news, sarcasm clues are sparse or small and are often concealed in long text. Moreover, compared to sentence-level comments like tweets, which mainly focus on only a few trends or hot topics (e.g., sports events), content in the news is considerably diverse. Models created for sentence-level MSU may fail to capture sarcasm clues in document-level news. To fill this gap, we present a comprehensive benchmark for Document-level Multimodal Sarcasm Understanding (DocMSU). Our dataset contains 102,588 pieces of news with text-image pairs, covering 9 diverse topics such as health, business, etc. The proposed large-scale and diverse DocMSU significantly facilitates the research of document-level MSU in real-world scenarios. To take on the new challenges posed by DocMSU, we introduce a fine-grained sarcasm comprehension method to properly align the pixel-level image features with word-level textual features in documents. Experiments demonstrate the effectiveness of our method, showing that it can serve as a baseline approach to the challenging DocMSU.



Paperid:1988
Authors:Yangkai Du, Tengfei Ma, Lingfei Wu, Xuhong Zhang, Shouling Ji
Zhejiang University, Stony Brook University, Anytime.AI, Zhejiang University, Zhejiang University
Abstract:
Code Clone Detection, which aims to retrieve functionally similar programs from large code bases, has been attracting increasing attention. Modern software often involves a diverse range of programming languages. However, current code clone detection methods are generally limited to only a few popular programming languages due to insufficient annotated data as well as their own model design constraints. To address these issues, we present AdaCCD, a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language. AdaCCD leverages language-agnostic code representations from pre-trained programming language models and proposes an Adaptively Refined Contrastive Learning framework to transfer knowledge from resource-rich languages to resource-poor languages. We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages. AdaCCD achieves significant improvements over other baselines, and achieves comparable performance to supervised fine-tuning.



Paperid:1989
Authors:Subhabrata Dutta, Ishan Pandey, Joykirat Singh, Sunny Manchanda, Soumen Chakrabarti, Tanmoy Chakraborty
Indian Institute of Technology Delhi, Indraprastha Institute of Information Technology Delhi, Indraprastha Institute of Information Technology Delhi, DYSL-AI, India, Indian Institute of Technology Bombay, Indian Institute of Technology Delhi
Abstract:
Large Language Models (LLMs) exhibit zero-shot mathematical reasoning capacity as a behavior emergent with scale, commonly manifesting as chain-of-thoughts (CoT) reasoning. However, multiple empirical findings suggest that this prowess is exclusive to LLMs that have exorbitant sizes (beyond 50 billion parameters). Meanwhile, educational neuroscientists suggest that symbolic algebraic manipulation be introduced around the same time as arithmetic word problems so as to modularize language-to-formulation, symbolic manipulation of the formulation, and endgame arithmetic. In this paper, we start with the hypothesis that much smaller LMs, which are weak at multi-step reasoning, can achieve reasonable arithmetic reasoning if arithmetic word problems are posed as a formalize-then-solve task. In our architecture, which we call SyReLM, the LM serves the role of a translator to map natural language arithmetic questions into a formal language (FL) description. A symbolic solver then evaluates the FL expression to obtain the answer. A small frozen LM, equipped with an efficient low-rank adapter, is capable of generating FL expressions that incorporate natural language descriptions of the arithmetic problem (e.g., variable names and their purposes, formal expressions combining variables, etc.). We adopt policy-gradient reinforcement learning to train the adapted LM, informed by the non-differentiable symbolic solver. This marks a sharp departure from the recent development in tool-augmented LLMs, in which the external tools (e.g., calculator, Web search, etc.) are essentially detached from the learning phase of the LM. SyReLM shows massive improvements (e.g., a +30.65 absolute point improvement in accuracy on the SVAMP dataset using the GPT-J 6B model) over base LMs, while keeping our testbed easy to diagnose and interpret, and within the reach of most researchers.
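The formalize-then-solve split can be illustrated in a few lines: the LM only has to emit a formal expression, and a symbolic engine does the arithmetic. The prompt wording and the `lm_generate` callable are placeholders, not SyReLM's actual interface.

```python
import sympy

def formalize_then_solve(question, lm_generate):
    """Toy formalize-then-solve loop: the LM translates the word problem into
    a formal expression; SymPy evaluates it (illustrative sketch)."""
    prompt = f"Translate into a single arithmetic expression.\nQ: {question}\nExpression:"
    expr_text = lm_generate(prompt)        # e.g. "(12 - 3) * 4"
    return sympy.simplify(sympy.sympify(expr_text))

# formalize_then_solve("Tom had 12 apples, gave away 3, then quadrupled the rest.",
#                      lambda p: "(12 - 3) * 4")   # -> 36
```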



Paperid:1990
Authors:Caoyun Fan, Jindou Chen, Yaohui Jin, Hao He
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Game theory, as an analytical tool, is frequently utilized to analyze human behavior in social science research. With the high alignment between the behavior of Large Language Models (LLMs) and humans, a promising research direction is to employ LLMs as substitutes for humans in game experiments, enabling social science research. However, despite numerous empirical studies on the combination of LLMs and game theory, the capability boundaries of LLMs in game theory remain unclear. In this research, we endeavor to systematically analyze LLMs in the context of game theory. Specifically, rationality, as the fundamental principle of game theory, serves as the metric for evaluating players' behavior: building a clear desire, refining beliefs about uncertainty, and taking optimal actions. Accordingly, we select three classical games (dictator game, Rock-Paper-Scissors, and ring-network game) to analyze to what extent LLMs can achieve rationality in these three aspects. The experimental results indicate that even the current state-of-the-art LLM (GPT-4) exhibits substantial disparities compared to humans in game theory. For instance, LLMs struggle to build desires based on uncommon preferences, fail to refine beliefs from many simple patterns, and may overlook or modify refined beliefs when taking actions. Therefore, we consider that introducing LLMs into game experiments in the field of social science should be approached with greater caution.



Paperid:1991
Authors:Chenghao Fan, Wei Wei, Xiaoye Qu, Zhenyi Lu, Wenfeng Xie, Yu Cheng, Dangyang Chen
Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Ping An Property & Casualty Insurance Company of China, Ltd, The Chinese University of Hong Kong, Ping An Property&Casualty insurance company of China, Ltd
Abstract:
Recently, prompt-tuning with pre-trained language models (PLMs) has demonstrated a significant ability to enhance relation extraction (RE) tasks. However, in low-resource scenarios, where the available training data is scarce, previous prompt-based methods may still perform poorly for prompt-based representation learning due to a superficial understanding of the relation. To this end, we highlight the importance of learning high-quality relation representations in low-resource scenarios for RE, and propose a novel prompt-based relation representation method, named MVRE (Multi-View Relation Extraction), to better leverage the capacity of PLMs to improve the performance of RE within the low-resource prompt-tuning paradigm. Specifically, MVRE decouples each relation into different perspectives to encompass multi-view relation representations for maximizing the likelihood during relation inference. Furthermore, we also design a Global-Local loss and a Dynamic-Initialization method for better alignment of the multi-view relation-representing virtual words, which contain the semantics of relation labels, during the optimization and initialization process. Extensive experiments on three benchmark datasets show that our method can achieve state-of-the-art performance in low-resource settings.



Paperid:1992
Authors:Zipeng Fan, Jing Zhang, Peng Zhang, Qianxi Lin, Hui Gao
Tianjin University, Tianjin University, Tianjin University, Tianjin University, Tianjin University
Abstract:
In recent years, researchers have developed novel Quantum-Inspired Neural Network (QINN) frameworks for Natural Language Processing (NLP) tasks, inspired by the theoretical investigations of quantum cognition. However, we have found that the training efficiency of QINNs is significantly lower than that of classical networks. We analyze the unitary transformation modules of existing QINNs based on the time displacement symmetry of quantum mechanics and discover that they resemble the first-order Euler method in mathematical form. The high truncation error associated with the Euler method affects the training efficiency of QINNs. In order to enhance the training efficiency of QINNs, we generalize QINNs' unitary transformation modules to Quantum-like high-order Runge-Kutta methods (QRKs). Moreover, we present the results of experiments on conversation emotion recognition and text classification tasks to validate the effectiveness of the proposed approach.



Paperid:1993
Authors:Meng Fang, Shilong Deng, Yudi Zhang, Zijing Shi, Ling Chen, Mykola Pechenizkiy, Jun Wang
University of Liverpool Eindhoven University of Technology, University of Liverpool, Eindhoven University of Technology, University of Technology Sydney, University of Technology Sydney, Eindhoven University of Technology, University College London
Abstract:
A wide range of real-world applications are characterized by their symbolic nature, necessitating a strong capability for symbolic reasoning. This paper investigates the potential application of Large Language Models (LLMs) as symbolic reasoners. We focus on text-based games, significant benchmarks for agents with natural language capabilities, particularly in symbolic tasks like math, map reading, sorting, and applying common sense in text-based worlds. To facilitate these agents, we propose an LLM agent designed to tackle symbolic challenges and achieve in-game objectives. We begin by initializing the LLM agent and informing it of its role. The agent then receives observations and a set of valid actions from the text-based games, along with a specific symbolic module. With these inputs, the LLM agent chooses an action and interacts with the game environments. Our experimental results demonstrate that our method significantly enhances the capability of LLMs as automated agents for symbolic reasoning, and our LLM agent is effective in text-based games involving symbolic tasks, achieving an average performance of 88% across all tasks.



Paperid:1994
Authors:Yan Fang, Qingyao Ai, Jingtao Zhan, Yiqun Liu, Xiaolong Wu, Zhao Cao
Quan Cheng Laboratory & DCST, Tsinghua University & Zhongguancun Laboratory, Quan Cheng Laboratory & DCST, Tsinghua University & Zhongguancun Laboratory, Quan Cheng Laboratory & DCST, Tsinghua University & Zhongguancun Laboratory, Quan Cheng Laboratory & DCST, Tsinghua University & Zhongguancun Laboratory, Huawei Poisson Lab, Huawei Poisson Lab
Abstract:
Recently, dense retrieval (DR) models, which represent queries and documents with fixed-width vectors and retrieve relevant ones via nearest neighbor search, have drawn increasing attention from the IR community. However, previous studies have shown that the effectiveness of DR critically relies on sufficient training signals, which leads to severe performance degradation when applied in out-of-domain scenarios, where large-scale training data are usually unavailable. To solve this problem, existing studies adopt a data-augmentation-plus-joint-training paradigm to construct weak/pseudo supervisions on the target domain and combine them with the large-scale human annotated data on the source domain to train the DR models. However, they do not explicitly distinguish the data and the supervision signals in the training process and simply assume that the DR models are mighty enough to capture and memorize different domain knowledge and relevance matching patterns without guidance, which, as shown in this paper, is not true. Based on this observation, we propose a Robust Multi-Supervision Combining strategy (RMSC) that decouples the domain and supervision signals by explicitly telling the DR models how the domain data and supervision signals are combined in the training data with specially designed soft tokens. With the extra soft tokens to store the domain-specific and supervision-specific knowledge, RMSC allows the DR models to conduct retrieval based on human-like relevance matching patterns and target-specific language distribution on the target domain without human annotations. Extensive experiments on zero-shot DR benchmarks show that RMSC significantly improves the ranking performance on the target domain compared to strong DR baselines and domain adaptation methods, while being stable during training and can be combined with query generation or second-stage pre-training.



Paperid:1995
Authors:Yu Fu, Deyi Xiong, Yue Dong
University of California, Riverside, Tianjin University, University of California, Riverside
Abstract:
To mitigate potential risks associated with language models (LMs), recent AI detection research proposes incorporating watermarks into machine-generated text through random vocabulary restrictions and utilizing this information for detection. In this paper, we show that watermarking algorithms designed for LMs cannot be seamlessly applied to conditional text generation (CTG) tasks without a notable decline in downstream task performance. To address this issue, we introduce a simple yet effective semantic-aware watermarking algorithm that considers the characteristics of conditional text generation with the input context. Compared to the baseline watermarks, our proposed watermark yields significant improvements in both automatic and human evaluations across various text generation models, including BART and Flan-T5, for CTG tasks such as summarization and data-to-text generation. Meanwhile, it maintains detection ability with higher z-scores but lower AUC scores, suggesting the presence of a detection paradox that poses additional challenges for watermarking CTG.
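For context, the detection side of the standard green-list watermark that such work builds on looks roughly like the sketch below: count how often each token falls in a "green" list seeded by its predecessor and compute a binomial z-score. The hashing scheme is a toy stand-in, and the paper's semantic-aware variant (conditioning on the input context) is not shown.

```python
import hashlib
from math import sqrt

def green_fraction_z(token_ids, vocab_size, gamma=0.5):
    """z-score for over-use of 'green' tokens in a candidate text
    (toy detector for a green-list watermark; illustrative only)."""
    n = len(token_ids) - 1
    if n <= 0:
        return 0.0
    hits = 0
    for prev, tok in zip(token_ids, token_ids[1:]):
        seed = int(hashlib.sha256(str(prev).encode()).hexdigest(), 16)
        if (tok + seed) % vocab_size < gamma * vocab_size:  # toy green-list membership
            hits += 1
    return (hits - gamma * n) / sqrt(n * gamma * (1 - gamma))
```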



Paperid:1996
Authors:Zihao Fu, Meiru Zhang, Zaiqiao Meng, Yannan Shen, David Buckeridge, Nigel Collier
University of Cambridge, University of Cambridge, University of Glasgow University of Cambridge, McGill University, McGill University, University of Cambridge
Abstract:
Infectious disease outbreaks continue to pose a significant threat to human health and well-being. To improve disease surveillance and understanding of disease spread, several surveillance systems have been developed to monitor daily news alerts and social media. However, existing systems lack thorough epidemiological analysis in relation to corresponding alerts or news, largely due to the scarcity of well-annotated reports data. To address this gap, we introduce the Biomedical Alert News Dataset (BAND), which includes 1,508 samples from existing reported news articles, open emails, and alerts, as well as 30 epidemiology-related questions. These questions necessitate the model's expert reasoning abilities, thereby offering valuable insights into the outbreak of the disease. The BAND dataset brings new challenges to the NLP world, requiring better inference capability of the content and the ability to infer important information. We provide several benchmark tasks, including Named Entity Recognition (NER), Question Answering (QA), and Event Extraction (EE), to demonstrate existing models' capabilities and limitations in handling epidemiology-specific tasks. It is worth noting that some models may lack the human-like inference capability required to fully utilize the corpus. To the best of our knowledge, the BAND corpus is the largest corpus of well-annotated biomedical outbreak alert news with elaborately designed questions, making it a valuable resource for epidemiologists and NLP researchers alike.



Paperid:1997
Authors:Kaizhi Gao, Tianyu Wang, Zhongjing Ma, Suli Zou
Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology
Abstract:
Pretrained encoder-decoder models are widely applied in Task-Oriented Dialog (TOD) systems on the session level, mainly focusing on modeling the dialog semantic information. Dialogs imply structural information indicating the interaction among user utterances, belief states, database search results, system acts and responses, which is also crucial for TOD systems. In addition, for the system acts, additional pre-training and datasets are considered to improve their accuracies, undoubtedly introducing a burden. Therefore, a novel end-to-end TOD system named Winnie is proposed in this paper to improve the TOD performance. First, to make full use of the intrinsic structural information, supervised contrastive learning is adopted to narrow the gap in the representation space between text representations of the same category and enlarge the overall continuous representation margin between text representations of different categories in dialog context. Then, a system act classification task is introduced for policy optimization during fine-tuning. Empirical results show that Winnie substantially improves the performance of the TOD system. By introducing the supervised contrastive and system act classification losses, Winnie achieves state-of-the-art results on benchmark datasets, including MultiWOZ2.2, In-Car, and Camrest676. Their end-to-end combined scores are improved by 3.2, 1.9, and 1.1 points, respectively.



Paperid:1998
Authors:Shen Gao, Zhengliang Shi, Minghang Zhu, Bowen Fang, Xin Xin, Pengjie Ren, Zhumin Chen, Jun Ma, Zhaochun Ren
Shandong University, Shandong University, Shandong University, Shandong University, Shandong University, Shandong University, Shandong University, Shandong University, Leiden University
Abstract:
Augmenting large language models (LLMs) with external tools has emerged as a promising approach to extending the capability of LLMs. Although there are some works that employ open-source LLMs for the tool-learning task, most of them are trained in a controlled environment in which LLMs only learn to execute the human-provided tools. However, selecting proper tools from a large toolset is also a crucial ability for tool-learning models to be applied in real-world applications. Existing methods usually directly employ self-instruction methods to train the model, which ignores differences in tool complexity. In this paper, we propose Confucius, a novel tool-learning framework to train LLMs to use complicated tools in real-world scenarios, which contains two main phases: (1) we first propose a multi-stage learning method to teach the LLM to use various tools from an easy-to-difficult curriculum; (2) we then propose Iterative Self-instruct from Introspective Feedback (ISIF) to dynamically construct the dataset to improve the ability to use complicated tools. Extensive experiments conducted in both controlled and real-world settings demonstrate the superiority of our tool-learning framework in the real-world application scenario compared to both tuning-free (e.g., ChatGPT, Claude) and tuning-based baselines (e.g., GPT4Tools).



Paperid:1999
Authors:Xiang Gao, Kamalika Das
Intuit, Intuit
Abstract:
Large language models (LLMs) are becoming increasingly important for machine learning applications. However, it can be challenging to align LLMs with our intent, particularly when we want to generate content that is preferable over others or when we want the LLM to respond in a certain style or tone that is hard to describe. To address this challenge, we propose an approach that uses contrastive examples to better describe our intent. This involves providing positive examples that illustrate the true intent, along with negative examples that show what characteristics we want LLMs to avoid. The negative examples can be retrieved from labeled data, written by a human, or generated by the LLM itself. Before generating an answer, we ask the model to analyze the examples to teach itself what to avoid. This reasoning step provides the model with the appropriate articulation of the user's need and guides it towards generating a better answer. We tested our approach on both synthesized and real-world datasets, including StackExchange and Reddit, and found that it significantly improves performance compared to standard few-shot prompting.
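
To make the prompting setup above concrete, here is a minimal sketch of how a prompt with positive and negative examples plus a self-analysis step might be assembled; the template wording and the function name are illustrative assumptions, not the paper's exact prompts.

    def build_contrastive_prompt(question, positive_examples, negative_examples):
        # Show desirable and undesirable answers, then ask the model to reason
        # about the difference before answering (illustrative wording).
        lines = ["You will answer a question. First study the examples below."]
        for q, a in positive_examples:
            lines.append(f"GOOD example\nQ: {q}\nA: {a}")
        for q, a in negative_examples:
            lines.append(f"BAD example (avoid answers like this)\nQ: {q}\nA: {a}")
        lines.append(
            "Step 1: Briefly analyze what makes the good answers good and the bad answers bad.\n"
            "Step 2: Answer the new question, following the good examples and avoiding the bad ones."
        )
        lines.append(f"New question: {question}")
        return "\n\n".join(lines)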



Paperid:2000
Authors:Ling Ge, Chunming Hu, Guanghui Ma, Jihong Liu, Hong Zhang
School of Computer Science and Engineering, Beihang University, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China College of Software, Beihang University, Beijing, China Zhongguancun Laboratory, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China, School of Mechanical Engineering and Automation, Beihang University, Beijing, China, National Computer Network Emergency Response Technical Team / Coordination Center of China, Beijing, China
Abstract:
Multi-source cross-lingual transfer learning deals with the transfer of task knowledge from multiple labeled source languages to an unlabeled target language under language shift. Existing methods typically focus on weighting the predictions produced by language-specific classifiers of different sources that follow a shared encoder. However, all source languages share the same encoder, which is updated by all these languages. The extracted representations inevitably contain different source languages' information, which may disturb the learning of the language-specific classifiers. Additionally, due to the language gap, language-specific classifiers trained with source labels are unable to make accurate predictions for the target language. Both facts impair the model's performance. To address these challenges, we propose a Disentangled and Adaptive Network (DA-Net). First, we devise a feedback-guided collaborative disentanglement method that seeks to purify the input representations of classifiers, thereby mitigating mutual interference from multiple sources. Second, we propose a class-aware parallel adaptation method that aligns class-level distributions for each source-target language pair, thereby alleviating the language gap within each pair. Experimental results on three different tasks involving 38 languages validate the effectiveness of our approach.



Paperid:2001
Authors:Ling Ge, Chunming Hu, Guanghui Ma, Jihong Liu, Hong Zhang
School of Computer Science and Engineering, Beihang University, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China College of Software, Beihang University, Beijing, China Zhongguancun Laboratory, Beijing, China, School of Computer Science and Engineering, Beihang University, Beijing, China, School of Mechanical Engineering and Automation, Beihang University, Beijing, China, National Computer Network Emergency Response Technical Team / Coordination Center of China, Beijing, China
Abstract:
Knowledge distillation-based approaches have recently yielded state-of-the-art (SOTA) results for cross-lingual NER tasks in zero-shot scenarios. These approaches typically employ a teacher network trained with the labeled source (rich-resource) language to infer pseudo soft labels for the unlabeled target (zero-shot) language, and force a student network to approximate these pseudo labels to achieve knowledge transfer. However, previous works have rarely discussed the issue of pseudo-label noise caused by the source-target language gap, which can mislead the training of the student network and result in negative knowledge transfer. This paper proposes a discrepancy- and uncertainty-aware Denoising Knowledge Distillation model (DenKD) to tackle this issue. Specifically, DenKD uses a discrepancy-aware denoising representation learning method to optimize the class representations of the target language produced by the teacher network, thus enhancing the quality of pseudo labels and reducing noisy predictions. Further, DenKD employs an uncertainty-aware denoising method to quantify the pseudo-label noise and adjust the focus of the student network on different samples during knowledge distillation, thereby mitigating the noise's adverse effects. We conduct extensive experiments on 28 languages, including 4 languages not covered by the pre-trained models, and the results demonstrate the effectiveness of our DenKD.



Paperid:2002
Authors:Walter Gerych, Yara Rizk, Vatche Isahagian, Vinod Muthusamy, Evelyn Duesterwald, Praveen Venkateswaran
MIT CSAIL, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research
Abstract:
There are increasingly many large language models (LLMs) available to the public. While these LLMs have exhibited impressive abilities on a variety of tasks, any individual LLM may do well on some tasks and worse on others. Additionally, the performance of these models is heavily dependent on the choice of prompt template used. For instance, they exhibit sensitivity to the few-shot examples chosen or brittleness to the wording of instructions. Moreover, a prompt template that makes a model perform well for one input may not be the optimal template for another input. This necessitates an approach for adaptively selecting an LLM and prompt template pair for each input. Recent work has shown that the accuracy of an LLM's responses is correlated with the LLM's confidence in those responses. Thus, a natural choice for selecting which model and prompt template to use is to select the pair that is most confident in its response. However, existing confidence metrics are expensive to calculate, necessitating multiple calls to each LLM and prompt pair. We thus propose an approach to predict the confidence of each pair using an auxiliary regression model that is inexpensive to run. Using this auxiliary model, we select the LLM and prompt template with the highest predicted confidence for a given input. Results on a range of benchmark datasets show that our confidence-based instance-level prompt search method consistently improves the performance of LLMs.
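
As an illustration of the selection mechanism described above, the sketch below fits an inexpensive auxiliary regressor on logged confidences and picks the (model, template) pair with the highest predicted confidence; the hand-crafted features, the toy log, and the gradient-boosting regressor are assumptions for the example rather than the paper's design.

    from itertools import product
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Hypothetical featurizer for an (input text, model id, template id) triple.
    def featurize(text, model_id, template_id):
        return np.array([len(text), text.count("?"), model_id, template_id], dtype=float)

    # Toy log of past runs: (input text, model id, template id, observed confidence).
    logged_runs = [
        ("What is the capital of France?", 0, 0, 0.95),
        ("What is the capital of France?", 1, 0, 0.70),
        ("Summarize this report.", 0, 1, 0.40),
        ("Summarize this report.", 1, 1, 0.85),
    ]
    X = np.stack([featurize(t, m, p) for t, m, p, _ in logged_runs])
    y = np.array([c for _, _, _, c in logged_runs])
    regressor = GradientBoostingRegressor().fit(X, y)

    def select_pair(text, model_ids, template_ids):
        # Return the (model id, template id) pair with the highest predicted confidence,
        # without issuing any calls to the LLMs themselves.
        candidates = list(product(model_ids, template_ids))
        features = np.stack([featurize(text, m, p) for m, p in candidates])
        return candidates[int(np.argmax(regressor.predict(features)))]

    print(select_pair("What is the capital of Spain?", model_ids=[0, 1], template_ids=[0, 1]))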



Paperid:2003
Authors:Daniel Gilo, Shaul Markovitch
Department of Computer Science Technion - Israel Institute of Technology, Department of Computer Science Technion - Israel Institute of Technology
Abstract:
One of the prominent methods for explaining the decision of a machine-learning classifier is by a counterfactual example. Most current algorithms for generating such examples in the textual domain are based on generative language models. Generative models, however, are trained to minimize a specific loss function in order to fulfill certain requirements for the generated texts. Any change in the requirements may necessitate costly retraining, thus potentially limiting their applicability. In this paper, we present a general search-based framework for generating counterfactual explanations in the textual domain. Our framework is model-agnostic, domain-agnostic, anytime, and does not require retraining in order to adapt to changes in the user requirements. We model the task as a search problem in a space where the initial state is the classified text, and the goal state is a text in a given target class. Our framework includes domain-independent modification operators, but can also exploit domain-specific knowledge through specialized operators. The search algorithm attempts to find a text from the target class with minimal user-specified distance from the original classified text.
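
The search formulation above can be illustrated with a small best-first search over word-substitution operators and a user-specified edit distance; the toy classifier, lexicon, and single operator below are hypothetical stand-ins rather than the paper's domain-specific components.

    import heapq

    def word_substitution_ops(text, lexicon):
        # Domain-independent operator: replace one word with a lexicon alternative.
        words = text.split()
        for i, w in enumerate(words):
            for alt in lexicon.get(w.lower(), []):
                yield " ".join(words[:i] + [alt] + words[i + 1:])

    def edit_distance(a, b):
        # User-specified distance: here, number of differing word positions.
        wa, wb = a.split(), b.split()
        return sum(x != y for x, y in zip(wa, wb)) + abs(len(wa) - len(wb))

    def counterfactual_search(text, classify, target_class, lexicon, max_steps=1000):
        # Best-first search for a minimally edited text classified as target_class.
        frontier = [(0, text)]
        seen = {text}
        for _ in range(max_steps):
            if not frontier:
                break
            dist, current = heapq.heappop(frontier)
            if classify(current) == target_class:
                return current
            for nxt in word_substitution_ops(current, lexicon):
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(frontier, (edit_distance(text, nxt), nxt))
        return None

    # Toy usage: flip a sentiment "classifier" that keys on a single word.
    lexicon = {"terrible": ["great", "fine"], "boring": ["engaging"]}
    classify = lambda t: "positive" if "great" in t or "engaging" in t else "negative"
    print(counterfactual_search("the movie was terrible and boring", classify, "positive", lexicon))

Because the frontier is ordered by distance from the original text, the first goal state popped is a minimally edited counterfactual under the chosen distance.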



Paperid:2004
Authors:Zhuocheng Gong, Jiahao Liu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan
Wangxuan Institute of Computer Technology, Peking University, Meituan, Meituan, Meituan, Wangxuan Institute of Computer Technology, Peking University State Key Laboratory of Media Convergence Production Technology and Systems, Gaoling School of Artificial Intelligence, Renmin University of China
Abstract:
Quantization has emerged as a promising technique for improving the memory and computational efficiency of large language models (LLMs). Though the tradeoff between performance and efficiency is well-known, there is still much to be learned about the relationship between quantization and LLM performance. To shed light on this relationship, we propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs. We call this approach "the lens of perturbation". Using this lens, we conduct experiments with various artificial perturbations to explore their impact on LLM performance. Our findings reveal several connections between the properties of perturbations and LLM performance, providing insights into the failure cases of uniform quantization and suggesting potential solutions to improve the robustness of LLM quantization. To demonstrate the significance of our findings, we implement a simple non-uniform quantization approach based on our insights. Our experiments show that this approach achieves minimal performance degradation on both 4-bit weight quantization and 8-bit quantization for weights and activations. These results validate the correctness of our approach and highlight its potential to improve the efficiency of LLMs without sacrificing performance.
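
A minimal sketch of the perturbation view: symmetric per-tensor uniform quantization of a weight matrix, with the induced perturbation recovered as the difference between quantized and original weights. The quantizer below is a generic textbook scheme assumed for illustration, not the paper's non-uniform method.

    import numpy as np

    def uniform_quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
        # Symmetric per-tensor uniform quantization: round weights to a fixed grid.
        qmax = 2 ** (n_bits - 1) - 1
        scale = np.abs(w).max() / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        return q * scale

    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)
    w_q = uniform_quantize(w, n_bits=4)

    # The "lens of perturbation": quantization is equivalent to adding this noise to w.
    perturbation = w_q - w
    print("max |perturbation| =", np.abs(perturbation).max())

Studying how such additive perturbations affect downstream accuracy is the kind of experiment the abstract describes, decoupled from any particular quantizer.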



Paperid:2005
Authors:Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, K J Joseph, Balaji Vasan Srinivasan
Adobe Research, Adobe Research, Adobe Research, Adobe Research, Adobe Research
Abstract:
Recent advances in multimodal learning have resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. Recently, their generalization ability has been further extended by incorporating trainable prompts, borrowed from the natural language processing literature. While such prompt learning techniques have shown impressive results, we identify that these prompts are trained on global image features, which limits them in two aspects: First, by using global features, these prompts may focus less on the discriminative foreground of the image, resulting in poor generalization to various out-of-distribution test cases. Second, existing work weights all prompts equally, whereas intuitively, prompts should be reweighted according to the semantics of the image. We address these issues as part of our proposed Contextual Prompt Learning (CoPL) framework, which is capable of aligning the prompts to the localized features of the image. Our key innovations over earlier works include using local image features as part of the prompt learning process, and more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand. This gives us dynamic prompts that are both aligned to local image features and aware of local contextual relationships. Our extensive set of experiments on a variety of standard and few-shot datasets shows that our method produces substantially improved performance when compared to current state-of-the-art methods. We also demonstrate both few-shot and out-of-distribution performance to establish the utility of learning dynamic prompts that are aligned to local image features.



Paperid:2006
Authors:Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu, Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei Feng, Yanghua Xiao
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, School of Information Science and Engineering, East China University of Science and Technology, Xiaohongshu Inc, Xiaohongshu Inc, School of Data Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, China Fudan-Aishu Cognitive Intelligence Joint Research Center
Abstract:
New Natural Language Processing (NLP) benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises 249,587 multiple-choice questions spanning 516 diverse disciplines across 13 subjects, accompanied by Xiezhi-Specialty with 14,041 questions and Xiezhi-Interdiscipline with 10,746 questions. We evaluate 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed the average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. All the evaluation code and data are open-sourced at https://github.com/MikeGu721/XiezhiBenchmark



Paperid:2007
Authors:Zihui Gu, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Chengzhong Xu, Ju Fan
Renmin University of China Tencent Inc., Tencent Inc. University of Macau, Tencent Inc., Tencent Inc., University of Macau, Renmin University of China
Abstract:
Instruction-following is particularly crucial for large language models (LLMs) to support diverse user requests. While existing work has made progress in aligning LLMs with human preferences, evaluating their capabilities on instruction-following remains a challenge due to the complexity and diversity of real-world user instructions. Existing evaluation methods focus on general skills but suffer from two main shortcomings, i.e., a lack of fine-grained task-level evaluation and reliance on a singular instruction expression. To address these problems, this paper introduces DINGO, a fine-grained and diverse instruction-following evaluation dataset that has two main advantages: (1) DINGO is based on a manually annotated, fine-grained and multi-level category tree with 130 nodes derived from real-world user requests; (2) DINGO includes diverse instructions generated by both GPT-4 and human experts. Through extensive experiments, we demonstrate that DINGO can not only provide more challenging and comprehensive evaluation for LLMs, but also provide task-level fine-grained directions to further improve LLMs.



Paperid:2008
Authors:Wenhao Guan, Yishuang Li, Tao Li, Hukai Huang, Feng Wang, Jiayan Lin, Lingyan Huang, Lin Li, Qingyang Hong
Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University, Xiamen University
Abstract:
The style transfer task in Text-to-Speech (TTS) refers to the process of transferring style information into text content to generate corresponding speech with a specific style. However, most existing style transfer approaches are either based on fixed emotional labels or reference speech clips, which cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible multi-modal and style controllable TTS framework named MM-TTS. It can utilize any modality as the prompt in a unified multi-modal prompt space, including reference speech, emotional facial images, and text descriptions, to control the style of the generated speech within a single system. The challenges of modeling such a multi-modal style controllable TTS mainly lie in two aspects: 1) aligning the multi-modal information into a unified style space to enable the input of arbitrary modalities as the style prompt in a single system, and 2) efficiently transferring the unified style representation into the given text content, thereby empowering the ability to generate prompt-style-related voice. To address these problems, we propose an aligned multi-modal prompt encoder that embeds different modalities into a unified style space, supporting style transfer for different modalities. Additionally, we present a new adaptive style transfer method named Style Adaptive Convolutions (SAConv) to achieve a better style representation. Furthermore, we design a Rectified Flow based Refiner to solve the problem of over-smoothed Mel-spectrograms and generate audio of higher fidelity. Since there is no public dataset for multi-modal TTS, we construct a dataset named MEAD-TTS, which is related to the field of expressive talking heads. Our experiments on the MEAD-TTS dataset and out-of-domain datasets demonstrate that MM-TTS can achieve satisfactory results based on multi-modal prompts. The audio samples and constructed dataset are available at https://multimodal-tts.github.io.



Paperid:2009
Authors:Xinyan Guan, Yanjiang Liu, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China State Key Laboratory of Computer Science Institute of Software, Chinese Academy of Sciences, Beijing, China, Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China
Abstract:
Incorporating factual knowledge from knowledge graphs is regarded as a promising approach for mitigating the hallucination of large language models (LLMs). Existing methods usually only use the user's input to query the knowledge graph, thus failing to address factual hallucinations generated by LLMs during their reasoning process. To address this problem, this paper proposes Knowledge Graph-based Retrofitting (KGR), a new framework that incorporates LLMs with KGs to mitigate factual hallucination during the reasoning process by retrofitting the initial draft responses of LLMs based on the factual knowledge stored in KGs. Specifically, KGR leverages LLMs to extract, select, validate, and retrofit factual statements within the model-generated responses, which enables an autonomous knowledge verifying and refining procedure without any additional manual effort. Experiments show that KGR can significantly improve the performance of LLMs on factual QA benchmarks, especially when complex reasoning processes are involved, which demonstrates the necessity and effectiveness of KGR in mitigating hallucination and enhancing the reliability of LLMs.



Paperid:2010
Authors:Anisha Gunjal, Jihan Yin, Erhan Bas
Scale AI, Scale AI, Scale AI
Abstract:
Instruction-tuned Large Vision Language Models (LVLMs) have significantly advanced in generalizing across a diverse set of multimodal tasks, especially for Visual Question Answering (VQA). However, generating detailed responses that are visually grounded is still a challenging task for these models. We find that even the current state-of-the-art LVLMs (InstructBLIP) still produce a staggering 30 percent of hallucinatory text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a Multimodal Hallucination Detection Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained annotations on VQA examples, making it the first comprehensive multi-modal hallucination detection dataset for detailed image descriptions. Unlike previous work that only considers object hallucination, we additionally annotate both entity descriptions and relationships that are unfaithful. To demonstrate the potential of this dataset for hallucination prevention, we optimize InstructBLIP through our novel Fine-grained Direct Preference Optimization (FDPO). We also train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling (RS). We perform human evaluation on both FDPO and rejection sampling, and find that they reduce hallucination rates in InstructBLIP by 41% and 55% respectively. We also find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively, and has a strong correlation with human-evaluated accuracy scores. The dataset is available at https://github.com/hendryx-scale/mhal-detect.
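
For the best-of-n rejection sampling evaluation mentioned above, a minimal sketch is shown below: sample n candidates and keep the one the reward model scores highest. The `generate` and `reward_model` callables are hypothetical stand-ins for an LVLM and a fine-grained reward model, not the paper's actual models.

    import random

    def best_of_n(prompt, generate, reward_model, n=8):
        # Sample n candidate responses and keep the one with the highest reward.
        candidates = [generate(prompt) for _ in range(n)]
        scores = [reward_model(prompt, c) for c in candidates]
        best_idx = max(range(n), key=lambda i: scores[i])
        return candidates[best_idx]

    # Toy usage with stand-in callables.
    random.seed(0)
    gen = lambda p: p + " answer " + str(random.randint(0, 9))
    reward = lambda p, c: -abs(int(c.split()[-1]) - 7)   # prefers answers near 7
    print(best_of_n("Q:", gen, reward, n=4))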



Paperid:2011
Authors:Haoqiang Guo, Sendong Zhao, Haochun Wang, Yanrui Du, Bing Qin
Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology
Abstract:
Deep learning is now widely used in drug discovery, providing significant acceleration and cost reduction. As the most fundamental building block, molecular representation is essential for predicting molecular properties to enable various downstream applications. Most existing methods attempt to incorporate more information to learn better representations. However, not all features are equally important for a specific task. Ignoring this would potentially compromise training efficiency and predictive accuracy. To address this issue, we propose a novel approach, which treats language models as an agent and molecular pretraining models as a knowledge base. The agent accentuates task-relevant features in the molecular representation by understanding the natural language description of the task, just as a tailor customizes clothes for clients. Thus, we call this approach MolTailor. Evaluations demonstrate MolTailor's superior performance over baselines, validating the efficacy of enhancing relevance for molecular representation learning. This illustrates the potential of language-model-guided optimization to better exploit and unleash the capabilities of existing powerful molecular representation methods. Our code and appendix are available at https://github.com/SCIR-HI/MolTailor.



Paperid:2012
Authors:Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang
Bejing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Toshiba China R&D Center, Beijing, China University of Chinese Academy of Sciences, Beijing, China, Toshiba China R&D Center, Beijing, China, Toshiba China R&D Center, Beijing, China, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, Bejing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Abstract:
Text-based audio generation models have limitations, as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions, including content (timestamp) and style (pitch contour and energy contour), as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation.



Paperid:2013
Authors:Haixia Han, Jiaqing Liang, Jie Shi, Qianyu He, Yanghua Xiao
Shanghai Institute of AI for Education and School of Computer Science and Technology, East China Normal University, School of Data Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai Institute of AI for Education and School of Computer Science and Technology, East China Normal University Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
Abstract:
Generative Language Models (LMs) such as ChatGPT have exhibited remarkable performance across various downstream tasks. Nevertheless, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. Previous studies have devised sophisticated pipelines and prompts to induce large LMs to exhibit the capability for self-correction. However, large LMs are explicitly prompted to verify and modify their answers separately rather than completing all steps spontaneously like humans. Moreover, these complex prompts are extremely challenging for small LMs to follow. In this paper, we introduce Intrinsic Self-Correction (ISC) in generative language models, aiming to correct the initial output of LMs in a self-triggered manner, even for small LMs with 6 billion parameters. Specifically, we devise a pipeline for constructing self-correction data and propose Partial Answer Masking (PAM), aiming to endow the model with the capability for intrinsic self-correction through fine-tuning. We conduct experiments using LMs with parameter sizes ranging from 6 billion to 13 billion on two tasks, including commonsense reasoning and factual knowledge reasoning. Our experiments demonstrate that the outputs generated using ISC outperform those generated without self-correction. We believe that the output quality of even small LMs can be further improved by empowering them with the ability to intrinsically self-correct.



Paperid:2014
Authors:Jie Han, Yixiong Zou, Haozhao Wang, Jun Wang, Wei Liu, Yao Wu, Tao Zhang, Ruixuan Li
Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology, iWudao Tech, Huazhong University of Science and Technology, Banma Network Technology, Banma Network Technology, Huazhong University of Science and Technology
Abstract:
Few-shot intent classification and slot filling are important but challenging tasks due to the scarcity of finely labeled data. Therefore, current works first train a model on source domains with sufficiently labeled data, and then transfer the model to target domains where only scarce labeled data is available. However, transferring experience as a whole usually suffers from gaps that exist between source domains and target domains. For instance, transferring domain-specific-knowledge-related experience is difficult. To tackle this problem, we propose a new method that explicitly decouples the transfer of general-semantic-representation-related experience from domain-specific-knowledge-related experience. Specifically, for domain-specific-knowledge-related experience, we design two modules to capture the intent-slot relation and the slot-slot relation, respectively. Extensive experiments on the Snips and FewJoint datasets show that our method achieves state-of-the-art performance. The method improves the joint accuracy metric from 27.72% to 42.20% in the 1-shot setting, and from 46.54% to 60.79% in the 5-shot setting.



Paperid:2015
Authors:Liqi He, Zuchao Li, Xiantao Cai, Ping Wang
Wuhan University, Wuhan University, Wuhan University, Wuhan University
Abstract:
Chain-of-thought (CoT) reasoning has exhibited impressive performance in language models for solving complex tasks and answering questions. However, many real-world questions require multi-modal information, such as text and images. Previous research on multi-modal CoT has primarily focused on extracting fixed image features from off-the-shelf vision models and then fusing them with text using attention mechanisms. This approach has limitations because these vision models were not designed for complex reasoning tasks and do not align well with language thoughts. To overcome this limitation, we introduce a novel approach for multi-modal CoT reasoning that utilizes latent space learning via diffusion processes to generate effective image features that align with language thoughts. Our method fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT. We demonstrate the efficacy of our proposed method on multi-modal ScienceQA and machine translation benchmarks, achieving state-of-the-art performance on ScienceQA. Overall, our approach offers a more robust and effective solution for multi-modal reasoning in language models, enhancing their ability to tackle complex real-world problems.



Paperid:2016
Authors:Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Jiaqing Liang, Yanghua Xiao
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, School of Data Science, Fudan University, School of Data Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, School of Data Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University Fudan-Aishu Cognitive Intelligence Joint Research Center, Shanghai, China
Abstract:
Large language models (LLMs) can understand human instructions, showing their potential for pragmatic applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that require multiple tasks and constraints, or complex input that contains long context, noise, heterogeneous information and a multi-turn format. Due to these features, LLMs often ignore semantic constraints from task descriptions, generate incorrect formats, violate length or sample count constraints, and are unfaithful to the input text. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions, as they are close-ended and simple. To bridge this gap, we propose CELLO, a benchmark for systematically evaluating LLMs' ability to follow complex instructions. We design eight features for complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four criteria and develop corresponding metrics, as current ones are inadequate, biased, or too strict and coarse-grained. We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO.



Paperid:2017
Authors:Xingwei He, Qianru Zhang, A-Long Jin, Jun Ma, Yuan Yuan, Siu Ming Yiu
The University of Hong Kong, Hong Kong, China, The University of Hong Kong, Hong Kong, China, The University of Hong Kong, Hong Kong, China, The University of Hong Kong, Hong Kong, China, School of Computer Science and Engineering, Beihang University, Beijing, China State Key Laboratory of Software, Development Environment Zhongguancun Laboratory, The University of Hong Kong, Hong Kong, China
Abstract:
Factual error correction (FEC) aims to revise factual errors in false claims with minimal editing, making them faithful to the provided evidence. This task is crucial for alleviating the hallucination problem encountered by large language models. Given the lack of paired data (i.e., false claims and their corresponding correct claims), existing methods typically adopt the ‘mask-then-correct’ paradigm. This paradigm relies solely on unpaired false claims and correct claims, and is thus referred to as distantly supervised. These methods require a masker to explicitly identify factual errors within false claims before revising them with a corrector. However, the absence of paired data to train the masker makes accurately pinpointing factual errors within claims challenging. To mitigate this, we propose to improve FEC by Learning to Inject Factual Errors (LIFE), a three-step distantly supervised method: ‘mask-corrupt-correct’. Specifically, we first train a corruptor using the ‘mask-then-corrupt’ procedure, allowing it to deliberately introduce factual errors into correct text. The corruptor is then applied to correct claims, generating a substantial amount of paired data. After that, we filter out low-quality data, and use the remaining data to train a corrector. Notably, our corrector does not require a masker, thus circumventing the bottleneck associated with explicit factual error identification. Our experiments on a public dataset verify the effectiveness of LIFE in two key aspects: First, it outperforms the previous best-performing distantly supervised method by a notable margin of 10.59 points in SARI Final (a 19.3% improvement). Second, even compared to ChatGPT prompted with in-context examples, LIFE achieves a superiority of 7.16 points in SARI Final.



Paperid:2018
Authors:Xinyi He, Mengyu Zhou, Xinrun Xu, Xiaojun Ma, Rui Ding, Lun Du, Yan Gao, Ran Jia, Xu Chen, Shi Han, Zejian Yuan, Dongmei Zhang
Xi'an Jiaotong University, Microsoft, Institute of Software Chinese Academy of Science, Microsoft, Microsoft, Microsoft, Microsoft, Microsoft, Microsoft, Microsoft, Xi'an Jiaotong University, Microsoft
Abstract:
Tabular data analysis is crucial in various fields, and large language models show promise in this area. However, current research mostly focuses on rudimentary tasks like Text2SQL and TableQA, neglecting advanced analysis like forecasting and chart generation. To address this gap, we developed the Text2Analysis benchmark, incorporating advanced analysis tasks that go beyond SQL-compatible operations and require more in-depth analysis. We also develop five innovative and effective annotation methods, harnessing the capabilities of large language models to enhance data quality and quantity. Additionally, we include unclear queries that resemble real-world user questions to test how well models can understand and tackle such challenges. Finally, we collect 2,249 query-result pairs with 347 tables. We evaluate five state-of-the-art models using three different metrics, and the results show that our benchmark introduces considerable challenges in the field of tabular data analysis, paving the way for more advanced research opportunities.



Paperid:2019
Authors:Zachary Horvitz, Ajay Patel, Chris Callison-Burch, Zhou Yu, Kathleen McKeown
Columbia University, University of Pennsylvania, University of Pennsylvania, Columbia University, Columbia University
Abstract:
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. Target "styles" can be defined in numerous ways, ranging from single attributes (e.g., formality) to authorship (e.g., Shakespeare). Previous unsupervised style-transfer approaches generally rely on significant amounts of labeled data for only a fixed set of styles or require large language models. In contrast, we introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles at inference time. Our parameter-efficient approach, ParaGuide, leverages paraphrase-conditioned diffusion models alongside gradient-based guidance from both off-the-shelf classifiers and strong existing style embedders to transform the style of text while preserving semantic information. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.



Paperid:2020
Authors:Jia Cheng Hu, Roberto Cavicchioli, Giulia Berardinelli, Alessandro Capotondi
University of Modena and Reggio Emilia, University of Modena and Reggio Emilia, University of Modena and Reggio Emilia, University of Modena and Reggio Emilia
Abstract:
The deployment of Pre-trained Language Models in memory-limited devices is hindered by their massive number of parameters, which has motivated interest in developing smaller architectures. Established works in the model compression literature have showcased that small models often present noticeable performance degradation and need to be paired with transfer learning methods, such as Knowledge Distillation. In this work, we propose a parameter-sharing method that consists of sharing parameters between the embeddings and the hidden layers, enabling the design of near-zero-parameter encoders. To demonstrate its effectiveness, we present an architecture design called ShareBERT, which can preserve up to 95.5% of BERT Base performance, using only 5M parameters (21.9× fewer parameters) without the help of Knowledge Distillation. We demonstrate empirically that our proposal does not negatively affect the model's learning capabilities and that it is even beneficial for representation learning. Code will be available at https://github.com/jchenghu/sharebert.
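
In the spirit of sharing parameters between embeddings and hidden layers, the toy PyTorch layer below reuses the embedding matrix as its only projection, so the layer adds almost no parameters of its own. This is a sketch under the assumption of the simplest possible sharing scheme and is not ShareBERT's actual architecture.

    import torch
    import torch.nn as nn

    class SharedEmbeddingEncoderLayer(nn.Module):
        # Toy encoder layer whose feed-forward projection reuses the embedding
        # weights, so it contributes (almost) no parameters of its own.
        def __init__(self, embedding: nn.Embedding):
            super().__init__()
            self.embedding = embedding            # shared, not copied
            self.norm = nn.LayerNorm(embedding.embedding_dim)

        def forward(self, x):
            w = self.embedding.weight              # (vocab, dim), shared parameters
            h = x @ w.t()                          # project hidden states onto vocab space
            h = torch.relu(h) @ w                  # and back, reusing the same matrix
            return self.norm(x + h / w.shape[0])   # residual with a crude scaling

    vocab, dim = 1000, 64
    emb = nn.Embedding(vocab, dim)
    layer = SharedEmbeddingEncoderLayer(emb)
    tokens = torch.randint(0, vocab, (2, 16))
    print(layer(emb(tokens)).shape)  # torch.Size([2, 16, 64])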



Paperid:2021
Authors:Linmei Hu, Hongyu He, Duokang Wang, Ziwang Zhao, Yingxia Shao, Liqiang Nie
Beijing Institute of Technology, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Harbin Institute of Technology (Shenzhen)
Abstract:
Personality detection aims to detect one's personality traits underlying social media posts. One challenge of this task is the scarcity of ground-truth personality traits, which are collected from self-report questionnaires. Most existing methods learn post features directly by fine-tuning pre-trained language models under the supervision of limited personality labels. This leads to inferior quality of post features and consequently affects the performance. In addition, they treat personality traits as one-hot classification labels, overlooking the semantic information within them. In this paper, we propose a large language model (LLM) based text-augmentation-enhanced personality detection model, which distills the LLM's knowledge to enhance a small model for personality detection, even when the LLM fails at this task. Specifically, we enable the LLM to generate post analyses (augmentations) from the semantic, sentiment, and linguistic aspects, which are critical for personality detection. By using contrastive learning to pull them together in the embedding space, the post encoder can better capture the psycho-linguistic information within the post representations, thus improving personality detection. Furthermore, we utilize the LLM to enrich the information of personality labels to enhance detection performance. Experimental results on the benchmark datasets demonstrate that our model outperforms the state-of-the-art methods on personality detection.
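
The pull-together objective described above can be illustrated with a standard InfoNCE-style contrastive loss between post embeddings and the embeddings of their LLM-generated analyses; this formulation and the temperature value are assumptions for the sketch, not necessarily the paper's exact loss.

    import torch
    import torch.nn.functional as F

    def pull_together_loss(post_emb, aug_emb, temperature=0.07):
        # Each post is pulled toward its own LLM-generated analysis (semantic,
        # sentiment, or linguistic augmentation) and pushed away from the others.
        post_emb = F.normalize(post_emb, dim=-1)
        aug_emb = F.normalize(aug_emb, dim=-1)
        logits = post_emb @ aug_emb.t() / temperature     # (B, B) similarity matrix
        targets = torch.arange(post_emb.size(0))          # positives on the diagonal
        return F.cross_entropy(logits, targets)

    loss = pull_together_loss(torch.randn(16, 128), torch.randn(16, 128))
    print(loss.item())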



Paperid:2022
Authors:Shuaibo Hu, Kui Yu
School of Computer and Information, Hefei University of Technology, School of Computer and Information, Hefei University of Technology
Abstract:
Selective rationalization can be regarded as a straightforward self-explaining approach for enhancing model explainability in natural language processing tasks. It aims to provide explanations that are more accessible and understandable to non-technical users by first selecting subsets of input texts as rationales and then predicting based on the chosen subsets. However, existing methods that follow this select-then-predict framework may suffer from the rationalization degeneration problem, resulting in sub-optimal or unsatisfactory rationales that do not align with human judgments. This problem may further lead to rationalization failure, resulting in meaningless rationales that ultimately undermine people's trust in the rationalization model. To address these challenges, we propose a Guidance-based Rationalization method (G-RAT) that effectively improves robustness against failure situations and the quality of rationales by using a guidance module to regularize selections and distributions. Experimental results in two synthetic settings show that our method is robust to the rationalization degeneration and failure problems, while results on two real datasets show its effectiveness in providing rationales in line with human judgments. The source code is available at https://github.com/shuaibo919/g-rat.



Paperid:2023
Authors:Xinshuo Hu, Dongfang Li, Baotian Hu, Zihao Zheng, Zhenyu Liu, Min Zhang
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen
Abstract:
Large language models (LLMs) have been widely used in various applications but are known to suffer from issues related to untruthfulness and toxicity. While parameter-efficient modules (PEMs) have demonstrated their effectiveness in equipping models with new skills, leveraging PEMs for deficiency unlearning remains underexplored. In this work, we propose a PEM operation approach, namely Extraction-before-Subtraction (Ext-Sub), to enhance the truthfulness and detoxification of LLMs through the integration of an "expert" PEM and an "anti-expert" PEM. Remarkably, even anti-expert PEMs possess valuable capabilities due to their proficiency in generating fabricated content, which necessitates language modeling and logical narrative competence. Rather than merely negating the parameters, our approach involves extracting and eliminating solely the deficiency capability within the anti-expert PEM while preserving the general capabilities. To evaluate the effectiveness of our approach in terms of truthfulness and detoxification, we conduct extensive experiments on LLMs, encompassing additional abilities such as language modeling and mathematical reasoning. Our empirical results demonstrate that our approach effectively improves truthfulness and detoxification, while largely preserving the fundamental abilities of LLMs.
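
For context, the sketch below shows the naive PEM arithmetic baseline that the abstract contrasts itself with: adding an "expert" delta and directly negating an "anti-expert" delta over shared parameter names. The paper's Ext-Sub operation instead extracts only the deficiency component before subtraction; that extraction step is not reproduced here, and the state-dict layout is an assumption.

    import torch

    def subtract_anti_expert(base_state, expert_state, anti_expert_state, alpha=1.0):
        # Naive baseline: add the expert delta and subtract the whole anti-expert delta.
        # Ext-Sub would first extract only the deficiency part of the anti-expert delta.
        merged = {}
        for name, base in base_state.items():
            expert_delta = expert_state[name] - base
            anti_delta = anti_expert_state[name] - base
            merged[name] = base + expert_delta - alpha * anti_delta
        return merged

    # Toy usage with random "checkpoints" sharing one parameter tensor.
    base = {"lora.weight": torch.zeros(4, 4)}
    expert = {"lora.weight": torch.randn(4, 4)}
    anti = {"lora.weight": torch.randn(4, 4)}
    print(subtract_anti_expert(base, expert, anti)["lora.weight"].shape)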



Paperid:2024
Authors:Xuming Hu, Zhaochen Hong, Yong Jiang, Zhichao Lin, Xiaobin Wang, Pengjun Xie, Philip S. Yu
AI Thrust, Hong Kong University of Science and Technology (Guangzhou) School of Software, Tsinghua University, School of Software, Tsinghua University, DAMO Academy, Alibaba Group, DAMO Academy, Alibaba Group, DAMO Academy, Alibaba Group, DAMO Academy, Alibaba Group, Department of Computer Science, University of Illinois Chicago
Abstract:
Cross-domain named entity recognition (NER) tasks encourage NER models to transfer knowledge from data-rich source domains to sparsely labeled target domains. Previous works adopt the paradigm of pre-training on the source domain followed by fine-tuning on the target domain. However, these works ignore that general labeled NER source-domain data can be easily retrieved in the real world, and soliciting more source domains could bring more benefits. Unfortunately, previous paradigms cannot efficiently transfer knowledge from multiple source domains. In this work, to transfer multiple source domains' knowledge, we decouple the NER task into the pipeline tasks of mention detection and entity typing, where mention detection unifies the training objective across domains, thus providing entity typing with higher-quality entity mentions. Additionally, we request multiple general source-domain models to explicitly suggest potential named entities for sentences in the target domain, and transfer their knowledge to the target-domain models implicitly through knowledge progressive networks. Furthermore, we propose two methods to analyze in which source domain knowledge transfer occurs, thus helping us judge which source domain brings the greatest benefit. In our experiments, we develop a Chinese cross-domain NER dataset. Our model improves the F1 score by an average of 12.50% across 8 Chinese and English datasets compared to models without source-domain data.



Paperid:2025
Authors:Yuxue Hu, Junsong Li, Mingmin Wu, Zhongqiang Huang, Gang Chen, Ying Sha
College of Informatics, Huazhong Agricultural University, Wuhan, China. Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China, College of Informatics, Huazhong Agricultural University, Wuhan, China., College of Informatics, Huazhong Agricultural University, Wuhan, China., College of Informatics, Huazhong Agricultural University, Wuhan, China., Jointown Healthcare Technology Group, Wuhan, China., College of Informatics, Huazhong Agricultural University, Wuhan, China. Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China
Abstract:
Euphemisms are commonly used on social media and darknet marketplaces to evade platform regulations by masking their true meanings with innocent ones. For instance, “weed” is used instead of “marijuana” for illicit transactions. Thus, euphemism identification, i.e., mapping a given euphemism (“weed”) to its specific target word (“marijuana”), is essential for improving content moderation and combating underground markets. Existing methods employ self-supervised schemes to automatically construct labeled training datasets for euphemism identification. However, they overlook the text-text domain gap caused by the discrepancy between the constructed training data and the test data, leading to performance deterioration. In this paper, we present the text-text domain gap and explain how it forms in terms of the data distribution and the cone effect. Moreover, to bridge this gap, we introduce a feature alignment network (FA-Net), which can align both in-domain and cross-domain features, thus mitigating the domain gap from training data to test data and improving the performance of the base models for euphemism identification. We apply this FA-Net to the base models, obtaining markedly better results and creating a state-of-the-art model that beats the large language models.



Paperid:2026
Authors:Zhiyuan Hu, Chumin Liu, Yue Feng, Anh Tuan Luu, Bryan Hooi
National University of Singapore, Nanyang Technological University, University College London, Nanyang Technological University, National University of Singapore
Abstract:
Controllable text generation is a challenging and meaningful field in natural language generation (NLG). In particular, poetry generation is a typical task with well-defined and strict conditions for text generation, making it an ideal playground for assessing current methodologies. While prior works succeeded in controlling either the semantic or the metrical aspects of poetry generation, simultaneously addressing both remains a challenge. In this paper, we pioneer the use of the Diffusion model for generating sonnets and Chinese SongCi poetry to tackle such challenges. In terms of semantics, our PoetryDiffusion model, built upon the Diffusion model, generates entire sentences or poems by comprehensively considering the entirety of sentence information. This approach enhances semantic expression, distinguishing it from autoregressive and large language models (LLMs). For metrical control, its constraint control module, which can be trained individually, enables us to flexibly incorporate a novel metrical controller to manipulate and evaluate metrics (format and rhythm). The denoising process in PoetryDiffusion allows for the gradual enhancement of semantics and the flexible integration of the metrical controller, which can calculate and impose penalties on states that stray significantly from the target control distribution. Experimental results on two datasets demonstrate that our model outperforms existing models in automatic evaluation of semantic, metrical, and overall performance, as well as in human evaluation. Code is released at https://github.com/ChorlingLau/PoetryDiffusion.



Paperid:2027
Authors:Chen Huang, Peixin Qin, Wenqiang Lei, Jiancheng Lv
Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence, Sichuan University Engineering Research Center of Machine Learning and Industry Intelligence
Abstract:
One of the key factors in language productivity and human cognition is the ability of systematic compositionality, which refers to understanding composed, unseen examples of seen primitives. However, recent evidence reveals that Transformers have difficulty generalizing to composed contexts based on seen primitives. To this end, we take the first step and propose a compositionality-aware Transformer called CAT, together with two novel pre-training tasks to facilitate systematic compositionality. We tentatively provide a successful implementation of a multi-layer CAT on the basis of the widely used BERT. The experimental results demonstrate that CAT outperforms baselines on compositionality-aware tasks with minimal impact on effectiveness on standardized language understanding tasks.



Paperid:2028
Authors:Hailang Huang, Zhijie Nie, Ziqiao Wang, Ziyu Shang
Beihang University, Beihang University, University of Ottawa, Southeast University
Abstract:
Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages the power of uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method can consistently improve the performance of image-text retrieval and achieve new state-of-the-art results. Furthermore, our method can also boost the uni-modal retrieval performance of image-text retrieval models, enabling them to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
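
One way to picture the soft-label supervision described above is a KL term that pushes the student's image-to-text similarity distribution toward soft targets computed by a frozen uni-modal teacher; the formulation and temperatures below are illustrative assumptions rather than CUSA's exact losses.

    import torch
    import torch.nn.functional as F

    def soft_label_alignment_loss(img_emb, txt_emb, teacher_txt_emb,
                                  tau_student=0.05, tau_teacher=0.05):
        # KL between the student's image->text similarity distribution and soft
        # labels derived from a frozen uni-modal text encoder, so that texts similar
        # to the matched caption are not treated as pure negatives.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        teacher_txt_emb = F.normalize(teacher_txt_emb, dim=-1)

        student_logits = img_emb @ txt_emb.t() / tau_student            # (B, B)
        with torch.no_grad():
            teacher_sim = teacher_txt_emb @ teacher_txt_emb.t() / tau_teacher
            soft_targets = teacher_sim.softmax(dim=-1)                  # soft, not one-hot

        return F.kl_div(student_logits.log_softmax(dim=-1), soft_targets,
                        reduction="batchmean")

    # Toy usage
    B, d = 8, 32
    loss = soft_label_alignment_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, d))
    print(loss.item())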



Paperid:2029
Authors:Jianheng Huang, Ante Wang, Linfeng Gao, Linfeng Song, Jinsong Su
School of Informatics, Xiamen University, China Shanghai Artificial Intelligence Laboratory, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, School of Informatics, Xiamen University, China Shanghai Artificial Intelligence Laboratory, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, School of Informatics, Xiamen University, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Tencent AI Lab, School of Informatics, Xiamen University, China Shanghai Artificial Intelligence Laboratory, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
Abstract:
Leveraging vast and continually updated knowledge from the Internet has been considered an important ability for a dialogue system. Therefore, the dialogue query generation task is proposed for generating search queries from dialogue histories, which are submitted to a search engine for retrieving relevant websites on the Internet. In this regard, previous efforts were devoted to collecting conversations with annotated queries and training a query producer (QP) via standard supervised learning. However, these studies still face the challenges of data scarcity and domain adaptation. To address these issues, in this paper, we propose a semi-supervised learning framework, SemiDQG, to improve model performance with unlabeled conversations. Based on the observation that the search query is typically related to the topic of the dialogue response, we train a response-augmented query producer (RA) to provide rich and effective training signals for QP. We first apply a similarity-based query selection strategy to select high-quality RA-generated pseudo queries, which are used to construct pseudo instances for training QP and RA. Then, we adopt the REINFORCE algorithm to further enhance QP, with RA-provided rewards as fine-grained training signals. Experimental results and in-depth analysis on three benchmarks show the effectiveness of our framework in cross-domain and low-resource scenarios. In particular, SemiDQG significantly surpasses ChatGPT and competitive baselines. Our code is available at https://github.com/DeepLearnXMU/SemiDQG.



Paperid:2030
Authors:Jin Huang, Danfeng Yan, Yuanqiang Cai
Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Prompt-based methods have proven effective in improving the performance of pre-trained language models (PLMs) on sentence-level few-shot tasks. However, when applying prompting to token-level tasks such as Named Entity Recognition (NER), specific templates need to be designed, and all possible segments of the input text need to be enumerated. These methods have high computational complexity in both training and inference, making them difficult to apply in real-world scenarios. To address these issues, we redefine the NER task as a Machine Reading Comprehension (MRC) task and incorporate prompting into the MRC framework. Specifically, we sequentially insert boundary markers for various entity types into the templates and use these markers as anchors during inference to differentiate entity types. In contrast to the traditional multi-turn question-answering extraction in the MRC framework, our method extracts all spans of entity types in one round. Furthermore, we propose a word-based template and an example-based template that enhance the MRC framework's perception of entity start and end positions while significantly reducing the manual effort required for template design. It is worth noting that in cross-domain scenarios, PMRC does not require redesigning the model architecture and can continue training by simply replacing the templates to recognize entity types in the target domain. Experimental results demonstrate that our approach outperforms state-of-the-art models in low-resource settings, achieving an average performance improvement of +5.2% in settings where access to source domain data is limited. In particular, on the ATIS dataset with a large number of entity types and a 10-shot setting, PMRC achieves a performance improvement of +15.7%. Moreover, our method achieves a decoding speed 40.56 times faster than the template-based cloze-style approach.



Paperid:2031
Authors:Monika Jain, Raghava Mutharaju, Ramakanth Kavuluru, Kuldeep Singh
Indraprastha Institute of Information Technology, Delhi, India, Indraprastha Institute of Information Technology, Delhi, India, University of Kentucky, Lexington, Kentucky, United States, Cerence GmbH and Zerotha Research, Germany
Abstract:
Document-level relation extraction (DocRE) poses the challenge of identifying relationships between entities within a document. Existing approaches rely on logical reasoning or contextual cues from entities. This paper reframes document-level RE as link prediction over a Knowledge Graph (KG), with distinct benefits: 1) our approach amalgamates entity context and document-derived logical reasoning, enhancing link prediction quality; 2) predicted links between entities offer interpretability, elucidating the employed reasoning. We evaluate our approach on the benchmark datasets DocRED, ReDocRED, and DWIE. The results indicate that our proposed method outperforms the state-of-the-art models and suggest that incorporating context-based Knowledge Graph link prediction techniques can enhance the performance of document-level relation extraction models.



Paperid:2032
Authors:Yejin Jeon, Yunsu Kim, Gary Geunbae Lee
POSTECH, aiXplain, Inc., POSTECH
Abstract:
Zero-shot multi-speaker TTS aims to synthesize speech with the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations in adapting to new speakers in out-of-domain settings, primarily due to inadequate speaker disentanglement and content leakage. To overcome these constraints, we propose an innovative negation feature learning paradigm that models decoupled speaker attributes as deviations from the complete audio representation by utilizing the subtraction operation. By eliminating superfluous content information from the speaker representation, our negation scheme not only mitigates content leakage, thereby enhancing synthesis robustness, but also improves speaker fidelity. In addition, to facilitate the learning of diverse speaker attributes, we leverage multi-stream Transformers, which retain multiple hypotheses and instigate a training paradigm akin to ensemble learning. To unify these hypotheses and realize the final speaker representation, we employ attention pooling. Finally, in light of the imperative to generate target text utterances in the desired voice, we adopt adaptive layer normalizations to effectively fuse the previously generated speaker representation with the target text representations, as opposed to mere concatenation of the text and audio modalities. Extensive experiments and validations substantiate the efficacy of our proposed approach in preserving and harnessing speaker-specific attributes vis-à-vis alternative baseline models.



Paperid:2033
Authors:Bin Ji, Huijun Liu, Mingzhe Du, See-Kiong Ng
National University of Singapore, National University of Singapore, National University of Singapore, National University of Singapore
Abstract:
Previous studies disclose that Large Language Models (LLMs) suffer from hallucinations when generating texts, bringing a novel and challenging research topic to the public, which centers on enabling LLMs to generate texts with citations. Existing work exposes two limitations when using LLMs to generate answers to questions with provided documents: unsatisfactory answer correctness and poor citation quality. To tackle the above issues, we investigate using Chain-of-Thought (CoT) prompting to elicit LLMs’ ability to synthesize correct answers from multiple documents, as well as to properly cite these documents. Moreover, we propose a Citation Insurance Mechanism, which enables LLMs to detect and cite missing citations. We conduct experiments on the ALCE benchmark with six open-source LLMs. Experimental results demonstrate that: (1) the CoT prompting strategy significantly improves the quality of text generation with citations; (2) the Citation Insurance Mechanism delivers impressive gains in citation quality at a low cost; (3) our best approach performs comparably to the previous best ChatGPT-based baselines. Extensive analyses further validate the effectiveness of the proposed approach.



Paperid:2034
Authors:Mengzhao Jia, Can Xie, Liqiang Jing
Shandong University, Shandong University, University of Texas at Dallas
Abstract:
Despite commendable achievements made by existing work, prevailing multimodal sarcasm detection studies rely more on textual content than on visual information. This unavoidably induces spurious correlations between textual words and labels, thereby significantly hindering the models' generalization capability. To address this problem, we define the task of out-of-distribution (OOD) multimodal sarcasm detection, which aims to evaluate models' generalizability when the word distribution differs between training and testing settings. Moreover, we propose a novel debiasing multimodal sarcasm detection framework with contrastive learning, which aims to mitigate the harmful effect of biased textual factors for robust OOD generalization. In particular, we first design counterfactual data augmentation to construct positive samples with dissimilar word biases and negative samples with similar word biases. Subsequently, we devise an adapted debiasing contrastive learning mechanism to empower the model to learn robust task-relevant features and alleviate the adverse effect of biased words. Extensive experiments show the superiority of the proposed framework.



Paperid:2035
Authors:Shuoran Jiang, Qingcai Chen, Youcheng Pan, Yang Xiang, Yukang Lin, Xiangping Wu, Chuanyi Liu, Xiaobao Song
Harbin Institute of Technology (Shenzhen), Harbin Institute of Technology (Shenzhen) Peng Cheng Laboratory, Peng Cheng Laboratory, Peng Cheng Laboratory, Harbin Institute of Technology (Shenzhen), Harbin Institute of Technology (Shenzhen), Institute of Data Security, Harbin Institute of Technology (Shenzhen), Shenzhen, China Peng Cheng Laboratory, Institute of Data Security, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Abstract:
Lowering the memory requirement of full-parameter training on large models has become a hot research area. MeZO fine-tunes large language models (LLMs) using only forward passes in a zeroth-order SGD optimizer (ZO-SGD), demonstrating excellent performance with the same GPU memory usage as inference. However, the simulated perturbation stochastic approximation used for gradient estimation in MeZO leads to severe oscillations and incurs a substantial time overhead. Moreover, without momentum regularization, MeZO shows severe over-fitting problems. Lastly, perturbation-irrelevant momentum on ZO-SGD does not improve the convergence rate. This study proposes ZO-AdaMU to resolve the above problems by adapting the simulated perturbation with momentum in its stochastic approximation. Unlike existing adaptive momentum methods, we relocate the momentum onto the simulated perturbation in the stochastic gradient approximation. Our convergence analysis and experiments prove this is a better way to improve convergence stability and rate in ZO-SGD. Extensive experiments demonstrate that ZO-AdaMU yields better generalization for LLM fine-tuning across various NLP tasks than MeZO and its momentum variants.
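A minimal sketch of the zeroth-order estimate described above, using two forward passes with antithetic perturbations; folding momentum into the perturbation direction is only one illustrative reading of "momentum on simulated perturbation", and all names are hypothetical rather than taken from the ZO-AdaMU implementation.

import numpy as np

def zo_step(params, loss_fn, lr=5e-2, eps=1e-3, mu=0.9, momentum=None, rng=None):
    # Estimate the gradient from two forward passes (no backprop), then step.
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(params.shape)             # simulated perturbation
    momentum = np.zeros_like(params) if momentum is None else momentum
    momentum = mu * momentum + (1.0 - mu) * z         # assumed momentum placement
    loss_plus = loss_fn(params + eps * momentum)
    loss_minus = loss_fn(params - eps * momentum)
    grad_scale = (loss_plus - loss_minus) / (2.0 * eps)
    return params - lr * grad_scale * momentum, momentum

# Toy usage: minimize a quadratic with forward passes only.
loss = lambda w: float(np.sum((w - 1.0) ** 2))
w, m = np.full(4, 5.0), None
for _ in range(2000):
    w, m = zo_step(w, loss)
print(w)  # roughly approaches [1, 1, 1, 1]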



Paperid:2036
Authors:Renlong Jie, Xiaojun Meng, Xin Jiang, Qun Liu
Huawei Noah's Ark Lab Northwestern Polytechnical University, China, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab
Abstract:
Unsupervised extractive summarization is an important technique in information extraction and retrieval. Compared with supervised methods, it does not require high-quality human-labelled summaries for training and thus can be easily applied to documents of different types, domains, or languages. Most existing unsupervised methods, including TextRank and PACSUM, rely on graph-based ranking of sentence centrality. However, this scorer cannot be directly applied in end-to-end training, and a position-related prior assumption is often needed to achieve good summaries. In addition, less attention has been paid to length-controllable extraction, where users can decide to summarize texts under a particular length constraint. This paper introduces an unsupervised extractive summarization model based on a siamese network, for which we develop a trainable bidirectional prediction objective between the selected summary and the original document. Different from the centrality-based ranking methods, our extractive scorer can be trained in an end-to-end manner, without requiring a positional assumption. In addition, we introduce a differentiable length control module by approximating the 0-1 knapsack solver for end-to-end length-controllable extraction. Experiments show that our unsupervised method largely outperforms the centrality-based baseline using the same sentence encoder. In terms of length control ability, via our trainable knapsack module, our model consistently outperforms the strong baseline without utilizing end-to-end training. Human evaluation further evidences that our method performs best among baselines in terms of relevance and consistency.
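For intuition, a simple greedy 0-1 knapsack heuristic for length-controlled selection is sketched below: it picks sentences by score-per-token density until a token budget is exhausted. This is only an illustrative stand-in for the differentiable knapsack relaxation the paper describes; the scores, lengths, and budget are made-up inputs.

def select_sentences(scores, lengths, budget):
    # Sort sentence indices by salience density (score per token), then greedily
    # add sentences that still fit into the length budget.
    order = sorted(range(len(scores)),
                   key=lambda i: scores[i] / max(lengths[i], 1),
                   reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + lengths[i] <= budget:
            chosen.append(i)
            used += lengths[i]
    return sorted(chosen)  # restore document order

# Toy usage: four sentences with salience scores and token lengths, 30-token budget.
print(select_sentences([0.9, 0.2, 0.7, 0.4], [12, 8, 15, 10], budget=30))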



Paperid:2037
Authors:MinJun Kim, SeungWoo Song, YouHan Lee, Haneol Jang, KyungTae Lim
Hanbat National University, Hanbat National University, Kakao brain Corp, Hanbat National University, Seoul National University of Science and Technology
Abstract:
The current research direction in generative models, such as the recently developed GPT-4, aims to find relevant knowledge for multimodal and multilingual inputs to provide answers. Under these research circumstances, the demand for multilingual evaluation of visual question answering (VQA) tasks, a representative task of multimodal systems, has increased. Accordingly, we propose a bilingual outside-knowledge VQA (BOK-VQA) dataset in this study that can be extended to multilingual settings. The proposed data include 17K images, 17K question-answer pairs for both Korean and English, and 280K instances of knowledge information related to the question-answer content. We also present a framework that can effectively inject knowledge information into a VQA system by pretraining the knowledge information of BOK-VQA data in the form of graph embeddings. Finally, through in-depth analysis, we demonstrate the actual effect of the knowledge information contained in the constructed training data on VQA.



Paperid:2038
Authors:James R. Kirk, Robert E. Wray, Peter Lindes, John E. Laird
Center for Integrated Cognition, Center for Integrated Cognition, Center for Integrated Cognition, Center for Integrated Cognition
Abstract:
Large language models (LLMs) offer significant promise as a knowledge source for task learning. Prompt engineering has been shown to be effective for eliciting knowledge from an LLM, but alone it is insufficient for acquiring relevant, situationally grounded knowledge for an embodied agent learning novel tasks. We describe a cognitive-agent approach, STARS, that extends and complements prompt engineering, mitigating its limitations and thus enabling an agent to acquire new task knowledge matched to its native language capabilities, embodiment, environment, and user preferences. The STARS approach is to increase the response space of LLMs and deploy general strategies, embedded within the autonomous agent, to evaluate, repair, and select among the candidate responses produced by the LLM. We describe the approach and experiments showing how an agent, by retrieving and evaluating a breadth of responses from the LLM, can achieve 77-94% task completion in one-shot learning without user oversight. The approach achieves 100% task completion when human oversight (such as an indication of preference) is provided. Further, the type of oversight largely shifts from explicit, natural language instruction to simple confirmation/disconfirmation of high-quality responses that have been vetted by the agent before presentation to a user.



Paperid:2039
Authors:Fanshuang Kong, Richong Zhang, Ziqiao Wang, Yongyi Mao
SKLSDE, School of Computer Science and Engineering, Beihang University, Beijing, China, SKLSDE, School of Computer Science and Engineering, Beihang University, Beijing, China Zhongguancun Laboratory, Beijing, China, School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada, School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
Abstract:
To date, a backbone of methods for unsupervised domain adaptation (UDA) involves learning label-discriminative features via a label classifier and domain-invariant features through a domain discriminator in an adversarial scheme. However, these methods lack explicit control for aligning the source data and target data within the same label class, degrading the classifier's performance in the target domain. In this paper, we propose PL-Mix, a pseudo-label-guided Mixup method based on adversarial prompt tuning. Specifically, our PL-Mix facilitates class-dependent alignment and can alleviate the impact of noisy pseudo-labels. We then theoretically justify that PL-Mix can improve generalization for UDA. Extensive comparison experiments with existing models also demonstrate the effectiveness of PL-Mix.



Paperid:2040
Authors:Lingxing Kong, Jiuliang Wang, Zheng Ma, Qifeng Zhou, Jianbing Zhang, Liang He, Jiajun Chen
National Key Laboratory for Novel Software Technology, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University Institute for AI Industry Research (AIR), Tsinghua University, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University
Abstract:
Document-level relation extraction aims to extract entity relations that span multiple sentences. This task faces two critical issues: long dependency and mention selection. Prior works address these problems from the textual perspective; however, it is hard to handle them based on text information alone. In this paper, we leverage video information to provide additional evidence for understanding long dependencies and to offer a wider perspective for identifying relevant mentions, giving rise to a new task named Multimodal Document-level Relation Extraction (MDocRE). To tackle this new task, we construct a human-annotated dataset including documents and relevant videos, which, to the best of our knowledge, is the first document-level relation extraction dataset equipped with video clips. We also propose a hierarchical framework to learn interactions between different dependency levels and a textual-guided transformer architecture that incorporates both textual and video modalities. In addition, we utilize a mention gate module to address the mention-selection problem in both modalities. Experiments on our proposed dataset show that 1) incorporating video information greatly improves model performance; 2) our hierarchical framework achieves state-of-the-art results compared with both unimodal and multimodal baselines; 3) by collaborating with video information, our model better solves the long-dependency and mention-selection problems.



Paperid:2041
Authors:Taeyoon Kwon, Kai Tzu-iunn Ong, Dongjin Kang, Seungjun Moon, Jeong Ryong Lee, Dosik Hwang, Beomseok Sohn, Yongsik Sim, Dongha Lee, Jinyoung Yeo
Yonsei University, Yonsei University, Yonsei University, Yonsei University, Yonsei university, Yonsei University, Yonsei University, Yonsei University, Yonsei University, Yonsei University
Abstract:
Machine reasoning has made great progress in recent years owing to large language models (LLMs). In the clinical domain, however, most NLP-driven projects mainly focus on clinical classification or reading comprehension, and under-explore clinical reasoning for disease diagnosis due to the expensive rationale annotation with clinicians. In this work, we present a "reasoning-aware" diagnosis framework that rationalizes the diagnostic process via prompt-based learning in a time- and labor-efficient manner, and learns to reason over the prompt-generated rationales. Specifically, we address clinical reasoning for disease diagnosis, where the LLM generates diagnostic rationales providing its insight on the presented patient data and the reasoning path towards the diagnosis, namely Clinical Chain-of-Thought (Clinical CoT). We empirically demonstrate LLMs'/LMs' ability for clinical reasoning via extensive experiments and analyses on both rationale generation and disease diagnosis in various settings. We further propose a novel set of criteria for evaluating machine-generated rationales' potential for real-world clinical settings, facilitating and benefiting future research in this area.



Paperid:2042
Authors:An Lao, Qi Zhang, Chongyang Shi, Longbing Cao, Kun Yi, Liang Hu, Duoqian Miao
Beijing Institute of Technology, Tongji University DeepBlue Academy of Sciences, Beijing Institute of Technology, Macquarie University, Beijing Institute of Technology, Tongji University DeepBlue Academy of Sciences, Tongji University
Abstract:
Multimodal content, such as mixing text with images, presents significant challenges to rumor detection in social media. Existing multimodal rumor detection has focused on mixing tokens among spatial and sequential locations for unimodal representation or fusing clues of rumor veracity across modalities. However, these methods suffer from less discriminative unimodal representations and are vulnerable to intricate location dependencies in the time-consuming fusion of spatial and sequential tokens. This work makes the first attempt at multimodal rumor detection in the frequency domain, which efficiently transforms spatial features into the frequency spectrum and obtains highly discriminative spectrum features for multimodal representation and fusion. A novel Frequency Spectrum Representation and fUsion network (FSRU) with dual contrastive learning reveals that the frequency spectrum is more effective for multimodal representation and fusion, extracting the informative components for rumor detection. FSRU involves three novel mechanisms: a Fourier transform that converts features in the spatial domain to the frequency domain, a unimodal spectrum compression module, and a cross-modal spectrum co-selection module in the frequency domain. Substantial experiments show that FSRU achieves satisfactory multimodal rumor detection performance.
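The first of these mechanisms, mapping token features to the frequency domain, can be shown with a tiny sketch using a real FFT along the token axis and keeping amplitude features; this is a generic illustration of the transform step, not the FSRU architecture, and the shapes are arbitrary.

import torch

def to_spectrum(features):
    # features: (batch, tokens, dim) spatial/sequential token embeddings.
    spec = torch.fft.rfft(features, dim=1)   # complex spectrum along the token axis
    return torch.abs(spec)                   # amplitude spectrum, (batch, tokens//2+1, dim)

x = torch.randn(2, 16, 64)
print(to_spectrum(x).shape)                  # torch.Size([2, 9, 64])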



Paperid:2043
Authors:Khoi M. Le, Trinh Pham, Tho Quan, Anh Tuan Luu
VinAI Research, Vietnam Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City, Vietnam, Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City, Vietnam, Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City, Vietnam, Nanyang Technological University, Singapore
Abstract:
Paraphrases are texts that convey the same meaning while using different words or sentence structures. They can be used as an automatic data augmentation tool for many Natural Language Processing tasks, especially when dealing with low-resource languages, where data shortage is a significant problem. To generate a paraphrase in multilingual settings, previous studies have leveraged knowledge from the machine translation field, i.e., forming a paraphrase through zero-shot machine translation within the same language. Despite good performance in human evaluation, those methods still require parallel translation datasets, making them inapplicable to languages without parallel corpora. To mitigate this problem, we propose the first unsupervised multilingual paraphrasing model, LAMPAT (Low-rank Adaptation for Multilingual Paraphrasing using Adversarial Training), for which a monolingual dataset is sufficient to generate human-like and diverse sentences. Through our experiments, we find that our method not only works well for English but also generalizes to unseen languages. Data and code are available at https://github.com/phkhanhtrinh23/LAMPAT.



Paperid:2044
Authors:Thanh-Thien Le, Manh Nguyen, Tung Thanh Nguyen, Linh Ngo Van, Thien Huu Nguyen
VinAI Research, Vietnam, Hanoi University of Science and Technology, Vietnam, University of Michigan, USA, Hanoi University of Science and Technology, Vietnam, University of Oregon, USA
Abstract:
Building continual relation extraction (CRE) models that can adapt to an ever-growing ontology of relations is a cornerstone information extraction task serving various dynamic real-world domains. To mitigate catastrophic forgetting in CRE, existing state-of-the-art approaches have effectively utilized rehearsal techniques from continual learning and achieved remarkable success. However, managing the multiple objectives associated with memory-based rehearsal remains underexplored, often relying on simple summation and overlooking complex trade-offs. In this paper, we propose Continual Relation Extraction via Sequential Multi-task Learning (CREST), a novel CRE approach built upon a tailored multi-task learning framework for continual learning. CREST takes into consideration the disparity in the magnitudes of gradient signals of different objectives, thereby effectively handling the inherent difference between multi-task learning and continual learning. Through extensive experiments on multiple datasets, CREST demonstrates significant improvements in CRE performance as well as superiority over other state-of-the-art multi-task learning frameworks, offering a promising solution to the challenges of continual learning in this domain.



Paperid:2045
Authors:Bo Li, Wei Ye, Quansen Wang, Wen Zhao, Shikun Zhang
Peking University, Peking University, Boston University, Peking University, Peking University
Abstract:
Textual label names (descriptions) are typically semantically rich in many natural language understanding (NLU) tasks. In this paper, we incorporate the prompting methodology, which is widely used to enrich model input, into the label side for the first time. Specifically, we propose a Mask Matching method, which equips an input with a prompt and its label with another, and then makes predictions by matching their mask representations. We evaluate our method extensively on 8 NLU tasks with 14 datasets. The experimental results show that Mask Matching significantly outperforms its fine-tuning and conventional prompt-tuning counterparts, establishing state-of-the-art performance on several datasets. Mask Matching is particularly good at handling NLU tasks with large label counts and informative label names. As a pioneering effort investigating the label-side prompt, we also discuss open issues for future study.
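One plausible reading of "matching their mask representations" is scoring an input against every label by cosine similarity between the [MASK] vectors of the prompted input and of each prompted label; the sketch below assumes such representations are already extracted, and the names and temperature are illustrative.

import torch
import torch.nn.functional as F

def mask_matching_logits(input_mask_repr, label_mask_reprs, temperature=0.07):
    # input_mask_repr: (batch, hidden) [MASK] vectors of prompted inputs.
    # label_mask_reprs: (num_labels, hidden) [MASK] vectors of prompted labels.
    q = F.normalize(input_mask_repr, dim=-1)
    k = F.normalize(label_mask_reprs, dim=-1)
    return q @ k.T / temperature             # (batch, num_labels) similarity logits

logits = mask_matching_logits(torch.randn(4, 768), torch.randn(5, 768))
print(logits.argmax(dim=-1))                 # predicted label index per input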



Paperid:2046
Authors:Bobo Li, Hao Fei, Lizi Liao, Yu Zhao, Fangfang Su, Fei Li, Donghong Ji
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China, School of Computing, National University of Singapore, Singapore, School of Computing and Information Systems, Singapore Management University, Singapore, College of Intelligence and Computing, Tianjin University, Tianjin, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Abstract:
Dialogue Aspect-based Sentiment Quadruple (DiaASQ) is a newly emergent task aiming to extract the sentiment quadruple (i.e., targets, aspects, opinions, and sentiments) from conversations. While showing promising performance, the prior DiaASQ approach unfortunately falls short on the key challenges of DiaASQ, including insufficient modeling of discourse features and a lack of interaction modeling for quadruple extraction, which hinders further task improvement. To this end, we introduce a novel framework that not only capitalizes on comprehensive discourse feature modeling, but also captures the intrinsic interaction for optimal quadruple extraction. On the one hand, drawing upon multiple discourse features, our approach constructs a token-level heterogeneous graph and enhances token interactions through a heterogeneous attention network. We further propose a novel triadic scorer, strengthening weak token relations within a quadruple, thereby enhancing the cohesion of quadruple extraction. Experimental results on the DiaASQ benchmark showcase that our model significantly outperforms existing baselines across both English and Chinese datasets. Our code is available at https://bit.ly/3v27pqA.



Paperid:2047
Authors:Changmao Li, Jeffrey Flanigan
University of California, Santa Cruz, University of California, Santa Cruz
Abstract:
Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot or few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how the zero-shot and few-shot performance of LLMs has changed chronologically, over datasets released over time and over LLMs released over time. Utilizing GPT-3 series models and several other recent open-source LLMs, and controlling for dataset difficulty, we find that datasets released prior to an LLM's training data creation date perform surprisingly better than datasets released after it. This strongly indicates that, for many LLMs, there exists task contamination in zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, training data extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero- and few-shot settings.



Paperid:2048
Authors:Chengzhengxu Li, Xiaoming Liu, Yichen Wang, Duyi Li, Yu Lan, Chao Shen
Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
The prompt-based pre-trained language model (PLM) paradigm has succeeded substantially in few-shot natural language processing (NLP) tasks. However, prior discrete prompt optimization methods require expert knowledge to design the base prompt set and identify high-quality prompts, which is costly, inefficient, and subjective. Meanwhile, existing continuous prompt optimization methods improve performance by learning ideal prompts through the gradient information of PLMs, whose high computational cost and low readability and generalizability are often concerning. To address this research gap, we propose a Dialogue-comprised Policy-gradient-based Discrete Prompt Optimization (DP_2O) method. We first design a multi-round dialogue alignment strategy for readable prompt set generation based on GPT-4. Furthermore, we propose an efficient prompt screening metric to identify high-quality prompts with linear complexity. Finally, we construct a reinforcement learning (RL) framework based on policy gradients to optimally match prompts to inputs. By training a policy network with only 0.62M parameters on the tasks in the few-shot setting, DP_2O outperforms the state-of-the-art (SOTA) method by 1.52% in accuracy on average on four open-source datasets. Moreover, subsequent experiments also demonstrate that DP_2O has good universality, robustness, and generalization ability.



Paperid:2049
Authors:Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He
Microsoft, Microsoft, Microsoft, Microsoft, Microsoft, Microsoft, Microsoft
Abstract:
Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost ($3.7K if rented on Azure), while still maintaining 95% of the model quality compared to the baseline with full data and cost ($46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost. DeepSpeed Data Efficiency is easy to use and tune, enabling us to easily apply it and verify its benefit on additional tasks, including GPT-3 MoE model pretraining and small-scale GPT-2/ViT fine-tuning.



Paperid:2050
Authors:Fangjun Li, David C. Hogg, Anthony G. Cohn
University of Leeds, University of Leeds, University of Leeds and the Alan Turing Institute
Abstract:
Artificial intelligence (AI) has made remarkable progress across various domains, with large language models like ChatGPT gaining substantial attention for their human-like text-generation capabilities. Despite these achievements, improving spatial reasoning remains a significant challenge for these models. Benchmarks like StepGame evaluate AI spatial reasoning, where ChatGPT has shown unsatisfactory performance. However, the presence of template errors in the benchmark has an impact on the evaluation results. Thus there is potential for ChatGPT to perform better if these template errors are addressed, leading to more accurate assessments of its spatial reasoning capabilities. In this study, we refine the StepGame benchmark, providing a more accurate dataset for model evaluation. We analyze GPT’s spatial reasoning performance on the rectified benchmark, identifying proficiency in mapping natural language text to spatial relations but limitations in multi-hop reasoning. We provide a flawless solution to the benchmark by combining template-to-relation mapping with logic-based reasoning. This combination demonstrates proficiency in performing qualitative reasoning on StepGame without encountering any errors. We then address the limitations of GPT models in spatial reasoning. To improve spatial reasoning, we deploy Chain-of-Thought and Tree-of-Thoughts prompting strategies, offering insights into GPT’s cognitive process. Our investigation not only sheds light on model deficiencies but also proposes enhancements, contributing to the advancement of AI with more robust spatial reasoning capabilities.



Paperid:2051
Authors:Hongxiang Li, Meng Cao, Xuxin Cheng, Yaowei Li, Zhihong Zhu, Yuexian Zou
School of Electronic and Computer Engineering, Peking University, International Digital Economy Academy (IDEA), School of Electronic and Computer Engineering, Peking University, School of Electronic and Computer Engineering, Peking University, School of Electronic and Computer Engineering, Peking University, School of Electronic and Computer Engineering, Peking University
Abstract:
Video grounding aims to locate a moment of interest matching a given query sentence in an untrimmed video. Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset. In this paper, we contend that exploiting easily available captions that describe general actions, i.e., auxiliary captions defined in our paper, can significantly boost performance. To this end, we propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions via Non-Auxiliary Caption Suppression (NACS). To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and query sentences into the temporal space and fuse them into visual representations. Considering the gap between auxiliary captions and ground truth, we propose Asymmetric Cross-modal Contrastive Learning (ACCL) to construct more negative pairs and maximize cross-modal mutual information. Extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS, and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods.



Paperid:2052
Authors:Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme, Mohit Bansal
University of North Carolina, Chapel Hill Amazon Alexa AI, Amazon Alexa AI, Amazon Alexa AI, University of North Carolina, Chapel Hill
Abstract:
Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate through realistic 3D outdoor environments based on natural language instructions. The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data. To address these issues, we propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos from multiple cities in the U.S., augmented with automatically generated navigation instructions and actions, to improve outdoor VLN performance. VLN-Video combines the best of intuitive classical approaches and modern deep learning techniques, using template infilling to generate grounded, non-repetitive navigation instructions, combined with an image-rotation-similarity-based navigation action predictor to obtain VLN-style data from driving videos for pretraining deep learning VLN models. We pre-train the model on the Touchdown dataset and our video-augmented dataset created from driving videos with three proxy tasks: Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction, so as to learn temporally aware and visually aligned instruction representations. The learned instruction representation is adapted to the state-of-the-art navigation agent when fine-tuning on the Touchdown dataset. Empirical results demonstrate that VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate, achieving a new state-of-the-art on the Touchdown dataset.



Paperid:2053
Authors:Jiangnan Li, Yice Zhang, Shiwei Chen, Ruifeng Xu
Harbin Institute of Technology, Shenzhen; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory, Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory, Harbin Institute of Technology, Shenzhen; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies; Peng Cheng Laboratory
Abstract:
Generative methods tackle Multi-Label Classification (MLC) by autoregressively generating label sequences. These methods excel at modeling label correlations and have achieved outstanding performance. However, a key challenge is determining the order of labels, as empirical findings indicate the significant impact of different orders on model learning and inference. Previous works adopt static label-ordering methods, assigning a unified label order for all samples based on label frequencies or co-occurrences. Nonetheless, such static methods neglect the unique semantics of each sample. More critically, these methods can cause the model to rigidly memorize the training order, resulting in missing labels during inference. In light of these limitations, this paper proposes a dynamic label-order learning approach that adaptively learns a label order for each sample. Specifically, our approach adopts a difficulty-prioritized principle and iteratively constructs the label sequence based on the sample's semantics. To reduce the additional cost incurred by label-order learning, we use the same SEQ2SEQ model for label-order learning and MLC learning and introduce a unified loss function for joint optimization. Extensive experiments on public datasets reveal that our approach greatly outperforms previous methods. We will release our code at https://github.com/KagamiBaka/DLOL.



Paperid:2054
Authors:Liang Li, Qingyuan Li, Bo Zhang, Xiangxiang Chu
Meituan, Meituan, Meituan, Meituan
Abstract:
As the size of large language models (LLMs) continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision while being cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we carefully design a tweaking strategy that includes calibration data generation and a channel-wise distance constraint to update the weights of normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-source LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, our method even achieves the same level of accuracy at 2-bit quantization as their float counterparts. Our simple and effective approach makes it more practical for real-world applications.
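A rough sketch of the norm tweaking idea, under the assumption that only the LayerNorm affine parameters are updated so the quantized block's activation statistics move back toward the float block's on calibration data; the blocks, the channel-wise mean/variance distance, and all hyperparameters are illustrative, not the released method.

import torch

def tweak_norm(ln, calib_acts, quant_block, float_block, steps=100, lr=1e-4):
    # Update only the LayerNorm weight/bias; quant_block and float_block are
    # assumed callables running the quantized and float versions of a layer.
    opt = torch.optim.Adam(ln.parameters(), lr=lr)
    with torch.no_grad():
        target = float_block(ln(calib_acts))          # float reference activations
    for _ in range(steps):
        out = quant_block(ln(calib_acts))
        loss = ((out.mean(dim=0) - target.mean(dim=0)) ** 2).sum() \
             + ((out.var(dim=0) - target.var(dim=0)) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ln

# Toy usage with stand-in blocks (a real setup would use an LLM's layers).
ln = torch.nn.LayerNorm(16)
W = torch.randn(16, 16)
float_block = lambda x: x @ W
quant_block = lambda x: x @ ((W / 0.25).round() * 0.25)   # crude weight quantization
_ = tweak_norm(ln, torch.randn(64, 16), quant_block, float_block)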



Paperid:2055
Authors:Peize Li, Qingyi Si, Peng Fu, Zheng Lin, Yan Wang
School of Artificial Intelligence, Jilin University, Changchun, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China, School of Artificial Intelligence, Jilin University, Changchun, China Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
Abstract:
Visual question answering is a multimodal task that requires the joint comprehension of visual and textual information. However, integrating visual and textual semantics solely through attention layers is insufficient to comprehensively understand and align information from both modalities. Intuitively, object attributes can naturally serve as a bridge to unify them, which has been overlooked in previous research. In this paper, we propose a novel VQA approach from the perspective of utilizing object attributes, aiming to achieve better object-level visual-language alignment and multimodal scene understanding. Specifically, we design an attribute fusion module and a contrastive knowledge distillation module. The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing. The enhanced object-level visual features contribute to solving fine-grained problems such as counting questions. The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness. Furthermore, to augment scene understanding and out-of-distribution performance, the contrastive knowledge distillation module introduces a series of implicit knowledge. We distill knowledge into attributes through a contrastive loss, which further strengthens the representation learning of attribute features and facilitates visual-linguistic alignment. Extensive experiments on six datasets, COCO-QA, VQAv2, VQA-CPv2, VQA-CPv1, VQAvs, and TDIUC, show the superiority of the proposed method.



Paperid:2056
Authors:Shuang Li, Jiangjie Chen, Siyu Yuan, Xinyi Wu, Hao Yang, Shimin Tao, Yanghua Xiao
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, School of Data Science, Fudan University, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Huawei Translation Services Center, Huawei Translation Services Center, Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University Fudan-Aishu Cognitive Intelligence Joint Research Center
Abstract:
To translate well, machine translation (MT) systems and general-purpose language models (LMs) need a deep understanding of both source and target languages and cultures. Therefore, idioms, with their non-compositional nature, pose particular challenges for Transformer-based systems, as literal translations often miss the intended meaning. Traditional methods, which replace idioms using existing knowledge bases (KBs), often lack scale and context-awareness. Addressing these challenges, our approach prioritizes context-awareness and scalability, allowing for offline storage of idioms in a manageable KB size. This ensures efficient serving with smaller models and provides a more comprehensive understanding of idiomatic expressions. We introduce a multilingual idiom KB (IdiomKB), developed using large LMs, to address this. The KB facilitates better translation by smaller models, such as BLOOMZ (7.1B), Alpaca (7B), and InstructGPT (6.7B), by retrieving idioms' figurative meanings. We present a novel GPT-4-powered metric for human-aligned evaluation, demonstrating that IdiomKB considerably boosts model performance. Human evaluations further validate our KB's quality.



Paperid:2057
Authors:Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, Jie Yu
National University of Defense Technology, National University of Defense Technology, National University of Defense Technology, National University of Defense Technology, National University of Defense Technology, National University of Defense Technology
Abstract:
Model editing techniques modify a minor proportion of knowledge in Large Language Models (LLMs) at a relatively low cost, and have demonstrated notable success. Existing methods assume Transformer Layer (TL) hidden states are the values of key-value memories of the Feed-Forward Network (FFN). They usually optimize the TL hidden states to memorize target knowledge and use them to update the weights of the FFN in LLMs. However, the information flow of TL hidden states comes from three parts: Multi-Head Self-Attention (MHSA), the FFN, and residual connections. Existing methods neglect the fact that the TL hidden states contain information not specifically required by the FFN. Consequently, the performance of model editing decreases. To achieve more precise model editing, we analyze the hidden states of MHSA and FFN, finding that MHSA encodes certain general knowledge extraction patterns. This implies that MHSA weights do not require updating when new knowledge is introduced. Based on the above findings, we introduce PMET, which simultaneously optimizes Transformer Component (TC, namely MHSA and FFN) hidden states, while only using the optimized TC hidden states of the FFN to precisely update the FFN weights. Our experiments demonstrate that PMET exhibits state-of-the-art performance on both the CounterFact and zsRE datasets. Our ablation experiments substantiate the effectiveness of our enhancements, further reinforcing the finding that MHSA encodes certain general knowledge extraction patterns and indicating that it stores a small amount of factual knowledge. Our code is available at https://github.com/xpq-tech/PMET.



Paperid:2058
Authors:Xue Li, Jia Su, Yang Yang, Zipeng Gao, Xinyu Duan, Yi Guan
Harbin Institute of Technology, Huawei Cloud, Harbin Institute of Technology, University of Science and Technology of China, Huawei Cloud, Harbin Institute of Technology
Abstract:
The generation of logically coherent dialogues by humans relies on underlying cognitive abilities. Based on this, we redefine the dialogue coherence evaluation process, combining cognitive judgment with the basic text to achieve a more human-like evaluation. We propose a novel dialogue evaluation framework based on a Dialogue Cognition Graph (DCGEval) to implement this fusion through in-depth interaction between cognition modeling and text modeling. The proposed Abstract Meaning Representation (AMR)-based graph structure, called DCG, aims to uniformly model four dialogue cognitive abilities. Specifically, core-semantic cognition is modeled by converting the utterance into an AMR graph, which extracts essential semantic information without redundancy. Temporal and role cognition are modeled by establishing logical relationships among the different AMR graphs. Finally, commonsense knowledge from ConceptNet is fused to express commonsense cognition. Experiments demonstrate the necessity of modeling human cognition for dialogue evaluation, and our DCGEval presents stronger correlations with human judgments compared to other state-of-the-art evaluation metrics.



Paperid:2059
Authors:Yangning Li, Shirong Ma, Xiaobin Wang, Shen Huang, Chengyue Jiang, Hai-Tao Zheng, Pengjun Xie, Fei Huang, Yong Jiang
SIGS, Tsinghua University PengCheng Laboratory, SIGS, Tsinghua University, DAMO Academy, Alibaba Group, DAMO Academy, Alibaba Group, Shanghaitech University, SIGS, Tsinghua University PengCheng Laboratory, DAMO Academy, Alibaba Group, DAMO Academy, Alibaba Group, DAMO Academy, Alibaba Group
Abstract:
Recently, instruction-following Large Language Models (LLMs), represented by ChatGPT, have exhibited exceptional performance on general Natural Language Processing (NLP) tasks. However, the unique characteristics of E-commerce data pose significant challenges to general LLMs. An LLM tailored specifically for E-commerce scenarios, possessing robust cross-dataset/task generalization capabilities, is a pressing necessity. To solve this issue, in this work we propose the first E-commerce instruction dataset, EcomInstruct, with a total of 2.5 million instruction data. EcomInstruct scales up the data size and task diversity by constructing atomic tasks with E-commerce basic data types, such as product information and user reviews. Atomic tasks are defined as intermediate tasks implicitly involved in solving a final task, which we also call Chain-of-Task tasks. We developed EcomGPT at different parameter scales by training the backbone model BLOOMZ with EcomInstruct. Benefiting from the fundamental semantic understanding capabilities acquired from the Chain-of-Task tasks, EcomGPT exhibits excellent zero-shot generalization capabilities. Extensive experiments and human evaluations demonstrate that EcomGPT outperforms ChatGPT in terms of cross-dataset/task generalization on E-commerce tasks. EcomGPT will be made public at https://github.com/Alibaba-NLP/EcomGPT.



Paperid:2060
Authors:Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Bin Sun, Xinglin Wang, Heda Wang, Kan Li
Beijing Institute of Technology, Beijing Institute of Technology, Xiaohongshu Inc, Xiaohongshu Inc, Beijing Institute of Technology, Beijing Institute of Technology, Xiaohongshu Inc, Beijing Insitiute of Technology
Abstract:
Large Language Models (LLMs) have performed well on various reasoning tasks, but their inaccessibility and numerous parameters hinder wide application in practice. One promising way is distilling the reasoning ability from LLMs to small models through the generated chain-of-thought reasoning paths. In some cases, however, LLMs may produce incorrect reasoning chains, especially when facing complex mathematical problems. Previous studies only transfer knowledge from positive samples and drop the synthesized data with wrong answers. In this work, we illustrate the merit of negative data and propose a model specialization framework to distill LLMs with negative samples besides positive ones. The framework consists of three progressive steps, covering the training through inference stages, to absorb knowledge from negative data. We conduct extensive experiments across arithmetic reasoning tasks to demonstrate the role of negative data in distillation from LLMs.



Paperid:2061
Authors:Yucheng Li, Frank Guerin, Chenghua Lin
University of Surrey, University of Surrey, University of Manchester
Abstract:
Data contamination in evaluation is becoming increasingly prevalent with the emergence of language models pre-trained on super large, automatically crawled corpora. This problem leads to significant challenges in the accurate assessment of model capabilities and generalisation. In this paper, we propose LatestEval, an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations. LatestEval avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language models. We develop the LatestEval automated pipeline to 1) gather the latest texts; 2) identify key information; and 3) construct questions targeting the information while removing the existing answers from the context. This encourages models to infer the answers themselves based on the remaining context, rather than simply copying and pasting. Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval as opposed to previous benchmarks, suggesting a significantly reduced risk of data contamination and leading to a more robust evaluation. Data and code are publicly available at: https://github.com/liyucheng09/LatestEval.



Paperid:2062
Authors:Zhenyu Li, Sunqi Fan, Yu Gu, Xiuxing Li, Zhichao Duan, Bowen Dong, Ning Liu, Jianyong Wang
Tsinghua University, Tsinghua University, The Ohio State University, University of Chinese Academy of Sciences Key Laboratory of Intelligent Information Processing Institute of Computing Technology, CAS, Tsinghua University, Tsinghua University, Shandong University, Tsinghua University
Abstract:
Knowledge base question answering (KBQA) is a critical yet challenging task due to the vast number of entities within knowledge bases and the diversity of natural language questions posed by users. Unfortunately, the performance of most KBQA models tends to decline significantly in real-world scenarios where high-quality annotated data is insufficient. To mitigate the burden associated with manual annotation, we introduce FlexKBQA, which utilizes Large Language Models (LLMs) as program translators to address the challenges inherent in the few-shot KBQA task. Specifically, FlexKBQA leverages automated algorithms to sample diverse programs, such as SPARQL queries, from the knowledge base, which are subsequently converted into natural language questions via LLMs. This synthetic dataset facilitates training a specialized lightweight model for the KB. Additionally, to reduce the barriers of distribution shift between synthetic data and real user questions, FlexKBQA introduces an execution-guided self-training method to iteratively leverage unlabeled user questions. Furthermore, we explore harnessing the inherent reasoning capability of LLMs to enhance the entire framework. Consequently, FlexKBQA delivers substantial flexibility, encompassing data annotation, deployment, and being domain agnostic. Through extensive experiments on GrailQA, WebQSP, and KQA Pro, we observe that under few-shot and even the more challenging zero-shot scenarios, FlexKBQA achieves impressive results with a few annotations, surpassing all previous baselines and even approaching the performance of supervised models, achieving a remarkable 93% of the performance of fully-supervised models. We posit that FlexKBQA represents a significant advancement towards better integration of large and lightweight models. Code is available at https://github.com/leezythu/FlexKBQA.



Paperid:2063
Authors:Yaobo Liang, Quanzhi Zhu, Junhe Zhao, Nan Duan
Microsoft Research Asia, Microsoft Research Asia, Microsoft Research Asia, Microsoft Research
Abstract:
There are two primary approaches to addressing cross-lingual transfer: multilingual pre-training, which implicitly aligns the hidden representations of various languages, and translate-test, which explicitly translates different languages into an intermediate language, such as English. Translate-test offers better interpretability compared to multilingual pre-training. However, it has lower performance than multilingual pre-training and struggles with word-level tasks due to translation altering word order. As a result, we propose a new Machine-created Universal Language (MUL) as an alternative intermediate language. MUL comprises a set of discrete symbols forming a universal vocabulary and a natural language to MUL translator for converting multiple natural languages to MUL. MUL unifies shared concepts from various languages into a single universal word, enhancing cross-language transfer. Additionally, MUL retains language-specific words and word order, allowing the model to be easily applied to word-level tasks. Our experiments demonstrate that translating into MUL yields improved performance compared to multilingual pre-training, and our analysis indicates that MUL possesses strong interpretability. The code is at: https://github.com/microsoft/Unicoder/tree/master/MCUL.



Paperid:2064
Authors:Ying-Jia Lin, Chun-Yi Lin, Chia-Jen Yeh, Yi-Ting Li, Yun-Yu Hu, Chih-Hao Hsu, Mei-Feng Lee, Hung-Yu Kao
Department of Computer Science and Information Engineering, National Cheng Kung University, Department of Computer Science and Information Engineering, National Cheng Kung University, Department of Computer Science and Information Engineering, National Cheng Kung University, Department of Computer Science and Information Engineering, National Cheng Kung University, Department of Computer Science and Information Engineering, National Cheng Kung University, Department of Computer Science and Information Engineering, National Cheng Kung University, Department of Computer Science and Information Engineering, National Cheng Kung University, Department of Computer Science and Information Engineering, National Cheng Kung University
Abstract:
We present CFEVER, a Chinese dataset designed for Fact Extraction and VERification. CFEVER comprises 30,012 manually created claims based on content in Chinese Wikipedia. Each claim in CFEVER is labeled as “Supports”, “Refutes”, or “Not Enough Info” to depict its degree of factualness. Similar to the FEVER dataset, claims in the “Supports” and “Refutes” categories are also annotated with corresponding evidence sentences sourced from single or multiple pages in Chinese Wikipedia. Our labeled dataset holds a Fleiss’ kappa value of 0.7934 for five-way inter-annotator agreement. In addition, through the experiments with the state-of-the-art approaches developed on the FEVER dataset and a simple baseline for CFEVER, we demonstrate that our dataset is a new rigorous benchmark for factual extraction and verification, which can be further used for developing automated systems to alleviate human fact-checking efforts. CFEVER is available at https://ikmlab.github.io/CFEVER.



Paperid:2065
Authors:Chang Liu, Yuanhe Tian, Weidong Chen, Yan Song, Yongdong Zhang
University of Science and Technology of China, University of Washington, University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China
Abstract:
Radiology report generation (RRG) aims to automatically generate a free-text description from a specific clinical radiograph, e.g., chest X-Ray images. Existing approaches tend to perform RRG with specific models trained on the public yet limited data from scratch, which often leads to inferior performance owing to insufficient capabilities in both aligning visual and textual features and generating informative reports accordingly. Currently, large language models (LLMs) offer a promising solution to text generation with their power in learning from big data, especially for cross-modal scenarios such as RRG. However, most existing LLMs are pre-trained on general data, and suffer from the same problem as conventional approaches, caused by the knowledge gap between the general and medical domains, if they are applied to RRG. Therefore, in this paper, we propose an approach to bootstrapping LLMs for RRG with an in-domain instance induction and a coarse-to-fine decoding process. Specifically, the in-domain instance induction process learns to align the LLM to radiology reports from general texts through contrastive learning. The coarse-to-fine decoding performs a text elevating process for those reports from the ranker, further enhanced with visual features and refinement prompts. Experimental results on two prevailing RRG datasets, namely, IU X-Ray and MIMIC-CXR, demonstrate the superiority of our approach to previous state-of-the-art solutions. Further analyses illustrate that, for the LLM, the induction process enables it to better align with the medical domain and the coarse-to-fine generation allows it to conduct more precise text generation.



Paperid:2066
Authors:Han Liu, Siyang Zhao, Xiaotong Zhang, Feng Zhang, Wei Wang, Fenglong Ma, Hongyang Chen, Hong Yu, Xianchao Zhang
Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Peking University, Shenzhen MSU-BIT University, The Pennsylvania State University, Zhejiang Lab, Dalian University of Technology, Dalian University of Technology
Abstract:
Few-shot and zero-shot text classification aim to recognize samples from novel classes with limited labeled samples or no labeled samples at all. While prevailing methods have shown promising performance via transferring knowledge from seen classes to unseen classes, they are still limited by (1) Inherent dissimilarities among classes make the transformation of features learned from seen classes to unseen classes both difficult and inefficient. (2) Rare labeled novel samples usually cannot provide enough supervision signals to enable the model to adjust from the source distribution to the target distribution, especially for complicated scenarios. To alleviate the above issues, we propose a simple and effective strategy for few-shot and zero-shot text classification. We aim to liberate the model from the confines of seen classes, thereby enabling it to predict unseen categories without the necessity of training on seen classes. Specifically, for mining more related unseen category knowledge, we utilize a large pre-trained language model to generate pseudo novel samples, and select the most representative ones as category anchors. After that, we convert the multi-class classification task into a binary classification task and use the similarities of query-anchor pairs for prediction to fully leverage the limited supervision signals. Extensive experiments on six widely used public datasets show that our proposed method can outperform other strong baselines significantly in few-shot and zero-shot tasks, even without using any seen class samples.
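A minimal sketch of the query-anchor matching step described above: each query is scored against every category anchor with cosine similarity and the best-matching class is predicted. The embeddings are assumed to come from any sentence encoder; the similarity and prediction rule are illustrative, not the paper's exact formulation.

```python
import numpy as np

def predict_with_anchors(query_emb, anchor_embs, anchor_labels):
    """Score a query against category anchors and return the predicted label."""
    q = query_emb / np.linalg.norm(query_emb)
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    sims = a @ q                     # one similarity per (query, anchor) pair
    return anchor_labels[int(np.argmax(sims))], sims

rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 8))    # toy anchors for three unseen classes
labels = ["sports", "politics", "tech"]
print(predict_with_anchors(rng.normal(size=8), anchors, labels)[0])
```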



Paperid:2067
Authors:Jingping Liu, Mingchuan Zhang, Weichen Li, Chao Wang, Shuang Li, Haiyun Jiang, Sihang Jiang, Yanghua Xiao, Yunwen Chen
East China University of Science and Technology, Fudan University, Fudan University, Shanghai University, Fudan University, Tencent AI Lab, Fudan University, Fudan University, DataGrand Inc.
Abstract:
Much effort has been devoted to building multimodal knowledge graphs by visualizing entities on images, while ignoring the multi-modal information of the relations between entities. Hence, in this paper, we aim to construct a new large-scale multi-modal knowledge graph with triplet facts grounded on images that reflect not only entities but also their relations. To achieve this purpose, we propose a novel pipeline method, including triplet fact filtering, image retrieving, entity-based image filtering, relation-based image filtering, and image clustering. In this way, a multi-modal knowledge graph named ImgFact is constructed, which contains 247,732 triplet facts and 3,730,805 images. In experiments, the manual and automatic evaluations prove the reliable quality of our ImgFact. We further use the obtained images to enhance model performance on two tasks. In particular, the model optimized by our ImgFact achieves an impressive 8.38% and 9.87% improvement over the solutions enhanced by an existing multi-modal knowledge graph and VisualChatGPT on F1 of relation classification. We release ImgFact and its instructions at https://github.com/kleinercubs/ImgFact.



Paperid:2068
Authors:Linfeng Liu, Hongqiu Wu, Hai Zhao
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
This paper studies Chinese Spelling Correction (CSC), which aims to detect and correct potential spelling errors in a given sentence. Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs. However, we note a critical flaw in the process of tagging one character to another: the correction is excessively conditioned on the error. This runs contrary to the human mindset, where individuals rephrase the complete sentence based on its semantics, rather than solely on the error patterns memorized before. Such a counter-intuitive learning process results in the bottleneck of generalizability and transferability of machine spelling correction. To address this, we propose Rephrasing Language Modeling (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging. This novel training paradigm achieves new state-of-the-art results across fine-tuned and zero-shot CSC benchmarks, outperforming previous counterparts by a large margin. Our method also learns transferable language representation when CSC is jointly trained with other tasks.



Paperid:2069
Authors:Longxiang Liu, Xiuxing Li, Yang Feng
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS) University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS) University of Chinese Academy of Sciences, Beijing, China, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS) University of Chinese Academy of Sciences, Beijing, China
Abstract:
Task-oriented dialog systems have witnessed substantial progress due to conversational pre-training techniques. Yet, two significant challenges persist. First, most systems primarily utilize the latest turn's state label for the generator. This practice overlooks the comprehensive value of state labels in boosting the model's understanding for future generations. Second, an overreliance on generated policy often leads to error accumulation, resulting in suboptimal responses when adhering to incorrect actions. To combat these challenges, we propose turn-level multi-task objectives for the encoder. With the guidance of essential information from labeled intermediate states, we establish a more robust representation for both understanding and generation. For the decoder, we introduce an action tree-based scheduled sampling technique. Specifically, we model the hierarchical policy as trees and utilize the similarity between trees to sample negative policies based on scheduled sampling, encouraging the model to generate invariant responses under perturbations. This method simulates potential pitfalls by sampling similar negative policies, bridging the gap between task-oriented dialog training and inference. Among methods without continual pre-training, our approach achieved state-of-the-art (SOTA) performance on the MultiWOZ dataset series and was also competitive with pre-trained SOTA methods.



Paperid:2070
Authors:Peipei Liu, Hong Li, Yimo Ren, Jie Liu, Shuaizong Si, Hongsong Zhu, Limin Sun
Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences
Abstract:
Mining structured knowledge from tweets using named entity recognition (NER) can be beneficial for many downstream applications such as recommendation and intention understanding. With tweet posts tending to be multimodal, multimodal named entity recognition (MNER) has attracted more attention. In this paper, we propose a novel approach, which can dynamically align the image and text sequence and achieve multi-level cross-modal learning to augment textual word representation for MNER improvement. To be specific, our framework can be split into three main stages: the first stage focuses on intra-modality representation learning to derive the implicit global and local knowledge of each modality; the second evaluates the relevance between the text and its accompanying image and integrates different-grained visual information based on the relevance; the third enforces semantic refinement via iterative cross-modal interactions and co-attention. We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.



Paperid:2071
Authors:Qingyi Liu, Jinghui Qin, Wenxuan Ye, Hao Mou, Yuxuan He, Keze Wang
Sun Yat-sen University, Guangdong University of Technology, X-Era AI Co., Ltd., Datastory, Datastory, Sun Yat-sen University
Abstract:
Recently, arbitrary text style transfer (TST) has made significant progress with the paradigm of prompt learning. In this paradigm, researchers often design or search for a fixed prompt for any input. However, existing evidence shows that large language models (LLMs) are prompt-sensitive and it is sub-optimal to apply the same prompt to any input for downstream TST tasks. Besides, the prompts obtained by searching are often unreadable and unexplainable to humans. To address these issues, we propose an Adaptive Prompt Routing (APR) framework to adaptively route prompts from a human-readable prompt set for various input texts and given styles. Specifically, we first construct a candidate prompt set of diverse and human-readable prompts for the target style. This set consists of several seed prompts and their variants paraphrased by an LLM. Subsequently, we train a prompt routing model to select the optimal prompts efficiently according to inputs. The adaptively selected prompt can guide the LLMs to perform a precise style transfer for each input sentence while maintaining readability for humans. Extensive experiments on 4 public TST benchmarks over 3 popular LLMs (with parameter sizes ranging from 1.5B to 175B) demonstrate that our APR achieves superior style transfer performances, compared to the state-of-the-art prompt-based and fine-tuning methods. The source code is available at https://github.com/DwyaneLQY/APR



Paperid:2072
Authors:Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li
Inner Mongolian University, Inner Mongolian University, ByteDance, ByteDance, Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China National University of Singapore, Singapore
Abstract:
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. While recognising the significance of the CSS task, the prior studies have not thoroughly investigated the emotional expressiveness problems due to the scarcity of emotional conversational datasets and the difficulty of stateful emotion modeling. In this paper, we propose a novel emotional CSS model, termed ECSS, that includes two main components: 1) to enhance emotion understanding, we introduce a heterogeneous graph-based emotional context modeling mechanism, which takes the multi-source dialogue history as input to model the dialogue context and learn the emotion cues from the context; 2) to achieve emotion rendering, we employ a contrastive learning-based emotion renderer module to infer the accurate emotion style for the target utterance. To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity, and annotate additional emotional information on the existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in understanding and rendering emotions. These evaluations also underscore the importance of comprehensive emotional annotations. Code and audio samples can be found at: https://github.com/walker-hyf/ECSS.



Paperid:2073
Authors:Yang Liu
Tianjin University
Abstract:
Many evaluation measures are used to evaluate social biases in masked language models (MLMs). However, we find that these previously proposed evaluation measures lack robustness in scenarios with limited datasets. This is because these measures are obtained by comparing the pseudo-log-likelihood (PLL) scores of the stereotypical and anti-stereotypical samples using an indicator function. The disadvantage is that such comparison mines the PLL score sets only in a limited way, without capturing their distributional information. In this paper, we represent a PLL score set as a Gaussian distribution and use Kullback-Leibler (KL) divergence and Jensen–Shannon (JS) divergence to construct evaluation measures for the distributions of stereotypical and anti-stereotypical PLL scores. Experimental results on the publicly available datasets StereoSet (SS) and CrowS-Pairs (CP) show that our proposed measures are significantly more robust and interpretable than those proposed previously.
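
A small worked sketch of the core measure described above: fit a Gaussian to each PLL score set and compare the two fitted distributions with KL (closed form for univariate Gaussians) and JS (estimated numerically). The toy PLL values and grid resolution are illustrative, not the paper's setup.

```python
import numpy as np

def gaussian_kl(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) ) for univariate Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def gaussian_js(m1, s1, m2, s2, n=20001):
    """Numerically estimated JS divergence between the two fitted Gaussians."""
    lo = min(m1 - 6 * s1, m2 - 6 * s2)
    hi = max(m1 + 6 * s1, m2 + 6 * s2)
    grid = np.linspace(lo, hi, n)
    pdf = lambda m, s: np.exp(-0.5 * ((grid - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    p, q = pdf(m1, s1), pdf(m2, s2)
    mix = 0.5 * (p + q)
    kl = lambda a, b: np.trapz(a * np.log((a + 1e-12) / (b + 1e-12)), grid)
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

stereo_plls = np.array([-42.1, -39.8, -45.0, -40.3])   # toy PLL scores
anti_plls = np.array([-44.5, -47.2, -43.9, -46.1])
m1, s1 = stereo_plls.mean(), stereo_plls.std(ddof=1)
m2, s2 = anti_plls.mean(), anti_plls.std(ddof=1)
print(gaussian_kl(m1, s1, m2, s2), gaussian_js(m1, s1, m2, s2))
```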



Paperid:2074
Authors:Yonghao Liu, Lan Huang, Fausto Giunchiglia, Xiaoyue Feng, Renchu Guan
Jilin University, College of Computer Science and Technology, Jilin University, University of Trento, Jilin University, The Key Laboratory for Symbol Computation and Knowledge Engineering of the Ministry of Education College of Computer Science and Technology, Jilin University
Abstract:
Text classification occupies an important role in natural language processing and has many applications in real life. Short text classification, as one of its subtopics, has attracted increasing interest from researchers since it is more challenging due to its semantic sparsity and insufficient labeled data. Recent studies attempt to combine graph learning and contrastive learning to alleviate the above problems in short text classification. Despite their fruitful success, there are still several inherent limitations. First, the generation of augmented views may disrupt the semantic structure within the text and introduce negative effects due to noise permutation. Second, they ignore the clustering-friendly features in unlabeled data and fail to further utilize the prior information in the few valuable labeled data. To this end, we propose a novel model that utilizes improved Graph contrastIve learning for short text classiFicaTion (GIFT). Specifically, we construct a heterogeneous graph containing several component graphs by mining from an internal corpus and introducing an external knowledge graph. Then, we use singular value decomposition to generate augmented views for graph contrastive learning. Moreover, we employ constrained k-means on labeled texts to learn clustering-friendly features, which facilitate cluster-oriented contrastive learning and assist in obtaining better category boundaries. Extensive experimental results show that GIFT significantly outperforms previous state-of-the-art methods. Our code can be found at https://github.com/KEAML-JLU/GIFT.
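
A minimal sketch of the SVD-based augmentation idea mentioned above: a truncated, low-rank reconstruction of the graph's adjacency matrix serves as a smoothed view for contrastive learning instead of random edge or feature perturbation. The rank and the toy graph are illustrative assumptions.

```python
import numpy as np

def svd_augmented_view(adj: np.ndarray, rank: int) -> np.ndarray:
    """Return a rank-truncated reconstruction of the adjacency matrix."""
    u, s, vt = np.linalg.svd(adj, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

rng = np.random.default_rng(0)
adj = rng.random((6, 6))
adj = (adj + adj.T) / 2            # toy symmetric graph
view = svd_augmented_view(adj, rank=2)   # smoothed view for the contrastive branch
```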



Paperid:2075
Authors:Yushan Liu, Zili Wang, Ruifeng Yuan
Fudan University, INF Technology (Shanghai) Co., Ltd., Hong Kong Polytechnic University
Abstract:
Query-focused summarization (QFS) aims to summarize the source document(s) with regard to a specific aspect of information given in a query. It plays an important role in presenting users with a concise answer summary from a set of query-relevant documents retrieved by the information retrieval system. Nonetheless, the QFS research has long been hampered by the lack of adequate datasets in terms of both quality and quantity. In this paper, we introduce a large-scale multi-document query-focused summarization dataset, called QuerySum, which contains 27,041 data samples covering diverse topics, and its quality is guaranteed through human verification. Unlike some previous QFS datasets constructed directly from the question answering datasets, 74% of the queries in our dataset are challenging non-factoid What-, Why-, and How- questions. More importantly, we also provide a set of similar queries together with the corresponding summary pairs for each query as the retrieved context, presenting a new feature of QuerySum. We aim to encourage research efforts in query intention understanding in the context of QFS. Leveraging QuerySum's depth, we propose a model for query-aware multi-document summarization and set a new QFS benchmark.



Paperid:2076
Authors:Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou
Department of Electronic Engineering, Tsinghua University, Pattern Recognition Center, WeChat AI, Tencent Inc, China, Pattern Recognition Center, WeChat AI, Tencent Inc, China, Department of Electronic Engineering, Tsinghua University, Department of Electronic Engineering, Tsinghua University, Department of Electronic Engineering, Tsinghua University, Pattern Recognition Center, WeChat AI, Tencent Inc, China
Abstract:
Knowledge retrieval with multimodal queries plays a crucial role in supporting knowledge-intensive multi-modal applications. However, existing methods face challenges in terms of their effectiveness and training efficiency, especially when it comes to training and integrating multiple retrievers to handle multi-modal queries. In this paper, we propose an innovative end-to-end generative framework for multi-modal knowledge retrieval. Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases, even when trained with limited data. We retrieve knowledge via a two-step process: 1) generating knowledge clues related to the queries, and 2) obtaining the relevant document by searching databases using the knowledge clue. In particular, we first introduce an object-aware prefix-tuning technique to guide multi-grained visual learning. Then, we align multi-grained visual features into the textual feature space of the LLM, employing the LLM to capture cross-modal interactions. Subsequently, we construct instruction data with a unified format for model training. Finally, we propose the knowledge-guided generation strategy to impose prior constraints in the decoding steps, thereby promoting the generation of distinctive knowledge clues. Through experiments conducted on three benchmarks, we demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.



Paperid:2077
Authors:Da Luo, Yanglei Gan, Rui Hou, Run Lin, Qiao Liu, Yuxiang Cai, Wannian Gao
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Few-shot Relation Extraction (FSRE) aims to extract relational facts from a sparse set of labeled corpora. Recent studies have shown promising results in FSRE by employing Pre-trained Language Models (PLMs) within the framework of supervised contrastive learning, which considers both instances and label facts. However, how to effectively harness massive instance-label pairs to encompass the learned representation with semantic richness in this learning paradigm is not fully explored. To address this gap, we introduce a novel synergistic anchored contrastive pre-training framework. This framework is motivated by the insight that the diverse viewpoints conveyed through instance-label pairs capture incomplete yet complementary intrinsic textual semantics. Specifically, our framework involves a symmetrical contrastive objective that encompasses both sentence-anchored and label-anchored contrastive losses. By combining these two losses, the model establishes a robust and uniform representation space. This space effectively captures the reciprocal alignment of feature distributions among instances and relational facts, simultaneously enhancing the maximization of mutual information across diverse perspectives within the same relation. Experimental results demonstrate that our framework achieves significant performance enhancements compared to baseline models in downstream FSRE tasks. Furthermore, our approach exhibits superior adaptability to handle the challenges of domain shift and zero-shot relation extraction. Our code is available online at https://github.com/AONE-NLP/FSRE-SaCon.
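
A minimal sketch of a symmetrical anchored objective of the kind described above: an InfoNCE term anchored on sentence embeddings plus a mirrored term anchored on label embeddings, where row i of each matrix is a positive pair. The temperature and numpy formulation are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def info_nce(anchors, targets, temperature=0.1):
    """InfoNCE where row i of `anchors` should match row i of `targets`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = a @ t.T / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def symmetric_anchored_loss(sent_embs, label_embs):
    """Sentence-anchored plus label-anchored contrastive terms."""
    return info_nce(sent_embs, label_embs) + info_nce(label_embs, sent_embs)

rng = np.random.default_rng(0)
print(symmetric_anchored_loss(rng.normal(size=(4, 16)), rng.normal(size=(4, 16))))
```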



Paperid:2078
Authors:Mingyu Derek Ma, Xiaoxuan Wang, Po-Nien Kung, P. Jeffrey Brantingham, Nanyun Peng, Wei Wang
UCLA, UCLA, UCLA, UCLA, UCLA, UCLA
Abstract:
Information extraction tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies. They heavily rely on task-specific training data in the form of (passage, target structure) pairs to obtain reasonable performance. However, obtaining such data through human annotation is costly, leading to a pressing need for low-resource information extraction approaches that require minimal human labeling for real-world applications. Fine-tuning supervised models with synthesized training data would be a generalizable method, but the existing data generation methods either still rely on large-scale ground-truth data or cannot be applied to complicated IE tasks due to their poor performance. To address these challenges, we propose STAR, a data generation method that leverages Large Language Models (LLMs) to synthesize data instances given limited seed demonstrations, thereby boosting low-resource information extraction performance. Our approach involves generating target structures (Y) followed by generating passages (X), all accomplished with the aid of LLMs. We design fine-grained step-by-step instructions to obtain the initial data instances. We further reduce errors and improve data quality through self-reflection error identification and self-refinement with iterative revision. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data. Human assessment of the data quality shows STAR-generated data exhibit higher passage quality and better align with the task definitions compared with the human-curated data.



Paperid:2079
Authors:Hieu Man, Franck Dernoncourt, Thien Huu Nguyen
University of Oregon, Adobe Research, University of Oregon VinAI Research
Abstract:
To understand event structures of documents, event causality identification (ECI) emerges as a crucial task, aiming to discern causal relationships among event mentions. The latest approach for ECI has introduced advanced deep learning models where transformer-based encoding models, complemented by enriching components, are typically leveraged to learn effective event context representations for causality prediction. As such, an important step for ECI models is to transform the event context representations into causal label representations to perform logits score computation for training and inference purposes. Within this framework, event context representations might encapsulate numerous complicated and noisy structures due to the potential long context between the input events while causal label representations are intended to capture pure information about the causal relations to facilitate score estimation. Nonetheless, a notable drawback of existing ECI models stems from their reliance on simple feed-forward networks to handle the complex context-to-label representation transformation process, which might require drastic changes in the representations that hinder the learning process. To overcome this issue, our work introduces a novel method for ECI where, instead of abrupt transformations, event context representations are gradually updated to achieve effective label representations. This process will be done incrementally to allow filtering of irrelevant structures at varying levels of granularity for causal relations. To realize this, we present a diffusion model to learn gradual representation transition processes between context and causal labels. It operates through a forward pass for causal label representation noising and a reverse pass for reconstructing label representations from random noise. Our experiments on different datasets across multiple languages demonstrate the advantages of the diffusion model with state-of-the-art performance for ECI.



Paperid:2080
Authors:Hongli Mao, Xian-Ling Mao, Hanlin Tang, Yu-Ming Shang, Heyan Huang
Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology, Beijing University of Posts and Telecommunications, Beijing Institute of Technology
Abstract:
Named Entity Recognition (NER), which aims to identify the span and category of entities within text, is a fundamental task in natural language processing. Recent NER approaches have featured pretrained transformer-based models (e.g., BERT) as a crucial encoding component to achieve state-of-the-art performance. However, due to the length limit for input text, these models typically consider text at the sentence level and cannot capture the long-range contextual dependency within a document. To address this issue, we propose a novel Span Graph Transformer (SGT) method for document-level NER, which constructs long-range contextual dependencies at both the token and span levels. Specifically, we first retrieve relevant contextual sentences in the document for each target sentence, and jointly encode them by BERT to capture token-level dependencies. Then, our proposed model extracts candidate spans from each sentence and integrates these spans into a document-level span graph, where nested spans within sentences and identical spans across sentences are connected. By leveraging the power of Graph Transformer and well-designed position encoding, our span graph can fully exploit span-level dependencies within the document. Extensive experiments on both resource-rich nested and flat NER datasets, as well as low-resource distantly supervised NER datasets, demonstrate that the proposed SGT model achieves better performance than previous state-of-the-art models.



Paperid:2081
Authors:Emily McMilin
Independent Researcher
Abstract:
Modern language modeling tasks are often underspecified: for a given token prediction, many words may satisfy the user’s intent of producing natural language at inference time; however, only one word will minimize the task’s loss function at training time. We introduce a simple causal mechanism to describe the role underspecification plays in the generation of spurious correlations. Despite its simplicity, our causal model directly informs the development of two lightweight black-box evaluation methods that we apply to gendered pronoun resolution tasks on a wide range of LLMs to 1) aid in the detection of inference-time task underspecification by exploiting 2) previously unreported gender vs. time and gender vs. location spurious correlations on LLMs with a range of A) sizes: from BERT-base to GPT-3.5, B) pre-training objectives: from masked & autoregressive language modeling to a mixture of these objectives, and C) training stages: from pre-training only to reinforcement learning from human feedback (RLHF). Code and open-source demos are available at https://github.com/2dot71mily/uspec.



Paperid:2082
Authors:Ying Mo, Jian Yang, Jiahao Liu, Qifan Wang, Ruoyu Chen, Jingang Wang, Zhoujun Li
State Key Lab of Software Development Environment, Beihang University, Beijing, China, State Key Lab of Software Development Environment, Beihang University, Beijing, China, Meituan, Beijing, China, Meta AI, New York, United States, Beijing Information Science and Technology University, Beijing, China, Meituan, Beijing, China, State Key Lab of Software Development Environment, Beihang University, Beijing, China
Abstract:
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora, especially for non-English data. While prior efforts mainly focus on data-driven transfer methods, a significant aspect that has not been fully explored is aligning both semantic and token-level representations across diverse languages. In this paper, we propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (MCL-NER). Specifically, we reframe the CrossNER task into a problem of recognizing relationships between pairs of tokens. This approach taps into the inherent contextual nuances of token-to-token connections within entities, allowing us to align representations across different languages. A multi-view contrastive learning framework is introduced to encompass semantic contrasts between source, code-switched, and target sentences, as well as contrasts among token-to-token relations. By enforcing agreement within both semantic and relational spaces, we minimize the gap between source sentences and their counterparts of both code-switched and target sentences. This alignment extends to the relationships between diverse tokens, enhancing the projection of entities across languages. We further augment CrossNER by combining self-training with labeled source data and unlabeled target data. Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of MCL-NER over prior data-driven and model-based approaches. It achieves a substantial increase of nearly +2.0 F1 scores across a broad spectrum and establishes itself as the new state-of-the-art performer.



Paperid:2083
Authors:Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao
Samsung R&D Institute India - Bangalore, Samsung R&D Institute India - Bangalore, Samsung R&D Institute India - Bangalore, Samsung R&D Institute India - Bangalore, Samsung R&D Institute India - Bangalore
Abstract:
Large Language Models (LLMs) have demonstrated impressive performance in natural language processing tasks by leveraging chain of thought (CoT), which enables step-by-step thinking. Extending LLMs with multimodal capabilities is a recent interest, but it incurs computational cost and requires substantial hardware resources. To address these challenges, we propose KAM-CoT, a framework that integrates CoT reasoning, Knowledge Graphs (KGs), and multiple modalities for a comprehensive understanding of multimodal tasks. KAM-CoT adopts a two-stage training process with KG grounding to generate effective rationales and answers. By incorporating external knowledge from KGs during reasoning, the model gains a deeper contextual understanding, reducing hallucinations and enhancing the quality of answers. This knowledge-augmented CoT reasoning empowers the model to handle questions requiring external context, providing more informed answers. Experimental findings show KAM-CoT outperforms the state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by 10%. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness.



Paperid:2084
Authors:Alon Mor, Yonatan Belinkov, Benny Kimelfeld
Technion, Haifa, Israel, Technion, Haifa, Israel, Technion, Haifa, Israel
Abstract:
Local explanation methods highlight the input tokens that have a considerable impact on the outcome of classifying the document at hand. For example, the Anchor algorithm applies a statistical analysis of the sensitivity of the classifier to changes in the token. Aggregating local explanations over a dataset provides a global explanation of the model. Such aggregation aims to detect words with the most impact, giving valuable insights about the model, like what it has learned in training and which adversarial examples expose its weaknesses. However, standard aggregation methods bear a high computational cost: a naive implementation applies a costly algorithm to each token of each document, and hence, it is infeasible for a typical user working within a short analysis session. We devise techniques for accelerating the global aggregation of the Anchor algorithm. Specifically, our goal is to compute a set of top-k words with the highest global impact according to different aggregation functions. Some of our techniques are lossless and some are lossy. We show that for a very mild loss of quality, we are able to accelerate the computation by up to 30 times, reducing the computation from hours to minutes. We also devise and study a probabilistic model that accounts for noise in the Anchor algorithm and diminishes the bias toward words that are frequent yet low in impact.
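
For context, here is a sketch of the naive (exact) global aggregation that the paper sets out to accelerate: per-document word impacts are summed or averaged and the global top-k is taken. The dictionary-of-impacts input format is an assumption for illustration.

```python
import heapq
from collections import defaultdict

def global_top_k(local_explanations, k=10, agg="sum"):
    """Aggregate per-document word impacts into a global top-k list.

    `local_explanations` is assumed to be a list of {word: impact} dicts,
    one per document, produced by a local explainer such as Anchor.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for doc in local_explanations:
        for word, impact in doc.items():
            totals[word] += impact
            counts[word] += 1
    if agg == "mean":
        totals = {w: totals[w] / counts[w] for w in totals}
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

print(global_top_k([{"refund": 0.9, "slow": 0.4}, {"refund": 0.7, "great": 0.2}], k=2))
```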



Paperid:2085
Authors:Zhaoxi Mu, Xinyu Yang, Sining Sun, Qing Yang
Xi'an Jiaotong University, Xi'an Jiaotong University, Du Xiaoman, Du Xiaoman
Abstract:
Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information. However, in the task of target speech extraction, certain elements of global and local semantic information in the reference speech, which are irrelevant to speaker identity, can lead to speaker confusion within the speech extraction network. To overcome this challenge, we propose a self-supervised disentangled representation learning method. Our approach tackles this issue through a two-phase process, utilizing a reference speech encoding network and a global information disentanglement network to gradually disentangle the speaker identity information from other irrelevant factors. We exclusively employ the disentangled speaker identity information to guide the speech extraction network. Moreover, we introduce the adaptive modulation Transformer to ensure that the acoustic representation of the mixed signal remains undisturbed by the speaker embeddings. This component incorporates speaker embeddings as conditional information, facilitating natural and efficient guidance for the speech extraction network. Experimental results substantiate the effectiveness of our meticulously crafted approach, showcasing a substantial reduction in the likelihood of speaker confusion.



Paperid:2086
Authors:Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Khoi M. Le, Zhiyuan Hu, Cong-Duy Nguyen, See-Kiong Ng, Anh Tuan Luu
National University of Singapore, Nanyang Technological University, CMU, VinAI, National University of Singapore, Nanyang Technological University, National University of Singapore, Nanyang Technological University
Abstract:
Fully fine-tuning pre-trained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such a full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter’s low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability. Second, we propose Partial Video-Language Alignment (PVLA) objective via the use of partial optimal transport to maintain task-related information flowing into our READ modules. We validate our READ-PVLA framework through extensive experiments where READ-PVLA significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks.



Paperid:2087
Authors:Zhijie Nie, Richong Zhang, Zhongyuan Wang, Xudong Liu
SKLSDE, School of Computer Science and Engineering, Beihang University Shen Yuan Honors College, Beihang University, SKLSDE, School of Computer Science and Engineering, Beihang University Zhongguancun Laboratory, Beijing, SKLSDE, School of Computer Science and Engineering, Beihang University, SKLSDE, School of Computer Science and Engineering, Beihang University
Abstract:
Current methods for Knowledge-Based Question Answering (KBQA) usually rely on complex training techniques and model frameworks, leading to many limitations in practical applications. Recently, the emergence of In-Context Learning (ICL) capabilities in Large Language Models (LLMs) provides a simple and training-free semantic parsing paradigm for KBQA: given a small number of questions and their labeled logical forms as demo examples, LLMs can understand the task intent and generate the logic form for a new question. However, current powerful LLMs have little exposure to logic forms during pre-training, resulting in a high format error rate. To solve this problem, we propose a code-style in-context learning method for KBQA, which converts the generation of unfamiliar logical forms into the more familiar code generation process for LLMs. Experimental results on three mainstream datasets show that our method dramatically mitigates the formatting error problem in generating logic forms while achieving a new SOTA on WebQSP, GrailQA, and GraphQ under the few-shot setting. The code and supplementary files are released at https://github.com/Arthurizijar/KB-Coder.
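
A minimal sketch of what a code-style in-context prompt can look like: each demonstration pairs a question with its logical form written as function-call-like code, so the LLM completes familiar code rather than an unfamiliar logical-form syntax. The `JOIN`/`STOP` call names below are illustrative, not the exact API defined by KB-Coder.

```python
def build_code_style_prompt(demos, new_question):
    """Render (question, logical-form-as-code) demos plus the new question."""
    lines = []
    for question, calls in demos:
        lines.append(f"# question: {question}")
        lines.extend(calls)          # logical form expressed as code statements
        lines.append("")
    lines.append(f"# question: {new_question}")
    return "\n".join(lines)

demos = [(
    "Who directed Inception?",
    ['expr = JOIN("film.film.directed_by", "Inception")', "answer = STOP(expr)"],
)]
print(build_code_style_prompt(demos, "Who wrote Dune?"))
```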



Paperid:2088
Authors:Jihong Ouyang, Zhiyao Yang, Silong Liang, Bing Wang, Yimeng Wang, Ximing Li
College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University, China, College of Computer Science and Technology, Jilin University, China, College of Computer Science and Technology, Jilin University, China Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University, China
Abstract:
Aspect-based sentiment analysis (ABSA), a fine-grained sentiment classification task, has received much attention recently. Many works investigate sentiment information through opinion words, such as "good" and "bad". However, implicit sentiment data widely exists in the ABSA dataset, whose sentiment polarity is hard to determine due to the lack of distinct opinion words. To deal with implicit sentiment, this paper proposes an ABSA method that integrates explicit sentiment augmentations (ABSA-ESA) to add more sentiment clues. We propose an ABSA-specific explicit sentiment generation method to create such augmentations. Specifically, we post-train T5 by rule-based data and employ three strategies to constrain the sentiment polarity and aspect term of the generated augmentations. We employ Syntax Distance Weighting and Unlikelihood Contrastive Regularization in the training procedure to guide the model to generate the explicit opinion words with the same polarity as the input sentence. Meanwhile, we utilize the Constrained Beam Search to ensure the augmentations are aspect-related. We test ABSA-ESA on two ABSA benchmarks. The results show that ABSA-ESA outperforms the SOTA baselines on implicit and explicit sentiment accuracy.



Paperid:2089
Authors:Siru Ouyang, Zhuosheng Zhang, Hai Zhao
University of Illinois Urbana-Champaign, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Recent years have witnessed an increasing interest in training machines with reasoning ability, which deeply relies on accurately and clearly presented clue forms. The clues are usually modeled as entity-aware knowledge in existing studies. However, those entity-aware clues are primarily focused on commonsense, making them insufficient for tasks that require knowledge of temporary facts or events, particularly in logical reasoning for reading comprehension. To address this challenge, we are motivated to cover both commonsense and temporary knowledge clues hierarchically. Specifically, we propose a general formalism of knowledge units by extracting backbone constituents of the sentence, such as the subject-verb-object formed "facts". We then construct a supergraph on top of the fact units, allowing for the benefit of sentence-level (relations among fact groups) and entity-level interactions (concepts or actions inside a fact). Experimental results on logical reasoning benchmarks and dialogue modeling datasets show that our approach improves the baselines substantially, and it is general across backbone models. Code is available at https://github.com/ozyyshr/FocalReasoner.



Paperid:2090
Authors:Yu Pan, Ye Yuan, Yichun Yin, Jiaxin Shi, Zenglin Xu, Ming Zhang, Lifeng Shang, Xin Jiang, Qun Liu
Harbin Institute of Technology Shenzhen, Shenzhen, Guangdong, China, School of Computer Science, Peking University, Beijing, China Peking University-Anker Embodied AI Lab, Huawei Noah’s Ark Lab, Shenzhen, Guangdong, China, Cloud BU, Huawei Technologies, Harbin Institute of Technology Shenzhen, Shenzhen, Guangdong, China Pengcheng Laboratory, Shenzhen, China, School of Computer Science, Peking University, Beijing, China Peking University-Anker Embodied AI Lab, Huawei Noah’s Ark Lab, Shenzhen, Guangdong, China, Huawei Noah’s Ark Lab, Shenzhen, Guangdong, China, Huawei Noah’s Ark Lab, Shenzhen, Guangdong, China
Abstract:
The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prepares lessons for expanding operations by learning high-layer functionality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.



Paperid:2091
Authors:Tianshuo Peng, Zuchao Li, Ping Wang, Lefei Zhang, Hai Zhao
Wuhan University, Wuhan University, Wuhan University, Wuhan University, Shanghai Jiao Tong University
Abstract:
Multimodal aspect-based sentiment analysis (MABSA) has recently attracted increasing attention. The span-based extraction methods, such as FSUIE, demonstrate strong performance in sentiment analysis due to their joint modeling of input sequences and target labels. However, previous methods still have certain limitations: (i) They ignore the difference in the focus of visual information between different analysis targets (aspect or sentiment). (ii) Combining features from uni-modal encoders directly may not be sufficient to eliminate the modal gap and can cause difficulties in capturing the image-text pairwise relevance. (iii) Existing span-based methods for MABSA ignore the pairwise relevance of target span boundaries. To tackle these limitations, we propose a novel framework called DQPSA. Specifically, our model contains a Prompt as Dual Query (PDQ) module that uses the prompt as both a visual query and a language query to extract prompt-aware visual information and strengthen the pairwise relevance between visual information and the analysis target. Additionally, we introduce an Energy-based Pairwise Expert (EPE) module that models the boundaries pairing of the analysis target from the perspective of an Energy-based Model. This expert predicts aspect or sentiment span based on pairwise stability. Experiments on three widely used benchmarks demonstrate that DQPSA outperforms previous approaches and achieves a new state-of-the-art performance. The code will be released at https://github.com/pengts/DQPSA.



Paperid:2092
Authors:Ruili Pu, Yang Li, Jun Zhao, Suge Wang, Deyu Li, Jian Liao, Jianxing Zheng
School of Computer and Information Technology, Shanxi University, Taiyuan, China, School of Finance, Shanxi University of Finance and Economics, Taiyuan, China, National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing, China, School of Computer and Information Technology, Shanxi University, Taiyuan, China Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, China, School of Computer and Information Technology, Shanxi University, Taiyuan, China Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, China, School of Computer and Information Technology, Shanxi University, Taiyuan, China, School of Computer and Information Technology, Shanxi University, Taiyuan, China
Abstract:
Event Causality Extraction (ECE) aims to extract the cause-effect event pairs with their structured event information from plain texts. As far as we know, the existing ECE methods mainly focus on the correlation between arguments, without explicitly modeling the causal relationship between events, and usually design two independent frameworks to extract cause events and effect events, respectively, which cannot effectively capture the dependency between the subtasks. Therefore, we propose a joint multi-label extraction framework for ECE to alleviate the above limitations. In particular, 1) we design a heterogeneous-relation-aware graph module to learn the potential relationships between events and arguments, in which we construct the heterogeneous graph by taking the predefined event types and all the words in the sentence as nodes, and modeling three relationships of "event-event", "event-argument" and "argument-argument" as edges. 2) We also design a multi-channel label enhancing module to better learn the distributed representation of each label in the multi-label extraction framework, and further enhance the interaction between the subtasks by considering the preliminary results of cause-effect type identification and event argument extraction. The experimental results on the benchmark dataset ECE-CCKS show that our approach outperforms previous state-of-the-art methods, and that our model also performs well on the complex samples with multiple cause-effect event pairs.



Paperid:2093
Authors:Jingyuan Qi, Minqian Liu, Ying Shen, Zhiyang Xu, Lifu Huang
Virginia Tech, Virginia Tech, Virginia Tech, Virginia Tech, Virginia Tech
Abstract:
Automatically generating scripts (i.e. sequences of key steps described in text) from video demonstrations and reasoning about the subsequent steps are crucial to the modern AI virtual assistants to guide humans to complete everyday tasks, especially unfamiliar ones. However, current methods for generative script learning rely heavily on well-structured preceding steps described in text and/or images or are limited to a certain domain, resulting in a disparity with real-world user scenarios. To address these limitations, we present a new benchmark challenge – MULTISCRIPT, with two new tasks on task-oriented multimodal script learning: (1) multimodal script generation, and (2) subsequent step prediction. For both tasks, the input consists of a target task name and a video illustrating what has been done to complete the target task, and the expected output is (1) a sequence of structured step descriptions in text based on the demonstration video, and (2) a single text description for the subsequent step, respectively. Built from WikiHow, MULTISCRIPT covers multimodal scripts in videos and text descriptions for over 6,655 human everyday tasks across 19 diverse domains. To establish baseline performance on MULTISCRIPT, we propose two knowledge-guided multimodal generative frameworks that incorporate the task-related knowledge prompted from large language models such as Vicuna. Experimental results show that our proposed approaches significantly improve over the competitive baselines.



Paperid:2094
Authors:Zhen Qin, Yiran Zhong, Hui Deng
OpenNLPLab, Shanghai AI Lab TapTap, OpenNLPLab, Shanghai AI Lab, Northwestern Polytechnical University
Abstract:
Length extrapolation has attracted considerable attention recently since it allows transformers to be tested on longer sequences than those used in training. Previous research has shown that this property can be attained by using carefully designed Relative Positional Encodings (RPEs). While these methods perform well on a variety of corpora, the conditions for length extrapolation have yet to be investigated. This paper attempts to determine what types of RPEs allow for length extrapolation through a thorough mathematical and empirical analysis. We discover that a transformer is certain to possess this property as long as the series that corresponds to the RPE's exponential converges. Two practices are derived from the conditions and examined in language modeling tasks on a variety of corpora. As a bonus from the conditions, we derive a new Theoretical Receptive Field (TRF) to measure the receptive field of RPEs without taking any training steps. Extensive experiments are conducted on the Wikitext103, Books, Github, and WikiBook datasets to demonstrate the viability of our discovered conditions. We also compare TRF to Empirical Receptive Field (ERF) across different models, showing consistently matched trends on these datasets. Code is released at: https://github.com/OpenNLPLab/Rpe.
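
A small numerical sketch of the convergence condition discussed above: compare partial sums of exp(bias(d)) over increasing distances for a decaying bias versus a constant one. The specific bias functions and checkpoints are illustrative assumptions, not the paper's formal analysis or its TRF definition.

```python
import numpy as np

def exp_rpe_partial_sums(bias_fn, max_dist=100_000, checkpoints=(10**3, 10**4, 10**5)):
    """Report partial sums of sum_d exp(bias(d)) at a few distance checkpoints."""
    d = np.arange(1, max_dist + 1)
    cumulative = np.cumsum(np.exp(bias_fn(d)))
    return {c: float(cumulative[c - 1]) for c in checkpoints}

alibi_like = lambda d: -0.5 * d                        # linearly decaying bias
no_decay = lambda d: np.zeros_like(d, dtype=float)     # constant (non-decaying) bias
print(exp_rpe_partial_sums(alibi_like))   # partial sums plateau: series converges
print(exp_rpe_partial_sums(no_decay))     # partial sums keep growing with distance
```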



Paperid:2095
Authors:Jesse Roberts, Kyle Moore, Drew Wilenzick, Douglas Fisher
Vanderbilt University, Vanderbilt University, Cornell University, Vanderbilt University
Abstract:
The recent proliferation of research into transformer-based natural language processing has led to a number of studies which attempt to detect the presence of human-like cognitive behavior in the models. We contend that, as is true of human psychology, the investigation of cognitive behavior in language models must be conducted in an appropriate population of an appropriate size for the results to be meaningful. We leverage work in uncertainty estimation in a novel approach to efficiently construct experimental populations. The resultant tool, PopulationLM, has been made open source. We provide theoretical grounding in the uncertainty estimation literature and motivation from current cognitive work regarding language models. We discuss the methodological lessons from other scientific communities and attempt to demonstrate their application to two artificial population studies. Through population-based experimentation we find that language models exhibit behavior consistent with typicality effects among categories highly represented in training. However, we find that language models do not tend to exhibit structural priming effects. Generally, our results show that single models tend to overestimate the presence of cognitive behaviors in neural models.



Paperid:2096
Authors:Jie Ruan, Xiao Pu, Mingqi Gao, Xiaojun Wan, Yuesheng Zhu
Peking University, Peking University, Peking University, Peking University, Peking University
Abstract:
Human evaluation is viewed as a reliable evaluation method for NLG, but it is expensive and time-consuming. To save labor and costs, researchers usually perform human evaluation on a small subset of data sampled from the whole dataset in practice. However, different selected subsets will lead to different rankings of the systems. To give a more correct inter-system ranking and make the gold standard human evaluation more reliable, we propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. CASF operates through a Learner, a Systematic Sampler and a Constrained Controller to select representative samples for getting a more correct inter-system ranking. Experiment results on 137 real NLG evaluation setups with 44 human evaluation metrics across 16 datasets and 5 NLG tasks demonstrate CASF achieves 93.18% top-ranked system recognition accuracy and ranks first or second on 90.91% of the human metrics with 0.83 overall inter-system ranking Kendall correlation. Code and data are publicly available online.



Paperid:2097
Authors:Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, William Yang Wang
Computational Linguistics, Heidelberg University, Germany, University of California, Santa Barbara, University of California, Santa Barbara, University of California, Santa Barbara, Computational Linguistics, Heidelberg University, Germany IWR, Heidelberg University, Germany, University of California, Santa Barbara
Abstract:
Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN) which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, it remains an open problem how to best connect them with an interactive visual environment. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve around 25% relative improvement in task completion over the previous state-of-the-art for two datasets.



Paperid:2098
Authors:Ziyu Shang, Wenjun Ke, Nana Xiu, Peng Wang, Jiajun Liu, Yanhui Li, Zhizhao Luo, Ke Ji
School of Computer Science and Engineering, Southeast University, School of Computer Science and Engineering, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Cyber Science and Engineering, Southeast University, School of Computer Science and Engineering, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University, State Key Laboratory for Novel Software Technology, Nanjing University, Beijing Institute of Computer Technology and Application, School of Computer Science and Engineering, Southeast University
Abstract:
Large language models (LLMs) have demonstrated impressive proficiency in information retrieval, while they are prone to generating incorrect responses that conflict with reality, a phenomenon known as intrinsic hallucination. The critical challenge lies in the unclear and unreliable fact distribution within LLMs trained on vast amounts of data. The prevalent approach frames the factual detection task as a question-answering paradigm, where the LLMs are asked about factual knowledge and examined for correctness. However, existing studies have primarily focused on deriving test cases from only a few specific domains, such as movies and sports, limiting the comprehensive observation of missing knowledge and the analysis of unexpected hallucinations. To address this issue, we propose OntoFact, an adaptive framework for detecting unknown facts of LLMs, devoted to mining the ontology-level skeleton of the missing knowledge. Specifically, we argue that LLMs could expose the ontology-based similarity among missing facts and introduce five representative knowledge graphs (KGs) as benchmarks. We further devise a sophisticated ontology-driven reinforcement learning (ORL) mechanism to produce error-prone test cases with specific entities and relations automatically. The ORL mechanism rewards the KGs for navigating toward a feasible direction for unveiling factual errors. Moreover, empirical efforts demonstrate that dominant LLMs are biased towards answering Yes rather than No, regardless of whether this knowledge is included. To mitigate the overconfidence of LLMs, we leverage a hallucination-free detection (HFD) strategy to tackle unfair comparisons between baselines, thereby boosting the result robustness. Experimental results on 5 datasets, using 32 representative LLMs, reveal a general lack of factual knowledge in current LLMs. Notably, ChatGPT exhibits fact error rates of 51.6% on DBpedia and 64.7% on YAGO, respectively. Additionally, the ORL mechanism demonstrates promising error prediction scores, with F1 scores ranging from 70% to 90% across most LLMs. Compared to the exhaustive testing, ORL achieves an average recall of 80% while reducing evaluation time by 35.29% to 63.12%.



Paperid:2099
Authors:Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang
Northeastern University, Northeastern University, Northeastern University, Northeastern University, Northeastern University, Oracle, Northeastern University, Northeastern University
Abstract:
Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then introduced to boost LLMs' on-device efficiency. Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance, while the activation is still not quantized. On the other hand, mainstream commodity edge devices still struggle to execute these sub-8-bit quantized networks effectively. In this paper, we propose Agile-Quant, an Activation-Guided quantization framework for faster Inference of popular Large Language Models (LLMs) on the Edge. Considering the hardware profiling and activation analysis, we first introduce a basic activation quantization strategy to balance the trade-off of task performance and real inference speed. Then we leverage the activation-aware token pruning technique to reduce the outliers and the adverse impact on attentivity. Ultimately, we utilize the SIMD-based 4-bit multiplier and our efficient TRIP matrix multiplication to implement the accelerator for LLMs on the edge. We apply our framework on different scales of LLMs including LLaMA, OPT, and BLOOM with 4-bit or 8-bit for the activation and 4-bit for the weight quantization. Experiments show that Agile-Quant achieves simultaneous quantization of model weights and activations while maintaining task performance comparable to existing weight-only quantization methods. Moreover, in the 8- and 4-bit scenario, Agile-Quant achieves an on-device speedup of up to 2.55x compared to its FP16 counterparts across multiple edge devices, marking a pioneering advancement in this domain.
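As background for the 4-bit weight setting, the following is a minimal per-tensor symmetric quantization sketch; it is not Agile-Quant's activation-guided scheme, token pruning, or TRIP multiplication, and the bit width, clamping range, and tensor shapes are illustrative assumptions.

import torch

def quantize_symmetric(x: torch.Tensor, n_bits: int = 4):
    # Per-tensor symmetric quantization: x ~= scale * q with q in [-2^(b-1), 2^(b-1)-1].
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Map the integer codes back to floating point for comparison.
    return q.float() * scale

w = torch.randn(256, 256)                 # a stand-in weight matrix
q, s = quantize_symmetric(w, n_bits=4)
w_hat = dequantize(q, s)
print("mean abs error:", (w - w_hat).abs().mean().item())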



Paperid:2100
Authors:Dan Shi, Chaobin You, Jiantao Huang, Taihao Li, Deyi Xiong
Tianjin University, Tianjin University, Zhejiang Lab, Zhejiang Lab, Tianjin University
Abstract:
As an indispensable ingredient of intelligence, commonsense reasoning is crucial for large language models (LLMs) in real-world scenarios. In this paper, we propose CORECODE, a dataset that contains abundant commonsense knowledge manually annotated on dyadic dialogues, to evaluate the commonsense reasoning and commonsense conflict detection capabilities of Chinese LLMs. We categorize commonsense knowledge in everyday conversations into three dimensions: entity, event, and social interaction. For easy and consistent annotation, we standardize the form of commonsense knowledge annotation in open-domain dialogues as "domain: slot = value". A total of 9 domains and 37 slots are defined to capture diverse commonsense knowledge. With these pre-defined domains and slots, we collect 76,787 commonsense knowledge annotations from 19,700 dialogues through crowdsourcing. To evaluate and enhance the commonsense reasoning capability for LLMs on the curated dataset, we establish a series of dialogue-level reasoning and detection tasks, including commonsense knowledge filling, commonsense knowledge generation, commonsense conflict phrase detection, domain identification, slot identification, and event causal inference. A wide variety of existing open-source Chinese LLMs are evaluated with these tasks on our dataset. Experimental results demonstrate that these models are not competent to predict CORECODE's plentiful reasoning content, and even ChatGPT could only achieve 0.275 and 0.084 accuracy on the domain identification and slot identification tasks under the zero-shot setting. We release the data and codes of CORECODE at https://github.com/danshi777/CORECODE to promote commonsense reasoning evaluation and study of LLMs in the context of daily conversations.



Paperid:2101
Authors:Wenkai Shi, Wenbin An, Feng Tian, Yan Chen, Yaqiang Wu, Qianying Wang, Ping Chen
School of Automation Science and Engineering, Xi’an Jiaotong University National Engineering Laboratory for Big Data Analytics, School of Automation Science and Engineering, Xi’an Jiaotong University National Engineering Laboratory for Big Data Analytics, School of Computer Science and Technology, Xi’an Jiaotong University National Engineering Laboratory for Big Data Analytics, School of Computer Science and Technology, Xi’an Jiaotong University, Lenovo Research, Lenovo Research, Department of Engineering, University of Massachusetts Boston
Abstract:
Generalized Category Discovery (GCD) aims to recognize both known and novel categories in an unlabeled dataset by leveraging another labeled dataset with only known categories. Without considering knowledge transfer from known to novel categories, current methods usually perform poorly on novel categories due to the lack of corresponding supervision. To mitigate this issue, we propose a unified Knowledge Transfer Network (KTN), which solves two obstacles to knowledge transfer in GCD. First, the mixture of known and novel categories in unlabeled data makes it difficult to identify transfer candidates (i.e., samples with novel categories). For this, we propose an entropy-based method that leverages knowledge in the pre-trained classifier to differentiate known and novel categories without requiring extra data or parameters. Second, the lack of prior knowledge of novel categories presents challenges in quantifying semantic relationships between categories to decide the transfer weights. For this, we model different categories with prototypes and treat their similarities as transfer weights to measure the semantic similarities between categories. On the basis of these two treatments, we transfer knowledge from known to novel categories by conducting pre-adjustment of logits and post-adjustment of labels for transfer candidates based on the transfer weights between different categories. With the weighted adjustment, KTN can generate more accurate pseudo-labels for unlabeled data, which helps to learn more discriminative features and boost model performance on novel categories. Extensive experiments show that our method outperforms state-of-the-art models on all evaluation metrics across multiple benchmark datasets. Furthermore, different from previous clustering-based methods that can only work offline with abundant data, KTN can be deployed online conveniently with faster inference speed. Code and data are available at https://github.com/yibai-shi/KTN.
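To illustrate the entropy-based identification of transfer candidates, here is a rough sketch (not the authors' implementation): unlabeled samples whose predictive entropy under the known-class classifier is high are flagged as likely novel. The median-threshold heuristic, tensor shapes, and function names are assumptions for the example only.

import torch
import torch.nn.functional as F

def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Shannon entropy of the softmax distribution, one value per sample.
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def split_transfer_candidates(logits: torch.Tensor, threshold=None) -> torch.Tensor:
    # Samples the known-class head is uncertain about are treated as novel-category candidates.
    h = predictive_entropy(logits)
    if threshold is None:
        threshold = h.median()       # simple heuristic split for the sketch
    return h > threshold             # True -> treat as transfer candidate

logits = torch.randn(1000, 10)       # known-class classifier outputs on unlabeled data
novel_mask = split_transfer_candidates(logits)
print("candidate ratio:", novel_mask.float().mean().item())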



Paperid:2102
Authors:Lei Shu, Liangchen Luo, Jayakumar Hoskere, Yun Zhu, Yinxiao Liu, Simon Tong, Jindong Chen, Lei Meng
Google Research, Google Research, Google Research, Google Research, Google Research, Google Research, Google Research, Google Research
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in creative tasks such as storytelling and email generation. However, as LLMs are primarily trained on final text results rather than intermediate revisions, it might be challenging for them to perform text rewriting tasks. Most studies in the rewriting tasks focus on a particular transformation type within the boundaries of single sentences. In this work, we develop new strategies for instruction tuning and reinforcement learning to better align LLMs for cross-sentence rewriting tasks using diverse wording and structures expressed through natural language, including (1) generating rewriting instruction data from Wiki edits and public corpora through instruction generation and chain-of-thought prompting; and (2) collecting comparison data for reward model training through a new ranking function. To facilitate this research, we introduce OpenRewriteEval, a novel benchmark that covers a wide variety of rewriting types expressed through natural language instructions. Our results show significant improvements over a variety of baselines.



Paperid:2103
Authors:Gopendra Vikram Singh, Mauajama Firdaus, Dushyant Singh Chauhan, Asif Ekbal, Pushpak Bhattacharyya
Department of Computer Science and Engineering, Indian Institute of Technology Patna, India, Department of Computing Science, University of Alberta, Canada, Department of Computer Science and Engineering, Indian Institute of Technology Patna, India, Department of Computer Science and Engineering, Indian Institute of Technology Patna, India, Indian Institute of Technology Bombay, India
Abstract:
Sarcasm is a widespread linguistic phenomenon that poses a considerable challenge to explain due to its subjective nature, absence of contextual cues, and rooted personal perspectives. Even though the identification of sarcasm has been extensively studied in dialogue analysis, merely detecting sarcasm falls short of enabling conversational systems to genuinely comprehend the underlying meaning of a conversation and generate fitting responses. It is imperative to not only detect sarcasm but also pinpoint its origination and the rationale behind the sarcastic expressions to capture its authentic essence. In this paper, we delve into the discourse structure of conversations infused with sarcasm and introduce a novel task, Sarcasm Initiation and Reasoning in Conversations (SIRC). Embedded in a multimodal environment and involving a combination of both English and code-mixed interactions, the objective of the task is to discern the trigger or starting point of sarcasm. Additionally, the task involves producing a natural language explanation that rationalizes the satirical dialogues. To this end, we introduce the Sarcasm Initiation and Reasoning Dataset (SIRD) to facilitate our task and provide sarcasm initiation annotations and reasoning. We develop a comprehensive model named Sarcasm Initiation and Reasoning Generation (SIRG), which is designed to encompass textual, audio, and visual representations. To achieve this, we introduce a unique shared fusion method that employs cross-attention mechanisms to seamlessly integrate these diverse modalities. Our experimental outcomes, conducted on the SIRC dataset, demonstrate that our proposed framework establishes a new benchmark for both sarcasm initiation and its reasoning generation in the context of multimodal conversations. The code and dataset can be accessed from https://www.iitp.ac.in/~ai-nlp-ml/resources.html#sarcasm-explain and https://github.com/GussailRaat/SIRG-Sarcasm-Initiation-and-Reasoning-Generation.



Paperid:2104
Authors:Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, Houfeng Wang
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Alibaba Group, Alibaba Group, Alibaba Group, Alibaba Group, Alibaba Group, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Abstract:
Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it encompasses two main drawbacks: (1) RLHF exhibits complexity, instability, and sensitivity to hyperparameters in contrast to supervised fine-tuning (SFT). (2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise contrast, thus lacking contrasts from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by the LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms baseline algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations.
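The iterative "best versus the rest" contrast can be written down compactly. The sketch below is a generic list-wise ranking loss in the spirit of PRO, not the paper's exact objective: at step k the k-th ranked response is contrasted against all lower-ranked ones via a softmax, then removed. The scores, tensor shapes, and normalization are illustrative assumptions.

import torch
import torch.nn.functional as F

def pro_style_ranking_loss(scores: torch.Tensor) -> torch.Tensor:
    # scores: model log-likelihoods of n candidate responses, pre-sorted so that
    # index 0 is the human-preferred best and index n-1 the worst.
    # At step k, the k-th response should beat every response ranked below it.
    n = scores.size(0)
    loss = 0.0
    for k in range(n - 1):
        loss = loss - F.log_softmax(scores[k:], dim=0)[0]
    return loss / (n - 1)

scores = torch.tensor([2.0, 1.2, 0.3, -0.5], requires_grad=True)
loss = pro_style_ranking_loss(scores)
loss.backward()                      # gradients push higher-ranked responses upward
print(loss.item())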



Paperid:2105
Authors:Rui Song, Fausto Giunchiglia, Yingji Li, Mingjie Tian, Hao Xu
School of Artificial Intelligence, Jilin University, Department of Information Engineering and Computer Science, University of Trento School of Artificial Intelligence, Jilin University College of Computer Science and Technology, Jilin University, College of Computer Science and Technology, Jilin University, School of Artificial Intelligence, Jilin University, School of Artificial Intelligence, Jilin University College of Computer Science and Technology, Jilin University Chongqing Research Institute, Jilin University
Abstract:
Cross-domain text classification aims to transfer models from label-rich source domains to label-poor target domains, giving it a wide range of practical applications. Many approaches promote cross-domain generalization by capturing domain-invariant features. However, these methods rely on unlabeled samples provided by the target domains, which renders the model ineffective when the target domain is agnostic. Furthermore, the models are easily disturbed by shortcut learning in the source domain, which also hinders the improvement of domain generalization ability. To solve the aforementioned issues, this paper proposes TACIT, a target domain agnostic feature disentanglement framework which adaptively decouples robust and unrobust features by Variational Auto-Encoders. Additionally, to encourage the separation of unrobust features from robust features, we design a feature distillation task that compels unrobust features to approximate the output of the teacher. The teacher model is trained with a few easy samples that are likely to carry potential unknown shortcuts. Experimental results verify that our framework achieves comparable results to state-of-the-art baselines while utilizing only source domain data.



Paperid:2106
Authors:Shezheng Song, Shan Zhao, ChengYu Wang, Tianwei Yan, Shasha Li, Xiaoguang Mao, Meng Wang
Hefei University of Technology, Hefei University of Technology, National University of Defense Technology, National University of Defense Technology, National University of Defense Technology, National University of Defense Technology, Hefei University of Technology
Abstract:
Multimodal Entity Linking (MEL) aims at linking ambiguous mentions with multimodal information to entities in a Knowledge Graph (KG) such as Wikipedia, which plays a key role in many applications. However, existing methods suffer from shortcomings, including modality impurity such as noise in raw images and ambiguous textual entity representation, which puts obstacles to MEL. We formulate multimodal entity linking as a neural text matching problem where each multimodal information (text and image) is treated as a query, and the model learns the mapping from each query to the relevant entity from candidate entities. This paper introduces a dual-way enhanced (DWE) framework for MEL: (1) our model refines queries with multimodal data and addresses semantic gaps using cross-modal enhancers between text and image information. Besides, DWE innovatively leverages fine-grained image attributes, including facial characteristics and scene features, to enhance and refine visual features. (2) By using Wikipedia descriptions, DWE enriches entity semantics and obtains a more comprehensive textual representation, which reduces the gap between the textual representation and the entities in the KG. Extensive experiments on three public benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance, indicating the superiority of our model. The code is released on https://github.com/season1blue/DWE.



Paperid:2107
Authors:Sihan Song, Furao Shen, Jian Zhao
National Key Laboratory for Novel Software Technology, Nanjing University, China Department of Computer Science and Technology, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Electronic Science and Engineering, Nanjing University, China
Abstract:
Data augmentation has been widely used in low-resource NER tasks to tackle the problem of data sparsity. However, previous data augmentation methods have the disadvantages of disrupted syntactic structures, token-label mismatch, and requirement for external knowledge or manual effort. To address these issues, we propose Robust Prompt-based Data Augmentation (RoPDA) for low-resource NER. Based on pre-trained language models (PLMs) with continuous prompt, RoPDA performs entity augmentation and context augmentation through five fundamental augmentation operations to generate label-flipping and label-preserving examples. To optimize the utilization of the augmented samples, we present two techniques: self-consistency filtering and mixup. The former effectively eliminates low-quality samples with a bidirectional mask, while the latter prevents performance degradation arising from the direct utilization of label-flipping samples. Extensive experiments on three popular benchmarks from different domains demonstrate that RoPDA significantly improves upon strong baselines, and also outperforms state-of-the-art semi-supervised learning methods when unlabeled data is included.
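For readers unfamiliar with mixup, the snippet below shows a standard mixup formulation applied to a pair of sample embeddings and one-hot labels; it is only a generic illustration of why interpolation avoids using a flipped label directly, and the RoPDA variant, embedding dimension, and label space here are assumptions.

import numpy as np

def mixup(emb_a, emb_b, label_a, label_b, alpha=0.4, rng=None):
    # Standard mixup: interpolate inputs and labels with a Beta-distributed weight,
    # so a label-flipping augmented sample never contributes its label at full weight.
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    emb = lam * emb_a + (1.0 - lam) * emb_b
    label = lam * label_a + (1.0 - lam) * label_b
    return emb, label

rng = np.random.default_rng(0)
e_orig, e_aug = rng.normal(size=768), rng.normal(size=768)   # original vs. augmented embedding
y_orig, y_aug = np.eye(5)[1], np.eye(5)[3]                   # one-hot entity-type labels
e_mix, y_mix = mixup(e_orig, e_aug, y_orig, y_aug, rng=rng)
print(y_mix)                                                 # soft label between the two classes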



Paperid:2108
Authors:Weihang Su, Qingyao Ai, Xiangsheng Li, Jia Chen, Yiqun Liu, Xiaolong Wu, Shengluan Hou
Quan Cheng Laboratory & DCST, Tsinghua University & Zhongguancun Laboratory, Beijing, China, DCST, Tsinghua University & Zhongguancun Laboratory, Beijing, China, DCST, Tsinghua University & Zhongguancun Laboratory, Beijing, China, DCST, Tsinghua University & Zhongguancun Laboratory, Beijing, China, DCST, Tsinghua University & Zhongguancun Laboratory, Beijing, China, Huawei Poisson Lab, Huawei Poisson Lab
Abstract:
With the development of deep learning and natural language processing techniques, pretrained language models have been widely used to solve information retrieval (IR) problems. Benefiting from the pre-training and fine-tuning paradigm, these models achieve state-of-the-art performance. In previous works, plain texts in Wikipedia have been widely used in the pre-training stage. However, the rich structured information in Wikipedia, such as the titles, abstracts, hierarchical heading (multi-level title) structure, relationship between articles, references, hyperlink structures, and the writing organizations, has not been fully explored. In this paper, we devise four pre-training objectives tailored for IR tasks based on the structured knowledge of Wikipedia. Compared to existing pre-training methods, our approach can better capture the semantic knowledge in the training corpus by leveraging the human-edited structured data from Wikipedia. Experimental results on multiple IR benchmark datasets show the superior performance of our model in both zero-shot and fine-tuning settings compared to existing strong retrieval baselines. Besides, experimental results in biomedical and legal domains demonstrate that our approach achieves better performance in vertical domains compared to previous models, especially in scenarios where long text similarity matching is needed. The code is available at https://github.com/oneal2000/Wikiformer.



Paperid:2109
Authors:Zhenlin Su, Liyan Xu, Jin Xu, Jiangnan Li, Mingdu Huangfu
South China University of Technology, Tencent Inc., Pazhou Lab, Guangzhou South China University of Technology, Institute of Information Engineering, Chinese Academy of Sciences, South China University of Technology
Abstract:
Identifying speakers of quotations in narratives is an important task in literary analysis, with challenging scenarios including the out-of-domain inference for unseen speakers, and non-explicit cases where there are no speaker mentions in surrounding context. In this work, we propose a simple and effective approach SIG, a generation-based method that verbalizes the task and quotation input based on designed prompt templates, which also enables easy integration of other auxiliary tasks that further bolster the speaker identification performance. The prediction can either come from direct generation by the model, or be determined by the highest generation probability of each speaker candidate. Based on our approach design, SIG supports out-of-domain evaluation, and achieves an open-world classification paradigm that is able to accept any form of candidate input. We perform both cross-domain evaluation and in-domain evaluation on PDNC, the largest dataset of this task, where empirical results suggest that SIG outperforms previous baselines of complicated designs, as well as the zero-shot ChatGPT, especially excelling at those hard non-explicit scenarios by up to 17% improvement. Additional experiments on another dataset WP further corroborate the efficacy of SIG.



Paperid:2110
Authors:Hongda Sun, Hongzhan Lin, Rui Yan
Gaoling School of Artificial Intelligence, Renmin University of China, Gaoling School of Artificial Intelligence, Renmin University of China, Gaoling School of Artificial Intelligence, Renmin University of China Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education
Abstract:
Electronic health records (EHRs) have become the foundation of machine learning applications in healthcare, while the utility of real patient records is often limited by privacy and security concerns. Synthetic EHR generation provides an additional perspective to compensate for this limitation. Most existing methods synthesize new records based on real EHR data, without consideration of different types of events in EHR data, which cannot control the event combinations in line with medical common sense. In this paper, we propose MSIC, a Multi-visit health Status Inference model for Collaborative EHR synthesis to address these limitations. First, we formulate the synthetic EHR generation process as a probabilistic graphical model and tightly connect different types of events by modeling the latent health states. Then, we derive a health state inference method tailored for the multi-visit scenario to effectively utilize previous records to synthesize current and future records. Furthermore, we propose to generate medical reports to add textual descriptions for each medical event, providing broader applications for synthesized EHR data. For generating different paragraphs in each visit, we incorporate a multi-generator deliberation framework to coordinate the message passing of multiple generators and employ a two-phase decoding strategy to generate high-quality reports. Our extensive experiments on the widely used benchmarks, MIMIC-III and MIMIC-IV, demonstrate that MSIC advances state-of-the-art results on the quality of synthetic data while maintaining low privacy risks.



Paperid:2111
Authors:Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, Kai Yu
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Recently, there has been growing interest in using Large Language Models (LLMs) for scientific research. Numerous benchmarks have been proposed to evaluate the ability of LLMs for scientific research. However, current benchmarks are mostly based on pre-collected objective questions. This design suffers from the data leakage problem and lacks the evaluation of subjective Q/A ability. In this paper, we propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues. Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability. In particular, we design a "dynamic" subset based on scientific principles to prevent evaluation from potential data leakage. Both objective and subjective questions are included in SciEval. These characteristics make SciEval a more effective benchmark for scientific research ability evaluation of LLMs. Comprehensive experiments on most advanced LLMs show that, although GPT-4 achieves SOTA performance compared to other LLMs, there is still substantial room for improvement, especially for dynamic questions. The codes and data are publicly available on https://github.com/OpenDFM/SciEval.



Paperid:2112
Authors:Lin Sun, Kai Zhang, Qingyuan Li, Renze Lou
Hangzhou City University, Ohio State University, Zhejiang University, Pennsylvania State University
Abstract:
Multimodal information extraction (MIE) has gained significant attention as the popularity of multimedia content increases. However, current MIE methods often resort to using task-specific model structures, which results in limited generalizability across tasks and underutilizes shared knowledge across MIE tasks. To address these issues, we propose UMIE, a unified multimodal information extractor to unify three MIE tasks as a generation problem using instruction tuning, being able to effectively extract both textual and visual mentions. Extensive experiments show that our single UMIE outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE's strong generalization in the zero-shot setting, robustness to instruction variants, and interpretability. Our research serves as an initial step towards a unified MIE model and initiates the exploration into both instruction tuning and large language models within the MIE domain. Our code, data, and model are available at https://github.com/ZUCC-AI/UMIE.



Paperid:2113
Authors:Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki
NTT Human Informatics Laboratories, NTT Corporation Tohoku University, NTT Human Informatics Laboratories, NTT Corporation, NTT Human Informatics Laboratories, NTT Corporation, NTT Human Informatics Laboratories, NTT Corporation, Tohoku University
Abstract:
We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.



Paperid:2114
Authors:Yijun Tian, Huan Song, Zichen Wang, Haozhu Wang, Ziqing Hu, Fang Wang, Nitesh V. Chawla, Panpan Xu
University of Notre Dame, Amazon, Amazon, Amazon, Amazon, Amazon, University of Notre Dame, Amazon
Abstract:
Large language models (LLMs) have shown remarkable generalization capability with exceptional performance in various language modeling tasks. However, they still exhibit inherent limitations in precisely capturing and returning grounded knowledge. While existing work has explored utilizing knowledge graphs (KGs) to enhance language modeling via joint training and customized model architectures, applying this to LLMs is problematic owing to their large number of parameters and high computational cost. Therefore, how to enhance pretrained LLMs using grounded knowledge, e.g., retrieval-augmented generation, remains an open question. In this work, we propose Graph Neural Prompting (GNP), a novel plug-and-play method to assist pre-trained LLMs in learning beneficial knowledge from KGs. GNP encompasses various designs, including a standard graph neural network encoder, a cross-modality pooling module, a domain projector, and a self-supervised link prediction objective. Extensive experiments on multiple datasets demonstrate the superiority of GNP on both commonsense and biomedical reasoning tasks across different LLM sizes and settings. Code is available at https://github.com/meettyj/GNP.



Paperid:2115
Authors:Geng Tu, Tian Xie, Bin Liang, Hongpeng Wang, Ruifeng Xu
Harbin Institute of Technology, Shenzhen, China Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China The Chinese University of Hong Kong, Harbin Institute of Technology, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies Peng Cheng Laboratory, Shenzhen, China
Abstract:
Multimodal Emotion Recognition in Conversations (ERC) aims to identify the emotions conveyed by each utterance in a conversational video. Current efforts encounter challenges in balancing intra- and inter-speaker context dependencies when tackling intra-modal interactions. This balance is vital as it encompasses modeling self-dependency (emotional inertia) where speakers' own emotions affect them and modeling interpersonal dependencies (empathy) where counterparts' emotions influence a speaker. Furthermore, challenges arise in addressing cross-modal interactions that involve content with conflicting emotions across different modalities. To address this issue, we introduce an adaptive interactive graph network (IGN) called AdaIGN that employs the Gumbel Softmax trick to adaptively select nodes and edges, enhancing intra- and cross-modal interactions. Unlike undirected graphs, we use a directed IGN to prevent future utterances from impacting the current one. Next, we propose Node- and Edge-level Selection Policies (NESP) to guide node and edge selection, along with a Graph-Level Selection Policy (GSP) to integrate the utterance representation from the original IGN and the NESP-enhanced IGN. Moreover, we design a task-specific loss function that prioritizes text modality and intra-speaker context selection. To reduce computational complexity, we use pre-defined pseudo labels through self-supervised methods to mask unnecessary utterance nodes for selection. Experimental results show that AdaIGN outperforms state-of-the-art methods on two popular datasets. Our code will be available at https://github.com/TuGengs/AdaIGN.
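The Gumbel Softmax trick mentioned above makes discrete keep/drop decisions differentiable. The sketch below shows the generic mechanism for edge selection, not AdaIGN's actual selection policies; the logit shapes, temperature, and keep/drop convention are assumptions for the example.

import torch
import torch.nn.functional as F

def select_edges(edge_logits: torch.Tensor, tau: float = 1.0, hard: bool = True) -> torch.Tensor:
    # edge_logits: (num_edges, 2) scores for [keep, drop].
    # Gumbel-Softmax with hard=True gives a one-hot decision in the forward pass
    # while keeping a soft gradient in the backward pass (straight-through estimator).
    y = F.gumbel_softmax(edge_logits, tau=tau, hard=hard)   # (num_edges, 2)
    return y[:, 0]                                           # 1.0 -> keep edge, 0.0 -> drop

edge_logits = torch.randn(12, 2, requires_grad=True)        # learned per-edge scores
keep_mask = select_edges(edge_logits)
print(keep_mask)                                             # differentiable binary mask over edges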



Paperid:2116
Authors:Qizhi Wan, Changxuan Wan, Keli Xiao, Kun Lu, Chenliang Li, Xiping Liu, Dexi Liu
Jiangxi University of Finance and Economics Jiangxi Key Lab of Data and Knowledge Engineering, Jiangxi University of Finance and Economics Jiangxi Key Lab of Data and Knowledge Engineering, Stony Brook University, The University of Oklahoma, Wuhan University, Jiangxi University of Finance and Economics Jiangxi Key Lab of Data and Knowledge Engineering, Jiangxi University of Finance and Economics Jiangxi Key Lab of Data and Knowledge Engineering
Abstract:
Existing models on event detection share three limitations, including (1) insufficient consideration of the structures between dependency relations, (2) limited exploration of the directed-edge semantics, and (3) issues in strengthening the event core arguments. To tackle these problems, we propose a dependency structure-enhanced event detection framework. In addition to the traditional token dependency parsing tree, denoted as TDG, our model considers the dependency edges in it as new nodes and constructs a dependency relation graph (DRG). DRG allows the embedding representations of dependency relations to be updated as nodes rather than edges in a graph neural network. Moreover, the levels of core argument nodes in the two graphs are adjusted by dependency relation types in TDG to enhance their status. Subsequently, the two graphs are further encoded and jointly trained in graph attention networks (GAT). Importantly, we design an interaction strategy of node embedding for the two graphs and refine the attention coefficient computational method to encode the semantic meaning of directed edges. Extensive experiments are conducted to validate the effectiveness of our method, and the results confirm its superiority over the state-of-the-art baselines. Our model outperforms the best benchmark with the F1 score increased by 3.5 and 3.4 percentage points on the ACE2005 English and Chinese corpora.



Paperid:2117
Authors:Chenglong Wang, Hang Zhou, Yimin Hu, Yifu Huo, Bei Li, Tongran Liu, Tong Xiao, Jingbo Zhu
School of Computer Science and Engineering, Northeastern University, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China, CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China NiuTrans Research, Shenyang, China, School of Computer Science and Engineering, Northeastern University, Shenyang, China NiuTrans Research, Shenyang, China
Abstract:
Applying Reinforcement Learning (RL) to sequence generation models enables the direct optimization of long-term rewards (e.g., BLEU and human feedback), but typically requires large-scale sampling over a space of action sequences. This is a computational challenge in practical sequence generation problems, such as machine translation, where we often deal with a large action space (e.g., a vocabulary) and a long action sequence (e.g., a translation). In this work, we introduce two-stage sampling and dynamic sampling approaches to improve the sampling efficiency during training sequence generation models via RL. We experiment with our approaches on the traditional sequence generation tasks, including machine translation and abstractive summarization. Furthermore, we evaluate our approaches in RL from human feedback (RLHF) through training a large language model using the reward model. Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption. Notably, ESRL yields consistent performance gains over the strong REINFORCE, minimum risk training, and proximal policy optimization methods. The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/examples/esrl.



Paperid:2118
Authors:Dingzirui Wang, Longxu Dou, Wenbin Zhang, Junyu Zeng, Wanxiang Che
Harbin Institute of Technology, Harbin Institute of Technology, Yunfu Technology (Beijing) Co., Ltd., Yunfu Technology (Beijing) Co., Ltd., Harbin Institute of Technology
Abstract:
Numerical reasoning is a vital capability for natural language processing models to understand and process numerical information in real-world scenarios. Most current methods first generate the Intermediate Meaning Representations (IMRs) of questions and then generate answers. Current SOTA methods generate programs as IMRs with large language models (LLMs). Intuitively, equations have fewer restrictions and closer semantics to the question than programs, leading to higher generation accuracy. However, current LLMs generate equations worse than programs, where we assume that the equation data is rare in pre-training data compared to programs. So in this paper, we try to use equations as IMRs to solve the numerical reasoning task by addressing two problems: (1) Theoretically, how to prove that the equation is an IMR with higher generation accuracy than programs; (2) Empirically, how to improve the generation accuracy of equations with LLMs. For the first problem, we propose and prove a proposition to theoretically compare the generation accuracy of different IMRs. For the second problem, we present a method called Boosting Numerical Reasoning by Decomposing the Generation of Equations (Bridge), which can improve the accuracy of LLMs in generating equations as IMRs by reducing the tendency of generating constant expressions and programs. Our method improves the performance by 2.2%, 0.9%, and 1.7% on the GSM8K, SVAMP, and Algebra datasets compared to the previous state-of-the-art methods under the single reasoning path setting. Our code and prompts are available at https://github.com/zirui-HIT/Bridge_for_Numerical_Reasoning.
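To make the program-versus-equation distinction concrete, here is a toy contrast on a made-up word problem (not an example from the paper): a program-style IMR prescribes executable steps, while an equation-style IMR only declares relations and defers solving to a solver. The problem, variable names, and use of sympy are illustrative assumptions.

# Toy problem: "A pen costs 3 dollars more than a pencil; together they cost 11 dollars.
# How much does the pen cost?"  (hypothetical example)

from sympy import Eq, solve, symbols

def program_imr() -> float:
    # Program-style IMR: an explicit sequence of arithmetic steps.
    pencil = (11 - 3) / 2
    pen = pencil + 3
    return pen

def equation_imr():
    # Equation-style IMR: state the relations; a solver finds the answer.
    pen, pencil = symbols("pen pencil")
    eqs = [Eq(pen, pencil + 3), Eq(pen + pencil, 11)]
    return solve(eqs, [pen, pencil])[pen]

print(program_imr(), equation_imr())   # 7.0 7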



Paperid:2119
Authors:Haochun Wang, Sendong Zhao, Chi Liu, Nuwa Xi, MuZhen Cai, Bing Qin, Ting Liu
Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology
Abstract:
Prompt-based classification adapts tasks to a cloze question format utilizing the [MASK] token and the filled tokens are then mapped to labels through pre-defined verbalizers. Recent studies have explored the use of verbalizer embeddings to reduce labor in this process. However, all existing studies require a tuning process for either the pre-trained models or additional trainable embeddings. Meanwhile, the distance between high-dimensional verbalizer embeddings should not be measured by Euclidean distance due to the potential for non-linear manifolds in the representation space. In this study, we propose a tuning-free manifold-based space re-embedding method called Locally Linear Embedding with Intra-class Neighborhood Constraint (LLE-INC) for verbalizer embeddings, which preserves local properties within the same class as guidance for classification. Experimental results indicate that even without tuning any parameters, our LLE-INC is on par with automated verbalizers with parameter tuning. And with the parameter updating, our approach further enhances prompt-based tuning by up to 3.2%. Furthermore, experiments with LLaMA-7B and LLaMA-13B indicate that LLE-INC is an efficient tuning-free classification approach for the hyper-scale language models.



Paperid:2120
Authors:Jiaan Wang, JIanfeng Qu, Kexin Wang, Zhixu Li, Wen Hua, Ximing Li, An Liu
Soochow University, Soochow University, Soochow University, Fudan University, The Hong Kong Polytechnic University, Jilin University, Soochow University
Abstract:
Knowledge-grounded dialogue (KGD) learns to generate an informative response based on a given dialogue context and external knowledge (e.g., knowledge graphs; KGs). Recently, the emergence of large language models (LLMs) and pre-training techniques has brought great success to knowledge-grounded dialogue. However, when building KGD systems in real applications, various real-world noises are inevitable. For example, the dialogue context might involve perturbations such as misspellings and abbreviations. In addition, KGs typically suffer from incompleteness and might also contain erroneous and outdated facts. Such real-world noises pose a challenge to the robustness of KGD systems and hinder their applications in the real world. In this paper, we propose an entity-based contrastive learning framework for improving the robustness of KGD. Specifically, we make use of the entity information in a KGD sample to create both its positive and negative samples which involve semantic-irrelevant and semantic-relevant perturbations, respectively. The contrastive learning framework ensures the KGD model is aware of these two types of perturbations, thus could generate informative responses with the potentially noisy inputs in real applications. Experimental results on three widely-used benchmark datasets show that our method achieves new state-of-the-art performance in terms of automatic evaluation scores, verifying its effectiveness and potentiality. Furthermore, we show that our method is able to generate better responses than comparison models in both the noisy and the few-shot settings.



Paperid:2121
Authors:Jiadong Wang, Zexu Pan, Malu Zhang, Robby T. Tan, Haizhou Li
National University of Singapore The Chinese University of Hong Kong, Shenzhen, National University of Singapore, University of Electronic Science and Technology of China, National University of Singapore, The Chinese University of Hong Kong, Shenzhen National University of Singapore
Abstract:
Prior studies on audio-visual speech recognition typically assume the visibility of speaking lips, ignoring the fact that visual occlusion occurs in real-world videos, thus adversely affecting recognition performance. To address this issue, we propose a framework that restores occluded lips in a video by utilizing both the video itself and the corresponding noisy audio. Specifically, the framework aims to achieve these three tasks: detecting occluded frames, masking occluded areas, and reconstructing masked regions. We tackle the first two issues by utilizing the Class Activation Map (CAM) obtained from occluded frame detection to facilitate the masking of occluded areas. Additionally, we introduce a novel synthesis-matching strategy for the reconstruction to ensure the compatibility of audio features with different levels of occlusion. Our framework is evaluated in terms of Word Error Rate (WER) on the original videos, the videos corrupted by concealed lips, and the videos restored using the framework with several existing state-of-the-art audio-visual speech recognition methods. Experimental results substantiate that our framework significantly mitigates performance degradation resulting from lip occlusion. Under -5dB noise conditions, AV-Hubert's WER increases from 10.62% to 13.87% due to lip occlusion, but rebounds to 11.87% in conjunction with the proposed framework. Furthermore, the framework also demonstrates its capacity to produce natural synthesized images in qualitative assessments.



Paperid:2122
Authors:Ke Wang, Xiutian Zhao, Wei Peng
Huawei IT Innovation and Research Center, Huawei IT Innovation and Research Center, Huawei IT Innovation and Research Center
Abstract:
Existing methods for aligning language models with various human needs rely heavily on high-quality and task-specific data. However, industrial deployment of task-specific language models often encounters challenges in the availability of appropriate training samples. Taking meeting summarization as an instance, public datasets are scarce, and private corpora are also hard to obtain due to privacy issues or resource-demanding annotation. To improve meeting summarization in the absence of positively-rated (i.e., ``good'') samples, we propose Score Tuning, a cold start tuning framework that leverages bad samples of distinguishable degrees to incrementally enhance the performance of summary generation without an initial presence of good samples. Our method utilizes asynchronous and numerical human feedback that measures the quality of generated summaries. Formulating data into triplets of (transcript, summary, score), our approach instructs a pre-trained model to learn the association between summary qualities and human-rated scores and hence to generate better summaries corresponding to higher scores. The experiment results show that our method is effective in improving meeting summarization on both English and Chinese corpora while requiring less annotated data and training resources compared to existing alignment methods. Additionally, we also preliminarily explore the transferability of our approach in machine translation tasks and demonstrate its potential for future development and usage in other domains.



Paperid:2123
Authors:Lei Wang, Yi Hu, Jiabang He, Xing Xu, Ning Liu, Hui Liu, Heng Tao Shen
Beijing Forestry University, China Singapore Management University, Singapore, University of Electronic Science and Technology of China, China, University of Electronic Science and Technology of China, China, University of Electronic Science and Technology of China, China, Beijing Forestry University, China, Beijing Rongda Technology Co., Ltd., China, University of Electronic Science and Technology of China, China
Abstract:
Large Language Models (LLMs) have recently demonstrated exceptional performance in various Natural Language Processing (NLP) tasks. They have also shown the ability to perform chain-of-thought (CoT) reasoning to solve complex problems. Recent studies have explored CoT reasoning in complex multimodal scenarios, such as the science question answering task, by fine-tuning multimodal models with high-quality human-annotated CoT rationales. However, collecting high-quality CoT rationales is usually time-consuming and costly. Besides, the annotated rationales are often inaccurate because essential external information is missing. To address these issues, we propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals. The T-SciQ approach generates high-quality CoT rationales as teaching signals and is advanced to train much smaller models to perform CoT reasoning in complex modalities. Additionally, we introduce a novel data mixing strategy to produce more effective teaching data samples for simple and complex science question answering problems. Extensive experimental results show that our T-SciQ method achieves a new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%. Moreover, our approach outperforms the most powerful fine-tuned baseline by 4.5%. The code is publicly available at https://github.com/T-SciQ/T-SciQ.



Paperid:2124
Authors:Shiqi Wang, Yeqin Zhang, Cam-Tu Nguyen
National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University, National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University
Abstract:
In open-domain Question Answering (QA), dense text retrieval is crucial for finding relevant passages to generate answers. Typically, contrastive learning is used to train a retrieval model, which maps passages and queries to the same semantic space, making similar ones closer and dissimilar ones further apart. However, training such a system is challenging due to the false negative problem, where relevant passages may be missed during data annotation. Hard negative sampling, commonly used to improve contrastive learning, can introduce more noise in training. This is because hard negatives are those close to a given query, and thus more likely to be false negatives. To address this, we propose a novel contrastive confidence regularizer for Noise Contrastive Estimation (NCE) loss, a commonly used contrastive loss. Our analysis shows that the regularizer helps make the dense retrieval model more robust against false negatives with a theoretical guarantee. Additionally, we propose a model-agnostic method to filter out noisy negative passages in the dataset, improving any downstream dense retrieval models. Through experiments on three datasets, we demonstrate that our method achieves better retrieval performance in comparison to existing state-of-the-art dense retrieval systems.
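For orientation, the sketch below shows a plain in-batch NCE loss for dense retrieval plus a generic confidence penalty that discourages over-confident score distributions, which is one way false negatives among hard negatives can be made less harmful. It is not the paper's regularizer; the penalty form, temperature, weight, and embedding sizes are assumptions.

import torch
import torch.nn.functional as F

def nce_with_confidence_regularizer(q, p_pos, p_negs, reg_weight=0.1, temp=0.05):
    # q: (d,) query embedding; p_pos: (d,) positive passage; p_negs: (k, d) hard negatives.
    cand = torch.cat([p_pos.unsqueeze(0), p_negs], dim=0)         # (k+1, d), positive at index 0
    scores = F.cosine_similarity(q.unsqueeze(0), cand, dim=-1) / temp
    log_prob = F.log_softmax(scores, dim=0)
    nce = -log_prob[0]                                            # standard NCE term
    neg_entropy = (log_prob.exp() * log_prob).sum()               # confidence penalty (negative entropy)
    return nce + reg_weight * neg_entropy

q = torch.randn(768)
p_pos = torch.randn(768)
p_negs = torch.randn(7, 768)
print(nce_with_confidence_regularizer(q, p_pos, p_negs).item())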



Paperid:2125
Authors:Xinghao Wang, Junliang He, Pengyu Wang, Yunhua Zhou, Tianxiang Sun, Xipeng Qiu
School of Computer Science, Fudan University, School of Computer Science, Fudan University, School of Computer Science, Fudan University, School of Computer Science, Fudan University, School of Computer Science, Fudan University, School of Computer Science, Fudan University
Abstract:
Contrastive-learning-based methods have dominated sentence representation learning. These methods regularize the representation space by pulling similar sentence representations closer and pushing away the dissimilar ones and have been proven effective in various NLP tasks, e.g., semantic textual similarity (STS) tasks. However, it is challenging for these methods to learn fine-grained semantics as they only learn from the inter-sentence perspective, i.e., their supervision signal comes from the relationship between data samples. In this work, we propose a novel denoising objective that inherits from another perspective, i.e., the intra-sentence perspective. By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form. Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks, standing up well in comparison to contrastive-learning-based methods. Notably, the proposed intra-sentence denoising objective complements existing inter-sentence contrastive methodologies and can be integrated with them to further enhance performance. Our code is available at https://github.com/xinghaow99/DenoSent.
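As a small illustration of the two noise types the abstract distinguishes, the sketch below applies discrete noise (token deletion) to an input sentence and continuous noise (Gaussian perturbation) to its token embeddings. The drop probability, noise scale, and function names are assumptions, not the paper's settings.

import random
import numpy as np

def discrete_noise(tokens, drop_prob=0.15, seed=0):
    # Discrete noise: randomly delete tokens from the input sentence.
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > drop_prob]
    return kept or tokens[:1]          # never return an empty sentence

def continuous_noise(embeddings, sigma=0.1, seed=0):
    # Continuous noise: add Gaussian perturbations to token embeddings.
    rng = np.random.default_rng(seed)
    return embeddings + sigma * rng.normal(size=embeddings.shape)

tokens = "the model restores noisy sentences to their original form".split()
print(discrete_noise(tokens))
emb = np.random.randn(len(tokens), 768)
print(continuous_noise(emb).shape)     # perturbed embeddings, same shape as the input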



Paperid:2126
Authors:Yan Wang, Zhixuan Chu, Xin Ouyang, Simeng Wang, Hongyan Hao, Yue Shen, Jinjie Gu, Siqiao Xue, James Zhang, Qing Cui, Longfei Li, Jun Zhou, Sheng Li
Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, Ant Group, University of Virginia
Abstract:
Recommendation systems aim to provide users with relevant suggestions, but often lack interpretability and fail to capture higher-level semantic relationships between user behaviors and profiles. In this paper, we propose a novel approach that leverages large language models (LLMs) to construct personalized reasoning graphs. These graphs link a user's profile and behavioral sequences through causal and logical inferences, representing the user's interests in an interpretable way. Our approach, LLM reasoning graphs (LLMRG), has four components: chained graph reasoning, divergent extension, self-verification and scoring, and knowledge base self-improvement. The resulting reasoning graph is encoded using graph neural networks, which serves as additional input to improve conventional recommender systems, without requiring extra user or item information. Our approach demonstrates how LLMs can enable more logical and interpretable recommender systems through personalized reasoning graphs. LLMRG allows recommendations to benefit from both engineered recommendation systems and LLM-derived reasoning graphs. We demonstrate the effectiveness of LLMRG on benchmarks and real-world scenarios in enhancing base recommendation models.



Paperid:2127
Authors:Ye Wang, Huazheng Pan, Tao Zhang, Wen Wu, Wenxin Hu
East China Normal University, East China Normal University, Tsinghua University, East China Normal University, East China Normal University
Abstract:
The goal of document-level relation extraction (RE) is to identify relations between entities that span multiple sentences. Recently, incomplete labeling in document-level RE has received increasing attention, and some studies have used methods such as positive-unlabeled learning to tackle this issue, but there is still a lot of room for improvement. Motivated by this, we propose a positive-augmentation and positive-mixup positive-unlabeled metric learning framework (P3M). Specifically, we formulate document-level RE as a metric learning problem. We aim to pull entity pair embeddings closer to their corresponding relation embeddings, while pushing them farther away from the none-class relation embedding. Additionally, we adapt positive-unlabeled learning to this loss objective. In order to improve the generalizability of the model, we use dropout to augment positive samples and propose a positive-none-class mixup method. Extensive experiments show that P3M improves the F1 score by approximately 4-10 points in document-level RE with incomplete labeling, and achieves state-of-the-art results in fully labeled scenarios. Furthermore, P3M has also demonstrated robustness to prior estimation bias in incompletely labeled scenarios.



Paperid:2128
Authors:Yu Wang, Nedim Lipka, Ryan A. Rossi, Alexa Siu, Ruiyi Zhang, Tyler Derr
Vanderbilt University, Adobe Research, Adobe Research, Adobe Research, Adobe Research, Vanderbilt University
Abstract:
The "pre-train, prompt, predict" paradigm of large language models (LLMs) has achieved remarkable success in open-domain question answering (OD-QA). However, few works explore this paradigm in multi-document question answering (MD-QA), a task demanding a thorough understanding of the logical associations among the contents and structures of documents. To fill this crucial gap, we propose a Knowledge Graph Prompting (KGP) method to formulate the right context in prompting LLMs for MD-QA, which consists of a graph construction module and a graph traversal module. For graph construction, we create a knowledge graph (KG) over multiple documents with nodes symbolizing passages or document structures (e.g., pages/tables), and edges denoting the semantic/lexical similarity between passages or document structural relations. For graph traversal, we design an LLM-based graph traversal agent that navigates across nodes and gathers supporting passages assisting LLMs in MD-QA. The constructed graph serves as the global ruler that regulates the transitional space among passages and reduces retrieval latency. Concurrently, the graph traversal agent acts as a local navigator that gathers pertinent context to progressively approach the question and guarantee retrieval quality. Extensive experiments underscore the efficacy of KGP for MD-QA, signifying the potential of leveraging graphs in enhancing the prompt design and retrieval augmented generation for LLMs. Our code: https://github.com/YuWVandy/KG-LLM-MDQA.



Paperid:2129
Authors:Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao
Wangxuan Institute of Computer Technology, Peking University, Beijing Institute for General Artificial Intelligence National Key Laboratory of General Artificial Intelligence, School of Economics, Peking University, Wangxuan Institute of Computer Technology, Peking University National Key Laboratory of General Artificial Intelligence
Abstract:
Recently, we have witnessed the rapid development of video question answering models. However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos. To tackle this problem, we propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering. STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks, and a set of lightweight neural modules to complete each of these sub-tasks. Though neural module networks are already widely studied on image-text tasks, applying them to videos is a non-trivial task, as reasoning on videos requires different abilities. In this paper, we define a set of basic video-text sub-tasks for video question answering and design a set of lightweight modules to complete them. Different from most prior works, modules of STAIR return intermediate outputs specific to their intentions instead of always returning attention maps, which makes it easier to interpret and collaborate with pre-trained models. We also introduce intermediate supervision to make these intermediate outputs more accurate. We conduct extensive experiments on several video question answering datasets under various settings to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available. Code: https://github.com/yellow-binary-tree/STAIR



Paperid:2130
Authors:Kaiwen Wei, Runyan Du, Li Jin, Jian Liu, Jianhua Yin, Linhao Zhang, Jintao Liu, Nayu Liu, Jingyuan Zhang, Zhi Guo
College of Computer Science, Chongqing University, Chongqing, China, University of Chinese Academy of Sciences, Beijing, China, University of Chinese Academy of Sciences, Beijing, China, Beijing Jiaotong University, Beijing, China, School of Computer Science and Technology, Shandong University, Qingdao, China, University of Chinese Academy of Sciences, Beijing, China, University of Chinese Academy of Sciences, Beijing, China, School of Computer Science and Technology, Tiangong University, Tianjin, China, Kuaishou Technology Inc., Beijing, China, University of Chinese Academy of Sciences, Beijing, China
Abstract:
Video event extraction (VEE) aims to extract key events and generate the event arguments for their semantic roles from the video. Although promising results have been achieved by existing methods, they still lack an elaborate learning strategy to adequately consider: (1) inter-object interaction, which reflects the relation between objects; (2) inter-modality interaction, which aligns the features from text and video modality. In this paper, we propose a Multi-view Interaction with knowledge Distillation (MID) framework to solve the above problems with the Knowledge Distillation (KD) mechanism. Specifically, we propose the self-Relational KD (self-RKD) to enhance the inter-object interaction, where the relation between objects is measured by distance metric, and the high-level relational knowledge from the deeper layer is taken as the guidance for boosting the shallow layer in the video encoder. Meanwhile, to improve the inter-modality interaction, the Layer-to-layer KD (LKD) is proposed, which integrates additional cross-modal supervisions (i.e., the results of cross-attention) with the textual supervising signal for training each transformer decoder layer. Extensive experiments show that without any additional parameters, MID achieves the state-of-the-art performance compared to other strong methods in VEE.



Paperid:2131
Authors:Chenxiao Wu, Wenjun Ke, Peng Wang, Zhizhao Luo, Guozheng Li, Wanyi Chen
School of Computer Science and Engineering, Southeast University, School of Computer Science and Engineering, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, School of Computer Science and Engineering, Southeast University School of Cyber Science and Engineering, Southeast University Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China, Beijing Institute of Computer Technology and Application, School of Computer Science and Engineering, Southeast University, School of Cyber Science and Engineering, Southeast University
Abstract:
Named entity recognition (NER) aims to identify and classify specific entities mentioned in textual sentences. Most existing superior NER models employ the standard fully supervised paradigm, which requires a large amount of annotated data during training. In order to maintain performance with insufficient annotation resources (i.e., low resources), in-context learning (ICL) has drawn a lot of attention, due to its plug-and-play nature compared to other methods (e.g., meta-learning and prompt learning). In this manner, how to retrieve highly correlated demonstrations for target sentences serves as the key to emerging ICL ability. For the NER task, the correlation implies the consistency of both ontology (i.e., generalized entity type) and context (i.e., sentence semantic), which is ignored by previous NER demonstration retrieval techniques. To address this issue, we propose ConsistNER, a novel three-stage framework that incorporates ontological and contextual information for low-resource NER. Firstly, ConsistNER employs large language models (LLMs) to pre-recognize potential entities in a zero-shot manner. Secondly, ConsistNER retrieves the sentence-specific demonstrations for each target sentence based on the two following considerations: (1) Regarding ontological consistency, demonstrations are filtered into a candidate set based on ontology distribution. (2) Regarding contextual consistency, an entity-aware self-attention mechanism is introduced to focus more on the potential entities and semantic-correlated tokens. Finally, ConsistNER feeds the retrieved demonstrations for all target sentences into LLMs for prediction. We conduct experiments on four widely-adopted NER datasets, including both general and specific domains. Experimental results show that ConsistNER achieves a 6.01%-26.37% and 3.07%-21.18% improvement over the state-of-the-art baselines on Micro-F1 scores under 1- and 5-shot settings, respectively.



Paperid:2132
Authors:Mingmin Wu, Yuxue Hu, Yongcheng Zhang, Zeng Zhi, Guixin Su, Ying Sha
College of Informatics, Huazhong Agricultural University, Wuhan, China Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, College of Informatics, Huazhong Agricultural University, Wuhan, China Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, College of Informatics, Huazhong Agricultural University, Wuhan, China, School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China, College of Informatics, Huazhong Agricultural University, Wuhan, China, College of Informatics, Huazhong Agricultural University, Wuhan, China Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education
Abstract:
Chinese idioms pose a significant challenge for machine reading comprehension due to their metaphorical meanings often diverging from their literal counterparts, leading to metaphorical inconsistency. Furthermore, the same idiom can have different meanings in different contexts, resulting in contextual inconsistency. Although deep learning-based methods have achieved some success in idiom reading comprehension, existing approaches still struggle to accurately capture idiom representations due to metaphorical inconsistency and contextual inconsistency of idioms. To address these challenges, we propose a novel model, Multi-Semantic Contrastive Learning Method (MSCLM), which simultaneously addresses metaphorical inconsistency and contextual inconsistency of idioms. To mitigate metaphorical inconsistency, we propose a metaphor contrastive learning module based on the prompt method, bridging the semantic gap between literal and metaphorical meanings of idioms. To mitigate contextual inconsistency, we propose a multi-semantic cross-attention module to explore semantic features between different metaphors of the same idiom in various contexts. Our model has been compared with multiple current latest models (including GPT-3.5) on multiple Chinese idiom reading comprehension datasets, and the experimental results demonstrate that MSCLM outperforms state-of-the-art models.



Paperid:2133
Authors:Sixing Wu, Jiong Yu, Jiahao Chen, Xiaofan Deng, Wei Zhou
National Pilot School of Software, Yunnan University, Kunming, China Engineering Research Center of Cyberspace, Yunnan University, Kunming, China, National Pilot School of Software, Yunnan University, Kunming, China Engineering Research Center of Cyberspace, Yunnan University, Kunming, China, National Pilot School of Software, Yunnan University, Kunming, China Engineering Research Center of Cyberspace, Yunnan University, Kunming, China, National Pilot School of Software, Yunnan University, Kunming, China Engineering Research Center of Cyberspace, Yunnan University, Kunming, China, National Pilot School of Software, Yunnan University, Kunming, China Engineering Research Center of Cyberspace, Yunnan University, Kunming, China
Abstract:
Knowledge-grounded Dialogue Response Generation (KRG) can facilitate informative and faithful dialogues using external knowledge. Prior monolingual works can only use the knowledge of the corresponding native language. Thus, due to the prohibitive costs of collecting and constructing external knowledge bases, the limited scale of accessible external knowledge always constrains the ability of KRG, especially in low-resource language scenarios. To this end, we propose a new task, Multi-Source Multilingual Knowledge-Grounded Response Generation (MMKRG), which simultaneously uses multiple knowledge sources of different languages. We notice that simply combining knowledge of different languages is inefficient due to the Cross-Conflict issue and Cross-Repetition issue. Thus, we propose a novel approach MMK-BART, which uses a simple but elegant Estimate-Cluster-Penalize mechanism to overcome the mentioned issues and adopts the multilingual language model mBART as the backbone. Meanwhile, based on the recent multilingual corpus XDailyDialog, we propose an MMKRG dataset MMK-DailyDialog, which has been aligned to the large-scale multilingual commonsense knowledge base ConceptNet and supports four languages (English, Chinese, German, and Italian). Extensive experiments have verified the effectiveness of our dataset and approach in monolingual, cross-lingual, and multilingual scenarios.



Paperid:2134
Authors:Xiaobao Wu, Fengjun Pan, Thong Nguyen, Yichao Feng, Chaoqun Liu, Cong-Duy Nguyen, Anh Tuan Luu
Nanyang Technological University, Singapore, Nanyang Technological University, Singapore, National University of Singapore, Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore DAMO Academy, Alibaba Group, Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore
Abstract:
Hierarchical topic modeling aims to discover latent topics from a corpus and organize them into a hierarchy to understand documents with desirable semantic granularity. However, existing work struggles with producing topic hierarchies of low affinity, rationality, and diversity, which hampers document understanding. To overcome these challenges, we in this paper propose Transport Plan and Context-aware Hierarchical Topic Model (TraCo). Instead of early simple topic dependencies, we propose a transport plan dependency method. It constrains dependencies to ensure their sparsity and balance, and also regularizes topic hierarchy building with them. This improves affinity and diversity of hierarchies. We further propose a context-aware disentangled decoder. Rather than previously entangled decoding, it distributes different semantic granularity to topics at different levels by disentangled decoding. This facilitates the rationality of hierarchies. Experiments on benchmark datasets demonstrate that our method surpasses state-of-the-art baselines, effectively improving the affinity, rationality, and diversity of hierarchical topic modeling with better performance on downstream tasks.



Paperid:2135
Authors:Yangyu Wu, Xu Han, Wei Song, Miaomiao Cheng, Fei Li
Capital Normal University, Capital Normal University, Capital Normal University, Capital Normal Universty, Wuhan University
Abstract:
Large language models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, they still face significant challenges in automated reasoning, particularly in scenarios involving multi-step reasoning. In this paper, we focus on the logical reasoning problem. The main task is to answer a question based on a set of available facts and rules. A lot of work has focused on guiding LLMs to think logically by generating reasoning paths, ignoring the structure among available facts. In this paper, we propose a simple approach MindMap by introducing evidence chains for supporting reasoning. An evidence chain refers to a set of facts that involve the same subject. In this way, we can organize related facts together to avoid missing important information. MindMap can be integrated with existing reasoning frameworks, such as Chain-of-Thought (CoT) and Selection-Inference (SI), by letting the model select relevant evidence chains instead of independent facts. The experimental results on the bAbI and ProofWriterOWA datasets demonstrate the effectiveness of MindMap. It can significantly improve CoT and SI, especially in multi-step reasoning tasks.



Paperid:2136
Authors:Yiquan Wu, Yifei Liu, Ziyu Zhao, Weiming Lu, Yating Zhang, Changlong Sun, Fei Wu, Kun Kuang
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Alibaba Group, Alibaba Group, Zhejiang University, Zhejiang University
Abstract:
In text classification models, while the unsupervised attention mechanism can enhance performance, it often produces attention distributions that are puzzling to humans, such as assigning high weight to seemingly insignificant conjunctions. Recently, numerous studies have explored Attention Supervision (AS) to guide the model toward more interpretable attention distributions. However, such AS can impact classification performance, especially in specialized domains. In this paper, we address this issue from a causality perspective. Firstly, we leverage the causal graph to reveal two biases in the AS: 1) Bias caused by the label distribution of the dataset. 2) Bias caused by the words' different occurrence ranges, i.e., some words occur across labels while others occur only under a particular label. We then propose a novel Debiased Attention Supervision (DAS) method to eliminate these biases with causal techniques. Specifically, we adopt backdoor adjustment on the label-caused bias and reduce the word-caused bias by subtracting the direct causal effect of the word. Through extensive experiments on two professional text classification datasets (e.g., medicine and law), we demonstrate that our method achieves improved classification accuracy along with more coherent attention distributions.



Paperid:2137
Authors:Zhenyu Wu, Meng Jiang, Chao Shen
Xi'an Jiaotong University, University of Notre Dame, Xi'an Jiaotong University
Abstract:
Chain-of-Thought (CoT) prompting methods have enabled large language models (LLMs) to generate reasoning paths and solve math word problems (MWPs). However, they are sensitive to mistakes in the paths, as any mistake can result in an incorrect answer. We propose a novel method named Progressive Rectification Prompting (PRP) to improve average accuracy on eight MWP datasets from 77.3 to 90.5. Given an initial answer from CoT, PRP iterates a verify-then-rectify process to progressively identify incorrect answers and rectify the reasoning paths. With the most likely correct answer, the LLM predicts a masked numerical value in the question; if the prediction does not match the masked value, the answer is likely incorrect. Then the LLM is prompted to re-generate the reasoning path hinted with a set of incorrect answers to prevent itself from repeating previous mistakes. PRP achieves the best performance compared against the CoT methods. Our implementation is made publicly available at https://wzy6642.github.io/prp.github.io/.
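A minimal sketch of a verify-then-rectify loop in the spirit of PRP. The llm callable, the prompt wording, and the rule of masking the first number are hypothetical placeholders, not the authors' implementation.

import re

def progressive_rectification(question, llm, max_rounds=3):
    # llm: hypothetical callable mapping a prompt string to a text completion.
    wrong_answers = []
    answer = llm(f"Q: {question}\nLet's think step by step, then give a number.")
    for _ in range(max_rounds):
        numbers = re.findall(r"\d+", question)
        if not numbers:
            break
        # Verification: mask one number and ask the model to recover it given
        # the candidate answer; a mismatch flags a likely incorrect answer.
        masked = question.replace(numbers[0], "[MASK]", 1)
        recovered = llm(f"{masked}\nThe answer is {answer}. What is [MASK]?")
        if numbers[0] in recovered:
            break
        wrong_answers.append(answer)
        answer = llm(
            f"Q: {question}\nPrevious incorrect answers: {', '.join(wrong_answers)}.\n"
            "Re-derive the reasoning and give a new number."
        )
    return answer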



Paperid:2138
Authors:Min Xiao, Junnan Zhu, Feifei Zhai, Yu Zhou, Chengqing Zong
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Abstract:
Existing multimodal summarization approaches focus on fusing image features in the encoding process, ignoring the individualized needs for images when generating different summaries. However, whether intuitively or empirically, not all images can improve summary quality. Therefore, we propose a novel Dynamic Image Utilization framework for multimodal Summarization (DIUSum) to select and utilize valuable images for summarization. First, to predict whether an image helps produce a high-quality summary, we propose an image selector to score the usefulness of each image. Second, to dynamically utilize the multimodal information, we incorporate the hard and soft guidance from the image selector. Under the guidance, the image information is plugged into the decoder to generate a summary. Experimental results have shown that DIUSum outperforms multiple strong baselines and achieves SOTA on two public multimodal summarization datasets. Further analysis demonstrates that the image selector can reflect the improved level of summary quality brought by the images.



Paperid:2139
Authors:Jiayuan Xie, Zhiping Zhou, Zihan Wu, Xinting Zhang, Jiexin Wang, Yi Cai, Qing Li
Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China, School of Software Engineering, South China University of Technology, Guangzhou, China, School of Software Engineering, South China University of Technology, Guangzhou, China, Department of Mathematics, The University of Hong Kong, Hong Kong SAR, China, School of Software Engineering, South China University of Technology, Guangzhou, China Key Laboratory of Big Data and Intelligent Robot (SCUT), MOE of China, School of Software Engineering, South China University of Technology, Guangzhou, China Key Laboratory of Big Data and Intelligent Robot (SCUT), MOE of China, Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China
Abstract:
Defect detection is a pivotal aspect ensuring product quality and production efficiency in industrial manufacturing. Existing studies on defect detection predominantly focus on locating defects through bounding boxes and classifying defect types. However, their methods can only provide limited information and fail to meet the requirements for further processing after detecting defects. To this end, we propose a novel task called defect detection report generation, which aims to provide more comprehensive and informative insights into detected defects in the form of text reports. For this task, we propose new datasets covering 16 different materials, in which each defect is paired with a detailed human-constructed report. In addition, we propose a knowledge-aware report generation model as a baseline for future research, which aims to incorporate additional knowledge to generate detailed analysis and subsequent processing related to defects in images. By constructing defect report datasets and proposing corresponding baselines, we chart new directions for future research and practical applications of this task.



Paperid:2140
Authors:Eric Xing, Saranya Venkatraman, Thai Le, Dongwon Lee
Washington University in St. Louis, The Pennsylvania State University, University of Mississippi, The Pennsylvania State University
Abstract:
Authorship Attribution (AA) and Authorship Obfuscation (AO) are two competing tasks of increasing importance in privacy research. Modern AA leverages an author's consistent writing style to match a text to its author using an AA classifier. AO is the corresponding adversarial task, aiming to modify a text in such a way that its semantics are preserved, yet an AA model cannot correctly infer its authorship. To address privacy concerns raised by state-of-the-art (SOTA) AA methods, new AO methods have been proposed but remain largely impractical to use due to their prohibitively slow training and obfuscation speed, often taking hours. To this challenge, we propose a practical AO method, ALISON, that (1) dramatically reduces training/obfuscation time, demonstrating more than 10x faster obfuscation than SOTA AO methods, (2) achieves better obfuscation success through attacking three transformer-based AA methods on two benchmark datasets, typically performing 15% better than competing methods, (3) does not require direct signals from a target AA classifier during obfuscation, and (4) utilizes unique stylometric features, allowing sound model interpretation for explainable obfuscation. We also demonstrate that ALISON can effectively prevent four SOTA AA methods from accurately determining the authorship of ChatGPT-generated texts, all while minimally changing the original text semantics. To ensure the reproducibility of our findings, our code and data are available at: https://github.com/EricX003/ALISON.



Paperid:2141
Authors:Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu, Shi-Xiong Zhang, Guangzhi Li, Yi Luo, Rongzhi Gu
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, Tencent AI Lab, Tencent AI Lab, Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, Shenzhen International Graduate School, Tsinghua University, Shenzhen, China The Chinese University of Hong Kong, Hong Kong SAR, China, Tencent AI Lab, Tencent AI Lab, Tencent AI Lab, Tencent AI Lab
Abstract:
Speech emotions are crucial in human communication and are extensively used in fields like speech synthesis and natural language understanding. Most prior studies, such as speech emotion recognition, have categorized speech emotions into a fixed set of classes. Yet, emotions expressed in human speech are often complex, and categorizing them into predefined groups can be insufficient to adequately represent speech emotions. On the contrary, describing speech emotions directly by means of natural language may be a more effective approach. Regrettably, there are not many studies available that have focused on this direction. Therefore, this paper proposes a speech emotion captioning framework named SECap, aiming at effectively describing speech emotions using natural language. Owing to the impressive capabilities of large language models in language comprehension and text generation, SECap employs LLaMA as the text decoder to allow the production of coherent speech emotion captions. In addition, SECap leverages HuBERT as the audio encoder to extract general speech features and Q-Former as the Bridge-Net to provide LLaMA with emotion-related speech features. To accomplish this, Q-Former utilizes mutual information learning to disentangle emotion-related speech features and speech contents, while implementing contrastive learning to extract more emotion-related speech features. The results of objective and subjective evaluations demonstrate that: 1) the SECap framework outperforms the HTSAT-BART baseline in all objective evaluations; 2) SECap can generate high-quality speech emotion captions that attain performance on par with human annotators in subjective mean opinion score tests.



Paperid:2142
Authors:Chao Xue, Di Liang, Pengfei Wang, Jing Zhang
Beihang University, Fudan University, Zhejiang University, Beihang University
Abstract:
Many models that leverage knowledge graphs (KGs) have recently demonstrated remarkable success in question answering (QA) tasks. In the real world, many facts contained in KGs are time-constrained, thus temporal KGQA has received increasing attention. Despite the fruitful efforts of previous models in temporal KGQA, they still have several limitations. (I) They adopt pre-trained language models (PLMs) to obtain question representations, while PLMs tend to focus on entity information and ignore entity transfer caused by temporal constraints, and finally fail to learn specific temporal representations of entities. (II) They neither emphasize the graph structure between entities nor explicitly model the multi-hop relationship in the graph, which will make it difficult to solve complex multi-hop question answering. To alleviate this problem, we propose a novel Question Calibration and Multi-Hop Modeling (QC-MHM) network. Specifically, we first calibrate the question representation by fusing the question and the time-constrained concepts in KG. Then, we construct the GNN layer to complete multi-hop message passing. Finally, the question representation is combined with the embedding output by the GNN to generate the final prediction. Empirical results verify that the proposed model achieves better performance than the state-of-the-art models in the benchmark dataset. Notably, the Hits@1 and Hits@10 results of QC-MHM on the CronQuestions dataset's complex questions are absolutely improved by 5.1% and 1.2% compared to the best-performing baseline. Moreover, QC-MHM can generate interpretable and trustworthy predictions.



Paperid:2143
Authors:Xiaojun Xue, Chunxia Zhang, Tianxiang Xu, Zhendong Niu
Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology, Beijing Institute of Technology
Abstract:
Few-shot named entity recognition (NER) aims to recognize novel named entities in low-resource domains utilizing existing knowledge. However, the present few-shot NER models assume that the labeled data are all clean without noise or outliers, and there are few works focusing on the robustness of the cross-domain transfer learning ability to textual adversarial attacks in few-shot NER. In this work, we comprehensively explore and assess the robustness of few-shot NER models under the textual adversarial attack scenario, and reveal the vulnerability of existing few-shot NER models. Furthermore, we propose a robust two-stage few-shot NER method with Boundary Discrimination and Correlation Purification (BDCP). Specifically, in the span detection stage, the entity boundary discriminative module is introduced to provide a highly distinguishing boundary representation space to detect entity spans. In the entity typing stage, the correlations between entities and contexts are purified by minimizing the interference information and facilitating correlation generalization to alleviate the perturbations caused by textual adversarial attacks. In addition, we construct adversarial examples for few-shot NER based on public datasets Few-NERD and Cross-Dataset. Comprehensive evaluations on those two groups of few-shot NER datasets containing adversarial examples demonstrate the robustness and superiority of the proposed method.



Paperid:2144
Authors:Diji Yang, Kezhen Chen, Jinmeng Rao, Xiaoyuan Guo, Yawen Zhang, Jie Yang, Yi Zhang
University of California, Santa Cruz, Mineral.ai, Mineral.ai, Mineral.ai, Mineral.ai, Mineral.ai, University of California, Santa Cruz
Abstract:
Visual language tasks such as Visual Question Answering (VQA) or Visual Entailment (VE) require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are firstly converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach provides light training costs and interpretability but is hard to optimize in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating Inner Monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. More specifically, we enable LLMs and VLMs to interact through natural language conversation (i.e., Inner Monologue) and propose to use a two-stage training process to learn how to do Inner Monologue (self-asking questions and answering questions). IMMO is evaluated on two popular tasks and achieves competitive performance with less training data when compared with state-of-the-art models while concurrently preserving interpretability. The results suggest that by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, broadening its potential applications across various AI challenges beyond vision and language tasks.



Paperid:2145
Authors:Saelyne Yang, Sunghyun Park, Yunseok Jang, Moontae Lee
KAIST, LG AI Research, University of Michigan, LG AI Research University of Illinois Chicago
Abstract:
Instructional videos provide detailed how-to guides for various tasks, with viewers often posing questions regarding the content. Addressing these questions is vital for comprehending the content, yet receiving immediate answers is difficult. While numerous computational models have been developed for Video Question Answering (Video QA) tasks, they are primarily trained on questions generated based on video content, aiming to produce answers from within the content. However, in real-world situations, users may pose questions that go beyond the video's informational boundaries, highlighting the necessity to determine if a video can provide the answer. Discerning whether a question can be answered by video content is challenging due to the multi-modal nature of videos, where visual and verbal information are intertwined. To bridge this gap, we present the YTCommentQA dataset, which contains naturally-generated questions from YouTube, categorized by their answerability and required modality to answer -- visual, script, or both. Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning. The dataset is available at https://github.com/lgresearch/YTCommentQA.



Paperid:2146
Authors:Songhua Yang, Hanjie Zhao, Senbin Zhu, Guangyu Zhou, Hongfei Xu, Yuxiang Jia, Hongying Zan
Zhengzhou University, Zhengzhou University, Zhengzhou University, Zhengzhou University, Zhengzhou University, Zhengzhou University, Zhengzhou University
Abstract:
Recent advances in Large Language Models (LLMs) have achieved remarkable breakthroughs in understanding and responding to user intents. However, their performance lags behind general use cases in some expertise domains, such as Chinese medicine. Existing efforts to incorporate Chinese medicine into LLMs rely on Supervised Fine-Tuning (SFT) with single-turn and distilled dialogue data. These models lack the ability for doctor-like proactive inquiry and multi-turn comprehension and cannot align responses with experts' intentions. In this work, we introduce Zhongjing, the first Chinese medical LLaMA-based LLM that implements an entire training pipeline from continuous pre-training, SFT, to Reinforcement Learning from Human Feedback (RLHF). Additionally, we construct a Chinese multi-turn medical dialogue dataset of 70,000 authentic doctor-patient dialogues, CMtMedQA, which significantly enhances the model's capability for complex dialogue and proactive inquiry initiation. We also define a refined annotation rule and evaluation criteria given the unique characteristics of the biomedical domain. Extensive experimental results show that Zhongjing outperforms baselines in various capacities and matches the performance of ChatGPT in some abilities, despite having 100x fewer parameters. Ablation studies also demonstrate the contributions of each component: pre-training enhances medical knowledge, and RLHF further improves instruction-following ability and safety. Our code, datasets, and models are available at https://github.com/SupritYoung/Zhongjing.



Paperid:2147
Authors:Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He
Microsoft, Microsoft, Microsoft, Microsoft, Microsoft
Abstract:
Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs). However, a systematic examination of various quantization schemes, model families, and quantization bit precision has been absent from the literature. In this paper, we conduct a comprehensive analysis of these factors by investigating the effects of PTQ on weight-only, activation-only, and weight-and-activation quantization using diverse methods such as round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants. We apply these methods to two distinct model families with parameters ranging from 125M to 176B. Our contributions include: (1) a sensitivity analysis revealing that activation quantization is generally more susceptible to weight quantization, with smaller models often outperforming larger models in terms of activation quantization; (2) an evaluation and comparison of existing PTQ methods to optimize model size reduction while minimizing the impact on accuracy, revealing that none of the current methods can achieve the original model quality for quantization with either INT4-weight or INT4-weight-and-INT8-activation; (3) based on these insights, we propose an optimized method called Low-Rank Compensation (LoRC), which employs low-rank matrices to enhance model quality recovery with a minimal increase in model size.
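A minimal sketch of the low-rank compensation idea described for LoRC: quantize a weight matrix, then approximate the quantization error with a truncated SVD. The symmetric round-to-nearest quantizer, the per-tensor scale, and the chosen rank are illustrative assumptions.

import torch

def low_rank_compensation(weight, num_bits=4, rank=8):
    # Simple symmetric round-to-nearest quantization with a per-tensor scale.
    qmax = 2 ** (num_bits - 1) - 1
    scale = weight.abs().max() / qmax
    w_q = (weight / scale).round().clamp(-qmax - 1, qmax) * scale
    # Rank-`rank` approximation of the quantization error via truncated SVD.
    error = weight - w_q
    U, S, Vh = torch.linalg.svd(error, full_matrices=False)
    L = U[:, :rank] * S[:rank]   # (out_features, rank)
    R = Vh[:rank, :]             # (rank, in_features)
    return w_q, L, R             # effective weight: w_q + L @ R

w = torch.randn(512, 512)
w_q, L, R = low_rank_compensation(w)
# Residual with compensation is smaller than the plain quantization error.
print((w - (w_q + L @ R)).abs().mean(), (w - w_q).abs().mean())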



Paperid:2148
Authors:Seonghyeon Ye, Hyeonbin Hwang, Sohee Yang, Hyeongu Yun, Yireun Kim, Minjoon Seo
KAIST, KAIST, UCL, KAIST, LG AI Research, LG AI Research, KAIST
Abstract:
In this paper, we present our finding that prepending a Task-Agnostic Prefix Prompt (TAPP) to the input improves the instruction-following ability of various Large Language Models (LLMs) during inference. TAPP is different from canonical prompts for LLMs in that it is a fixed prompt prepended to the beginning of every input regardless of the target task for zero-shot generalization. We observe that both base LLMs (i.e. not fine-tuned to follow instructions) and instruction-tuned models benefit from TAPP, resulting in 34.58% and 12.26% improvement on average, respectively. This implies that the instruction-following ability of LLMs can be improved during inference time with a fixed prompt constructed with simple heuristics. We hypothesize that TAPP assists language models to better estimate the output distribution by focusing more on the instruction of the target task during inference. In other words, such ability does not seem to be sufficiently activated in base LLMs, nor even in many instruction-fine-tuned LLMs.



Paperid:2149
Authors:Shangjian Yin, Peijie Huang, Yuhong Xu
South China Agricultural University, South China Agricultural University, South China Agricultural University
Abstract:
So far, multi-intent spoken language understanding (SLU) has become a research hotspot in the field of natural language processing (NLP) due to its ability to recognize and extract multiple intents expressed and annotate corresponding sequence slot tags within a single utterance. Previous research has primarily concentrated on the token-level intent-slot interaction to model joint intent detection and slot filling, which resulted in a failure to fully utilize anisotropic intent-guiding information during joint training. In this work, we present a novel architecture by modeling the multi-intent SLU as a multi-view intent-slot interaction. The architecture resolves the kernel bottleneck of unified multi-intent SLU by effectively modeling the intent-slot relations with utterance, chunk, and token-level interaction. We further develop a neural framework, namely Uni-MIS, in which the unified multi-intent SLU is modeled as a three-view intent-slot interaction fusion to better capture the interaction information after special encoding. A chunk-level intent detection decoder is used to sufficiently capture the multi-intent, and an adaptive intent-slot graph network is used to capture the fine-grained intent information to guide final slot filling. We perform extensive experiments on two widely used benchmark datasets for multi-intent SLU, where our model beats all the current strong baselines, pushing the state-of-the-art performance of unified multi-intent SLU. Additionally, the ChatGPT benchmark that we have developed demonstrates that there is a considerable amount of potential research value in the field of multi-intent SLU.



Paperid:2150
Authors:Shuo Yin, Guoqiang Zhong
Ocean University of China, Ocean University of China
Abstract:
Aspect-based sentiment analysis (ABSA) is aimed at predicting the sentiment polarities of the aspects included in a sentence instead of the whole sentence itself, and is a fine-grained learning task compared to the conventional text classification. In recent years, on account of the ability to model the connectivity relationships between the words in one sentence, graph neural networks have been more and more popular to handle the natural language processing tasks, and meanwhile many works emerge for the ABSA task. However, most of the works utilizing graph convolution easily incur the over-smoothing problem, while graph Transformer for ABSA has not been explored yet. In addition, although some previous works are dedicated to using both GNN and Transformer to handle text, how to tightly combine the graph view and the sequence view of text remains open to research. To address the above issues, we propose a double-view graph Transformer on text (TextGT) for ABSA. In TextGT, the procedure in graph view of text is handled by GNN layers, while Transformer layers deal with the sequence view, and these two processes are tightly coupled, alleviating the over-smoothing problem. Moreover, we propose an algorithm for implementing a kind of densely message passing graph convolution called TextGINConv, to employ edge features in graphs. Extensive experiments demonstrate the effectiveness of our TextGT over the state-of-the-art approaches, and validate the TextGINConv module. The source code is available at https://github.com/shuoyinn/TextGT.
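A minimal sketch of a GIN-style graph convolution that uses edge features, in the spirit of the TextGINConv mentioned above. The GINE-like update x_i' = MLP((1 + eps) * x_i + sum_j relu(x_j + e_ij)) is an assumed formulation, not necessarily the paper's exact one.

import torch
import torch.nn as nn

class EdgeAwareGINLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, edge_index, edge_attr):
        # x: (N, d) node features; edge_index: (2, E) [src, dst]; edge_attr: (E, d).
        src, dst = edge_index
        messages = torch.relu(x[src] + edge_attr)                # edge-aware messages
        agg = torch.zeros_like(x).index_add_(0, dst, messages)   # sum into target nodes
        return self.mlp((1 + self.eps) * x + agg)

# Toy word graph with 3 nodes and 3 edges.
x = torch.randn(3, 16)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
edge_attr = torch.randn(3, 16)
out = EdgeAwareGINLayer(16)(x, edge_index, edge_attr)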



Paperid:2151
Authors:Xunjian Yin, Jin Jiang, Liming Yang, Xiaojun Wan
Peking University, Peking University, Tsinghua University, Peking University
Abstract:
The imperative task of revising or updating the knowledge stored within large language models arises from two distinct sources: intrinsic errors inherent in the model which should be corrected and outdated knowledge due to external shifts in the real world which should be updated. Prevailing efforts in model editing conflate these two distinct categories of edits arising from distinct reasons and directly modify the original knowledge in models into new knowledge. However, we argue that preserving the model's original knowledge remains pertinent. Specifically, if a model's knowledge becomes outdated due to evolving worldly dynamics, it should retain recollection of the historical knowledge while integrating the newfound knowledge. In this work, we introduce the task of Temporal Knowledge Editing (TKE) and establish a benchmark AToKe (Assessment of TempOral Knowledge Editing) to evaluate current model editing methods. We find that while existing model editing methods are effective at making models remember new knowledge, the edited model catastrophically forgets historical knowledge. To address this gap, we propose a simple and general framework termed Multi-Editing with Time Objective (METO) for enhancing existing editing models, which edits both historical and new knowledge concurrently and optimizes the model's prediction for the time of each fact. Our assessments demonstrate that while AToKe is still difficult, METO maintains the effectiveness of learning new knowledge and meanwhile substantially improves the performance of edited models on utilizing historical knowledge.



Paperid:2152
Authors:YoungJoon Yoo, JongWon Choi
ImageVision, NAVER Cloud., Chung-Ang University
Abstract:
This paper introduces a novel approach for topic modeling utilizing latent codebooks from Vector-Quantized Variational Auto-Encoder (VQ-VAE), discretely encapsulating the rich information of the pre-trained embeddings such as the pre-trained language model. From the novel interpretation of the latent codebooks and embeddings as conceptual bag-of-words, we propose a new generative topic model called Topic-VQ-VAE (TVQ-VAE) which inversely generates the original documents related to the respective latent codebook. The TVQ-VAE can visualize the topics with various generative distributions including the traditional BoW distribution and the autoregressive image generation. Our experimental results on document analysis and image generation demonstrate that TVQ-VAE effectively captures the topic context which reveals the underlying structures of the dataset and supports flexible forms of document generation. Official implementation of the proposed TVQ-VAE is available at https://github.com/clovaai/TVQ-VAE.



Paperid:2153
Authors:Weihao You, Pengcheng Wang, Changlong Li, Zhilong Ji, Jinfeng Bai
Tomorrow Advancing Life, Tomorrow Advancing Life, Tomorrow Advancing Life, Tomorrow Advancing Life, Tomorrow Advancing Life
Abstract:
New NLP benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present a meticulously designed evaluation benchmark that leverages the knowledge graph. This evaluation comprises 584 level-1 knowledge points and 1,989 level-2 knowledge points, thereby encompassing a comprehensive spectrum of the K12 education domain knowledge. The primary objective is to comprehensively assess the high-level comprehension aptitude and reasoning capabilities of LLMs operating within the Chinese context. Our evaluation incorporates five distinct question types with 39,452 questions. We test the current mainstream LLMs in three distinct modes. Firstly, four prompt evaluation modes were employed to assess the fundamental capacity. Additionally, for choice questions, a result-oriented evaluation approach was designed through data augmentation to assess the model's proficiency in advanced knowledge and reasoning. Moreover, a subset with reasoning process is derived, and the process-oriented testing method is used to test the model's interpretability and higher-order reasoning capacity. We further show models' capability in our knowledge points, and anticipate the evaluation can assist in the assessment of the strengths and deficiencies of LLMs on knowledge points, thus fostering their development within the Chinese context. Our dataset will be publicly available at https://github.com/tal-tech/chinese-k12-evaluation.



Paperid:2154
Authors:Junjie Yu, Xing Wang, Wenliang Chen
Soochow University, Tencent AI Lab, Soochow University
Abstract:
Automated construction of annotated data holds significant importance in Relation Extraction (RE) tasks due to the hardness and cost of human annotation. In this work, we propose Self-RDGS, a method for Self-supervised Reliable Data Generation and Selection in low-resource RE tasks. At first, we fully utilize the knowledge of triplets as prompts to generate sentences by employing the Large Language Models (LLMs). Since the auto-generated data contains noise, we then propose a ranking-based data selection method to select reliable sentences. Finally, we integrate the data selection and RE model training within a self-supervised iterative framework. Through experimentation on three datasets with low-resource settings, we demonstrate the effectiveness of our proposed approach in constructing annotated data and achieving noteworthy improvements in comparison to multiple baselines. Code, data and models are available at https://github.com/jjyunlp/GenerationRE.



Paperid:2155
Authors:Lang Yu, Qin Chen, Jie Zhou, Liang He
School of Computer Science and Technology, East China Normal University Shanghai Institute of AI for Education, East China Normal University, School of Computer Science and Technology, East China Normal University Shanghai Institute of AI for Education, East China Normal University, School of Computer Science and Technology, East China Normal University Shanghai Institute of AI for Education, East China Normal University, School of Computer Science and Technology, East China Normal University Shanghai Institute of AI for Education, East China Normal University
Abstract:
Large language models (LLMs) have shown great success in various Natural Language Processing (NLP) tasks, whilst they still need updates after deployment to fix errors or keep pace with the changing knowledge in the world. Researchers formulate such problem as Model Editing and have developed various editors focusing on different axes of editing properties. However, current editors can hardly support all properties and rely on heavy computational resources. In this paper, we propose a plug-in Model Editing method based on neuron-indexed dynamic LoRA (MELO), which alters the behavior of language models by dynamically activating certain LoRA blocks according to the index built in an inner vector database. Our method satisfies various editing properties with high efficiency and can be easily integrated into multiple LLM backbones. Experimental results show that our proposed MELO achieves state-of-the-art editing performance on three sequential editing tasks (document classification, question answering and hallucination correction), while requiring the least trainable parameters and computational cost.
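A minimal sketch of dynamically activated LoRA blocks routed by an inner vector database of keys, loosely following the MELO description above. The key construction, the cosine threshold, and the per-block granularity are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_blocks=4, rank=4, threshold=0.8):
        super().__init__()
        self.base, self.threshold = base, threshold
        d_in, d_out = base.in_features, base.out_features
        # Each key indexes one LoRA block; keys would be built from edit inputs.
        self.keys = nn.Parameter(torch.randn(num_blocks, d_in), requires_grad=False)
        self.A = nn.Parameter(torch.zeros(num_blocks, rank, d_in))
        self.B = nn.Parameter(torch.randn(num_blocks, d_out, rank) * 0.01)

    def forward(self, x):
        # x: (d_in,) single example for simplicity.
        y = self.base(x)
        sims = F.cosine_similarity(self.keys, x.unsqueeze(0), dim=-1)
        score, idx = sims.max(dim=0)
        if score >= self.threshold:                  # activate the indexed LoRA block
            y = y + self.B[idx] @ (self.A[idx] @ x)
        return y                                     # otherwise behave like the base model

layer = DynamicLoRALinear(nn.Linear(32, 32))
out = layer(torch.randn(32))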



Paperid:2156
Authors:Tianyu Yu, Chengyue Jiang, Chao Lou, Shen Huang, Xiaobin Wang, Wei Liu, Jiong Cai, Yangning Li, Yinghui Li, Kewei Tu, Hai-Tao Zheng, Ningyu Zhang, Pengjun Xie, Fei Huang, Yong Jiang
Tsinghua University, ShanghaiTech University, ShanghaiTech University, Alibaba Group, Alibaba Group, ShanghaiTech University, ShanghaiTech University, Tsinghua University, Tsinghua University, ShanghaiTech University, Tsinghua University, Zhejiang University, Alibaba Group, Alibaba Group, Alibaba Group
Abstract:
Large language models (LLMs) have shown impressive abilities for open-domain NLP tasks. However, LLMs are sometimes too footloose for natural language understanding (NLU) tasks which always have restricted output and input format. Their performances on NLU tasks are highly related to prompts or demonstrations and are shown to be poor at performing several representative NLU tasks, such as event extraction and entity typing. To this end, we present SeqGPT, a bilingual (i.e., English and Chinese) open-source autoregressive model specially enhanced for open-domain natural language understanding. We express all NLU tasks with two atomic tasks, which define fixed instructions to restrict the input and output format but still "open" for arbitrarily varied label sets. The model is first instruction-tuned with extremely fine-grained labeled data synthesized by ChatGPT and then further fine-tuned by 233 different atomic tasks from 152 datasets across various domains. The experimental results show that SeqGPT has decent classification and extraction ability, and is capable of performing language understanding tasks on unseen domains. We also conduct empirical studies on the scaling of data and model size as well as on the transfer across tasks. Our models are accessible at https://github.com/Alibaba-NLP/SeqGPT.



Paperid:2157
Authors:Quan Yuan, Mehran Kazemi, Xin Xu, Isaac Noble, Vaiva Imbrasaite, Deepak Ramachandran
Google Research, Google Research, Google Research, Google Research, Google Research, Google Research
Abstract:
Structured Complex Task Decomposition (SCTD) is the problem of breaking down a complex real-world task (such as planning a wedding) into a directed acyclic graph over individual steps that contribute to achieving the task, with edges specifying temporal dependencies between steps. SCTD is an important component of assistive planning tools, and a challenge for commonsense reasoning systems. We probe how accurately SCTD can be done with the knowledge extracted from pre-trained Large Language Models (LLMs). We introduce a new high-quality human-annotated dataset for this problem and novel metrics to fairly assess performance of LLMs against several baselines. Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline. We also propose a number of approaches to further improve their performance, with a relative improvement of 7% to 37%. However, we find that LLMs still struggle to predict pairwise temporal dependencies, which reveals a gap in their understanding of complex tasks.



Paperid:2158
Authors:Urchade Zaratiana, Nadi Tomeh, Pierre Holat, Thierry Charnois
FI Group, Puteaux, France LIPN - Université Sorbonne Paris Nord - CNRS UMR 7030, Villetaneuse, France, LIPN - Université Sorbonne Paris Nord - CNRS UMR 7030, Villetaneuse, France, FI Group, Puteaux, France LIPN - Université Sorbonne Paris Nord - CNRS UMR 7030, Villetaneuse, France, LIPN - Université Sorbonne Paris Nord - CNRS UMR 7030, Villetaneuse, France
Abstract:
In this paper, we propose a novel method for joint entity and relation extraction from unstructured text by framing it as a conditional sequence generation problem. In contrast to conventional generative information extraction models that are left-to-right token-level generators, our approach is span-based. It generates a linearized graph where nodes represent text spans and edges represent relation triplets. Our method employs a transformer encoder-decoder architecture with a pointing mechanism on a dynamic vocabulary of spans and relation types. Our model can capture the structural characteristics and boundaries of entities and relations through span representations while simultaneously grounding the generated output in the original text thanks to the pointing mechanism. Evaluation on benchmark datasets validates the effectiveness of our approach, demonstrating competitive results. Code is available at https://github.com/urchade/ATG.



Paperid:2159
Authors:Jiali Zeng, Fandong Meng, Yongjing Yin, Jie Zhou
Tencent WeChat AI - Pattern Recognition Center Tencent Inc., Tencent WeChat AI - Pattern Recognition Center Tencent Inc., Tencent WeChat AI - Pattern Recognition Center Tencent Inc., Tencent WeChat AI - Pattern Recognition Center Tencent Inc.
Abstract:
Open-sourced large language models (LLMs) have demonstrated remarkable efficacy in various tasks with instruction tuning. However, these models can sometimes struggle with tasks that require more specialized knowledge such as translation. One possible reason for such deficiency is that instruction tuning aims to generate fluent and coherent text that continues from a given instruction without being constrained by any task-specific requirements. Moreover, it can be more challenging to tune smaller LLMs with lower-quality training data. To address this issue, we propose a novel framework using examples in comparison to teach LLMs to learn translation. Our approach involves output comparison and preference comparison, presenting the model with carefully designed examples of correct and incorrect translations and an additional preference loss for better regularization. Empirical evaluation on four language directions of WMT2022 and FLORES-200 benchmarks shows the superiority of our proposed method over existing methods. Our findings offer a new perspective on fine-tuning LLMs for translation tasks and provide a promising solution for generating high-quality translations. Please refer to Github for more details: https://github.com/lemon0830/TIM.
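A minimal sketch of a preference-comparison loss of the kind described above: push the sequence log-likelihood of a preferred translation above a dispreferred one by a margin. The margin value and the length-averaged scoring are illustrative assumptions, not the paper's exact objective.

import torch
import torch.nn.functional as F

def preference_loss(logits_good, labels_good, logits_bad, labels_bad, margin=1.0):
    # logits_*: (T, V) per-token logits; labels_*: (T,) target token ids.
    def seq_logprob(logits, labels):
        logp = F.log_softmax(logits, dim=-1)
        return logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1).mean()

    good = seq_logprob(logits_good, labels_good)
    bad = seq_logprob(logits_bad, labels_bad)
    return F.relu(margin - (good - bad))  # zero once the gap exceeds the margin

loss = preference_loss(torch.randn(7, 100), torch.randint(100, (7,)),
                       torch.randn(9, 100), torch.randint(100, (9,)))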



Paperid:2160
Authors:Jinshan Zeng, Xianchao Tong, Xianglong Yu, Wenyan Xiao, Qing Huang
Jiangxi Normal University, Jiangxi Normal University, Jiangxi Normal University, Jiangxi University of Science and Technology, Jiangxi Normal University
Abstract:
The hybrid automatic readability assessment (ARA) models that combine deep and linguistic features have recently received rising attention due to their impressive performance. However, the utilization of linguistic features is not fully realized, as ARA models frequently concentrate excessively on the numerical values of these features, neglecting valuable structural information embedded within them. This leads to a limited contribution of linguistic features in these hybrid ARA models, and in some cases, it may even result in counterproductive outcomes. In this paper, we propose a novel hybrid ARA model named InterpretARA by introducing a linguistic interpreter to better comprehend the structural information contained in linguistic features, and leveraging contrastive learning that enables the model to understand relative difficulty relationships among texts and thus enhances deep representations. Both document-level and segment-level deep representations are extracted and used for the readability assessment. A series of experiments are conducted over four English corpora and one Chinese corpus to demonstrate the effectiveness of the proposed model. Experimental results show that InterpretARA outperforms state-of-the-art models in most corpora, and the introduced linguistic interpreter can provide more useful information than existing approaches for ARA.



Paperid:2161
Authors:Ziqian Zeng, Yihuai Hong, Hongliang Dai, Huiping Zhuang, Cen Chen
South China University of Technology, China, South China University of Technology, China, Nanjing University of Aeronautics and Astronautics, China, South China University of Technology, China, South China University of Technology, China Pazhou Laboratory, China
Abstract:
Early Exiting is one of the most popular methods to achieve efficient inference. Current early exiting methods adopt the (weighted) sum of the cross entropy loss of all internal classifiers as the objective function during training, requiring all these classifiers to predict all instances correctly. However, during inference, as long as one internal classifier predicts an instance correctly, it can accelerate without losing accuracy. Thus, there is a notable gap between training and inference. We propose ConsistentEE, an early exiting method that is consistent in training and inference. ConsistentEE formulates the early exiting process as a reinforcement learning problem. A policy network is added to decide whether an instance should exit or continue. The training objective of ConsistentEE only requires each instance to be predicted correctly by one internal classifier. Additionally, we introduce the concept "Memorized Layer" to measure the hardness of an instance. We incorporate the memorized layer into the reward function design, which allows "easy" instances to focus more on acceleration while "hard" instances focus more on accuracy. Experimental results show that our method outperforms other baselines on various natural language understanding and generation tasks using PLMs and LLMs as backbones respectively.
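
To make the exit-or-continue decision concrete, here is a minimal sketch of a per-layer policy head and a greedy inference loop; it is illustrative only (the network sizes, threshold, and fallback rule are assumptions, not the paper's exact design):

    import torch
    import torch.nn as nn

    class ExitPolicy(nn.Module):
        # Tiny policy head: maps a layer's hidden state to P(exit).
        def __init__(self, hidden_size):
            super().__init__()
            self.score = nn.Linear(hidden_size, 1)

        def forward(self, h):                    # h: [batch, hidden]
            return torch.sigmoid(self.score(h))  # exit probability in (0, 1)

    def early_exit_inference(hidden_states, policies, classifiers, threshold=0.5):
        # Walk the layers (batch size 1 for simplicity); stop at the first
        # layer whose policy chooses to exit and return that classifier's label.
        for h, policy, clf in zip(hidden_states, policies, classifiers):
            if policy(h).item() > threshold:
                return clf(h).argmax(-1)
        # Fall back to the last internal classifier if no layer chose to exit.
        return classifiers[-1](hidden_states[-1]).argmax(-1)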



Paperid:2162
Authors:Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Malu Zhang, Haizhou Li
National University of Singapore, Speech Technology Group Universidad Politécnica de Madrid, Spain, National University of Singapore, University of Electronic Science and Technology of China, National University of Singapore The Chinese University of Hong Kong (Shenzhen), China
Abstract:
Automatic evaluation is an integral aspect of dialogue system research. The traditional reference-based NLG metrics are generally found to be unsuitable for dialogue assessment. Consequently, recent studies have suggested various unique, reference-free neural metrics that better align with human evaluations. Notably among them, large language models (LLMs), particularly the instruction-tuned variants like ChatGPT, are shown to be promising substitutes for human judges. Yet, existing works on utilizing LLMs for automatic dialogue evaluation are limited in their scope in terms of the number of meta-evaluation datasets, mode of evaluation, coverage of LLMs, etc. Hence, it remains inconclusive how effective these LLMs are. To this end, we conduct a comprehensive study on the application of LLMs for automatic dialogue evaluation. Specifically, we analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels, using a comprehensive set of 12 meta-evaluation datasets. Additionally, we probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels. Finally, we explore how model-level and dimension-level ensembles impact the evaluation performance. All resources are available at https://github.com/e0397123/comp-analysis.



Paperid:2163
Authors:Chenrui Zhang, Lin Liu, Chuyuan Wang, Xiao Sun, Hongyu Wang, Jinpeng Wang, Mingchen Cai
Meituan Inc., School of Computer and Information Technology, Beijing Jiaotong University, Meituan Inc., Meituan Inc., Meituan Inc., Meituan Inc., Meituan Inc.
Abstract:
As an effective tool for eliciting the power of Large Language Models (LLMs), prompting has recently demonstrated unprecedented abilities across a variety of complex tasks. To further improve the performance, prompt ensembling has attracted substantial interest for tackling the hallucination and instability of LLMs. However, existing methods usually adopt a two-stage paradigm, which requires a pre-prepared set of prompts with substantial manual effort, and is unable to perform directed optimization for different weak learners. In this paper, we propose a simple, universal, and automatic method named PREFER (Prompt Ensemble learning via Feedback-Reflect-Refine) to address the stated limitations. Specifically, given the fact that weak learners are supposed to focus on hard examples during boosting, PREFER builds a feedback mechanism for reflecting on the inadequacies of existing weak learners. Based on this, the LLM is required to automatically synthesize new prompts for iterative refinement. Moreover, to enhance the stability of prompt effect evaluation, we propose a novel prompt bagging method involving forward and backward thinking, which is superior to majority voting and is beneficial for both feedback and weight calculation in boosting. Extensive experiments demonstrate that our PREFER achieves state-of-the-art performance in multiple types of tasks by a significant margin. We have made our code publicly available.



Paperid:2164
Authors:Congzhi Zhang, Linhai Zhang, Deyu Zhou
School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China, School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China, School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China
Abstract:
Multi-hop fact verification aims to detect the veracity of a given claim by integrating and reasoning over multiple pieces of evidence. Conventional multi-hop fact verification models are prone to rely on spurious correlations from the annotation artifacts, leading to an obvious performance decline on unbiased datasets. Among the various debiasing works, causal inference-based methods have become popular by performing theoretically guaranteed debiasing such as causal intervention or counterfactual reasoning. However, existing causal inference-based debiasing methods, which mainly formulate fact verification as a single-hop reasoning task to tackle shallow bias patterns, cannot deal with the complicated bias patterns hidden in multiple hops of evidence. To address this challenge, we propose Causal Walk, a novel method for debiasing multi-hop fact verification from a causal perspective with front-door adjustment. Specifically, in the structural causal model, the reasoning path between the treatment (the input claim-evidence graph) and the outcome (the veracity label) is introduced as the mediator to block the confounder. With the front-door adjustment, the causal effect between the treatment and the outcome is decomposed into the causal effect between the treatment and the mediator, which is estimated by applying the idea of random walk, and the causal effect between the mediator and the outcome, which is estimated with normalized weighted geometric mean approximation. To investigate the effectiveness of the proposed method, an adversarial multi-hop fact verification dataset and a symmetric multi-hop fact verification dataset are constructed with the help of a large language model. Experimental results show that Causal Walk outperforms previous debiasing methods on both existing datasets and the newly constructed datasets. Code and data will be released at https://github.com/zcccccz/CausalWalk.
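
For reference, the front-door adjustment invoked here is the standard identity from causal inference; writing X for the treatment (the claim-evidence graph), M for the mediator (the reasoning path), and Y for the outcome (the veracity label), it factorizes the causal effect into the two terms the abstract estimates with a random walk and a normalized weighted geometric mean:

    P(Y \mid do(X = x)) = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid m, x')\, P(x')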



Paperid:2165
Authors:Fang Zhang, Yongxin Zhu, Xiangxiang Wang, Huang Chen, Xing Sun, Linli Xu
School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, Tencent YouTu Lab, Tencent YouTu Lab, Tencent YouTu Lab, School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence
Abstract:
Due to the detrimental impact of noise on the conventional audio speech recognition (ASR) task, audio-visual speech recognition (AVSR) has been proposed by incorporating both audio and visual video signals. Although existing methods have demonstrated that the aligned visual input of lip movements can enhance the robustness of AVSR systems against noise, the paired videos are not always available during inference, leading to the problem of the missing visual modality, which restricts their practicality in real-world scenarios. To tackle this problem, we propose a Discrete Feature based Visual Generative Model (DFVGM) which exploits semantic correspondences between the audio and visual modalities during training, generating visual hallucinations in lieu of real videos during inference. To achieve that, the primary challenge is to generate the visual hallucination given the noisy audio while preserving semantic correspondences with the clean speech. To tackle this challenge, we start with training the audio encoder in the Audio-Only (AO) setting, which generates continuous semantic features closely associated with the linguistic information. Simultaneously, the visual encoder is trained in the Visual-Only (VO) setting, producing visual features that are phonetically related. Next, we employ K-means to discretize the continuous audio and visual feature spaces. The discretization step allows DFVGM to capture high-level semantic structures that are more resilient to noise and generate visual hallucinations with high quality. To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a remarkable 53% relative reduction (30.5% -> 12.9%) in Word Error Rate (WER) on average compared to the current state-of-the-art Audio-Only (AO) baselines while maintaining comparable results (< 5% difference) under the Audio-Visual (AV) setting even without video as input.
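
The discretization step described above can be pictured with a few lines of scikit-learn; the feature dimensionality, number of clusters, and random features here are placeholders rather than the paper's settings:

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-ins for continuous frame-level encoder features (e.g., audio frames).
    feats = np.random.randn(5000, 256).astype(np.float32)

    # Learn a codebook, then map every frame to its nearest centroid index,
    # turning continuous features into a sequence of discrete units.
    kmeans = KMeans(n_clusters=200, n_init=10, random_state=0).fit(feats)
    discrete_units = kmeans.predict(feats)   # shape (5000,), values in [0, 200)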



Paperid:2166
Authors:Junwei Zhang, Ruifang He, Fengyu Guo, Chang Liu
Center for Artificial Intelligence and Intelligent Medicine, Hangzhou Institute of Medicine, Chinese Academy of Sciences, Zhejiang Province, China., Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China., College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China., CSSC Systems Engineering Research Institute, Beijing, China.
Abstract:
Word Sense Disambiguation (WSD) aims to determine the meaning of the target word according to the given context. Currently, a single representation enhanced by glosses from different dictionaries or languages is used to characterize each word sense. By analyzing the similarity between glosses of the same word sense, we find semantic biases among them, revealing that the glosses have their own descriptive perspectives. Therefore, the traditional approach of integrating all glosses into a single representation fails to present the unique semantics revealed by the individual glosses. In this paper, a quantum superposition state is employed to formalize the representations of multiple glosses of the same word sense to reveal their distributions. Furthermore, the quantum interference model is leveraged to calculate the probability that the target word belongs to this superposition state. The advantage is that the interference term can be regarded as a confidence level to guide word sense recognition. Finally, experiments are performed under the standard WSD evaluation framework and on the latest cross-lingual datasets, and the results verify the effectiveness of our model.



Paperid:2167
Authors:Kun Zhang, Jiali Zeng, Fandong Meng, Yuanzhuo Wang, Shiqi Sun, Long Bai, Huawei Shen, Jie Zhou
Institute of Computing Technology, Chinese Academy of Sciences School of Computer Science and Technology, University of Chinese Academy of Sciences, Tencent WeChat AI - Pattern Recognition Center Tencent Inc., Tencent WeChat AI - Pattern Recognition Center Tencent Inc., Institute of Computing Technology, Chinese Academy of Sciences School of Computer Science and Technology, University of Chinese Academy of Sciences Big Data Academy, Zhongke, Big Data Academy, Zhongke, School of Computer Science and Technology, University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Tencent WeChat AI - Pattern Recognition Center Tencent Inc.
Abstract:
Large language models (LLMs) have recently demonstrated remarkable performance across various Natural Language Processing tasks. In the field of multi-hop reasoning, the Chain-of-Thought (CoT) prompting method has emerged as a paradigm, using curated stepwise reasoning demonstrations to enhance LLMs' ability to reason and produce coherent rational pathways. To ensure the accuracy, reliability, and traceability of the generated answers, many studies have incorporated information retrieval (IR) to provide LLMs with external knowledge. However, existing CoT-with-IR methods decompose questions into sub-questions based on a single compositionality type, which limits their effectiveness for questions involving multiple compositionality types. Additionally, these methods suffer from inefficient retrieval, as complex questions often contain abundant information, leading to the retrieval of irrelevant information inconsistent with the query's intent. In this work, we propose a novel question decomposition framework called TRQA for multi-hop question answering, which addresses these limitations. Our framework introduces a reasoning tree (RT) to represent the structure of complex questions. It consists of four components: the Reasoning Tree Constructor (RTC), the Question Generator (QG), the Retrieval and LLM Interaction Module (RAIL), and the Answer Aggregation Module (AAM). Specifically, the RTC predicts diverse sub-question structures to construct the reasoning tree, allowing a more comprehensive representation of complex questions. The QG generates sub-questions for leaf nodes in the reasoning tree, and we explore two methods for QG: prompt-based and T5-based approaches. The IR module retrieves documents aligned with sub-questions, while the LLM formulates answers based on the retrieved information. Finally, the AAM aggregates answers along the reasoning tree, producing a definitive response from bottom to top.



Paperid:2168
Authors:XiaoHui Zhang, Jiangyan Yi, Chenglong Wang, Chu Yuan Zhang, Siding Zeng, Jianhua Tao
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Computer and Information Technology, University of Beijing Jiaotong, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China University of Science and Technology of China, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, Department of Automation, Tsinghua University, Beijing, China
Abstract:
The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the emergent effective approaches is continual learning. In this paper, we propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection. The fundamental concept underlying RWM involves categorizing all classes into two groups: those with compact feature distributions across tasks, such as genuine audio, and those with more spread-out distributions, like various types of fake audio. These distinctions are quantified by means of the in-class cosine distance, which subsequently serves as the basis for RWM to introduce a trainable gradient modification direction for distinct data types. Experimental evaluations against mainstream continual learning methods reveal the superiority of RWM in terms of knowledge acquisition and mitigating forgetting in audio deepfake detection. Furthermore, RWM's applicability extends beyond audio deepfake detection, demonstrating its potential significance in diverse machine learning domains such as image recognition.



Paperid:2169
Authors:Xiaotong Zhang, Xuefang Jia, Han Liu, Xinyue Liu, Xianchao Zhang
Dalian University of Technology, Dalian, China, Dalian University of Technology, Dalian, China, Dalian University of Technology, Dalian, China, Dalian University of Technology, Dalian, China, Dalian University of Technology, Dalian, China
Abstract:
Multi-goal conversational recommender systems (MG-CRS), which are more in line with realistic scenarios, have attracted a lot of attention. MG-CRS can dynamically capture the demands of users in conversation, continuously engage their interests, and make recommendations. The key to accomplishing these tasks is to plan a reasonable goal sequence which can naturally guide the user to accept the recommended goal. Previous works have demonstrated that mining the correlations of goals from the goal sequences in the dialogue corpus is helpful for recommending the goal that the user is interested in. However, they independently model correlations for each level of goal (i.e., goal type or entity) and neglect the order in which goals appear in the dialogue. In this paper, we propose a goal interaction graph planning framework which constructs a directed heterogeneous graph to flexibly model the correlations between any level of goals and retain the order of goals. We design a goal interaction graph learning module to model the goal correlations and propagate goal representations via directed edges, then use an encoder and a dual-way fusion decoder to extract the information most relevant to the current goal from the conversation and domain knowledge, making the next-goal prediction fully exploit the prior goal correlations and user feedback. Finally, we generate engaging responses based on the predicted goal sequence to complete the recommendation task. Experiments on two benchmark datasets show that our method achieves significant improvements in both the goal planning and response generation tasks.



Paperid:2170
Authors:You Zhang, Jin Wang, Liang-Chih Yu, Dan Xu, Xuejie Zhang
Yunnan University, Yunnan University, Yuan Ze University, Yunnan University, Yunnan University
Abstract:
Effectively and efficiently adapting a pre-trained language model (PLM) for human-centered text understanding (HCTU) is challenging since user tokens are million-level in most personalized applications and do not have concrete explicit semantics. A standard and parameter-efficient approach (e.g., LoRA) necessitates memorizing numerous suites of adapters for each user. In this work, we introduce a personalized LoRA (PLoRA) with a plug-and-play (PnP) framework for the HCTU task. PLoRA is effective, parameter-efficient, and dynamically deployable in PLMs. Moreover, personalized dropout and mutual information maximization strategies are adopted, and hence the proposed PLoRA can be well adapted to few/zero-shot learning scenarios for the cold-start issue. Experiments conducted on four benchmark datasets show that the proposed method outperforms existing methods in full/few/zero-shot learning scenarios for the HCTU task, even though it has fewer trainable parameters. For reproducibility, the code for this paper is available at: https://github.com/yoyo-yun/PLoRA.
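
One plausible way to picture a personalized LoRA layer is sketched below: a frozen base projection plus a low-rank update conditioned on a learned per-user vector. The exact parameterization (where the user embedding enters, the rank, the number of users) is our assumption for illustration, not the released implementation:

    import torch
    import torch.nn as nn

    class PersonalizedLoRALinear(nn.Module):
        def __init__(self, in_dim, out_dim, rank=8, num_users=1000):
            super().__init__()
            self.base = nn.Linear(in_dim, out_dim)          # frozen pre-trained projection
            self.base.weight.requires_grad_(False)
            self.A = nn.Linear(in_dim, rank, bias=False)    # trainable low-rank down-projection
            self.B = nn.Linear(rank, out_dim, bias=False)   # trainable low-rank up-projection
            self.user_emb = nn.Embedding(num_users, rank)   # per-user vector in the rank space

        def forward(self, x, user_id):
            # x: [batch, in_dim]; user_id: [batch] of integer user indices.
            delta = self.B(self.A(x) + self.user_emb(user_id))
            return self.base(x) + delta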



Paperid:2171
Authors:Yu Zhang, Rongjie Huang, Ruiqi Li, JinZheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Huawei Cloud, Huawei Cloud, Huawei Cloud, Zhejiang University
Abstract:
Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, the endeavor to model the intricate nuances of singing voice styles is an arduous task, as singing voices possess a remarkable degree of expressiveness. Moreover, existing SVS methods encounter a decline in the quality of synthesized singing voices in OOD scenarios, as they rest upon the assumption that the target vocal attributes are discernible during the training phase. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two critical approaches for enhanced effectiveness: 1) the Residual Style Adaptor (RSA) which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) the Uncertainty Modeling Layer Normalization (UMLN) to perturb the style attributes within the content representation during the training phase and thus improve the model generalization. Our extensive evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. Access to singing voice samples can be found at https://stylesinger.github.io/.



Paperid:2172
Authors:Yu Zhang, Yunyi Zhang, Yanzhen Shen, Yu Deng, Lucian Popa, Larisa Shwartz, ChengXiang Zhai, Jiawei Han
University of Illinois at Urbana-Champaign, University of Illinois at Urbana-Champaign, University of Illinois at Urbana-Champaign, IBM Thomas J. Watson Research Center, IBM Almaden Research Center, IBM Thomas J. Watson Research Center, University of Illinois at Urbana-Champaign, University of Illinois at Urbana-Champaign
Abstract:
Accurately typing entity mentions from text segments is a fundamental task for various natural language processing applications. Many previous approaches rely on massive human-annotated data to perform entity typing. Nevertheless, collecting such data in highly specialized science and engineering domains (e.g., software engineering and security) can be time-consuming and costly, not to mention the domain gaps between training and inference data if the model needs to be applied to confidential datasets. In this paper, we study the task of seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types (i.e., those without seed entities). To solve this problem, we propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus using the contextualized representations of pre-trained language models. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types. Extensive experiments on two datasets covering four domains demonstrate the effectiveness of SEType in comparison with various baselines. Code and data are available at: https://github.com/yuzhimanhua/SEType.



Paperid:2173
Authors:Yue Zhang, Ming Zhang, Haipeng Yuan, Shichun Liu, Yongyao Shi, Tao Gui, Qi Zhang, Xuanjing Huang
School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, Shanghai Advanced Institute of Finance, Shanghai Jiaotong University, Shanghai, China, Institute of Modern Languages and Linguistics, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China, School of Computer Science, Fudan University, Shanghai, China
Abstract:
Recently, the evaluation of Large Language Models has emerged as a popular area of research. The three crucial questions for LLM evaluation are "what, where, and how to evaluate". However, the existing research mainly focuses on the first two questions, which are basically what tasks to give the LLM during testing and what kind of knowledge it should deal with. As for the third question, which concerns what standards to use, the types of evaluators, how to score, and how to rank, there has not been much discussion. In this paper, we analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing on-site, crowdsourced, and public annotators as well as GPT-4, with different scoring methods and ranking systems. We propose a new dataset, LLMEval, and conduct evaluations on 20 LLMs. A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results. We perform comparisons and analyses of different settings and draw 10 conclusions that can provide some insights for evaluating LLMs in the future. The dataset and the results are publicly available at https://github.com/llmeval. The version with the appendix is publicly available at https://arxiv.org/abs/2312.07398.



Paperid:2174
Authors:Zhuowei Zhang, Mengting Hu, Yinhao Bai, Zhen Zhang
Nankai University, Nankai University, Nankai University, Nankai University
Abstract:
Mind-map generation aims to process a document into a hierarchical structure to show its central idea and branches. Such a structure is more conducive to understanding the logic and semantics of the document than plain text. Recently, a state-of-the-art method encodes the sentences of a document sequentially and converts them to a relation graph via sequence-to-graph. Though this method efficiently generates mind-maps in parallel, its mechanism focuses more on sequential features while hardly capturing structural information. Moreover, it is difficult to model long-range semantic relations. In this work, we propose a coreference-guided mind-map generation network (CMGN) to incorporate external structure knowledge. Specifically, we construct a coreference graph based on the coreference semantic relationship to introduce the graph structure information. Then we employ a coreference graph encoder to mine the potential governing relations between sentences. In order to exclude noise and better utilize the information of the coreference graph, we adopt a graph enhancement module in a contrastive learning manner. Experimental results demonstrate that our model outperforms all the existing methods. The case study further proves that our model can more accurately and concisely reveal the structure and semantics of a document. Code and data are available at https://github.com/Cyno2232/CMGN.



Paperid:2175
Authors:Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang
Department of Automation, BNRist, Tsinghua University, Department of Computer Science, BNRist, Tsinghua University, Department of Computer Science, BNRist, Tsinghua University, Department of Computer Science, BNRist, Tsinghua University, Department of Computer Science, BNRist, Tsinghua University, Department of Automation, BNRist, Tsinghua University
Abstract:
The recent surge in research interest in applying large language models (LLMs) to decision-making tasks has flourished by leveraging the extensive world knowledge embedded in LLMs. While there is a growing demand to tailor LLMs for custom decision-making tasks, fine-tuning them for specific tasks is resource-intensive and may diminish the model's generalization capabilities. Moreover, state-of-the-art language models like GPT-4 and Claude are primarily accessible through API calls, with their parametric weights remaining proprietary and unavailable to the public. This scenario emphasizes the growing need for new methodologies that allow learning from agent experiences without requiring parametric updates. To address these problems, we introduce the Experiential Learning (ExpeL) agent. Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks. At inference, the agent recalls its extracted insights and past experiences to make informed decisions. Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments.



Paperid:2176
Authors:Rui Zhao, Liang Zhang, Biao Fu, Cong Hu, Jinsong Su, Yidong Chen
School of Informatics, Xiamen University, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, School of Informatics, Xiamen University, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, School of Informatics, Xiamen University, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, School of Informatics, Xiamen University, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, School of Informatics, Xiamen University, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, School of Informatics, Xiamen University, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
Abstract:
Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences. As a typical multimodal task, there exists an inherent modality gap between sign language videos and spoken language text, which makes the cross-modal alignment between visual and textual modalities crucial. However, previous studies tend to rely on an intermediate sign gloss representation to help alleviate the cross-modal problem, thereby neglecting the alignment across modalities, which may lead to compromised results. To address this issue, we propose a novel framework based on a Conditional Variational autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, our CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences to regularize the outputs of the encoder and decoder, respectively. In the prior path, the model solely relies on visual information to predict the target text, whereas in the posterior path, it simultaneously encodes visual information and textual knowledge to reconstruct the target text. The first KL divergence optimizes the conditional variational autoencoder and regularizes the encoder outputs, while the second KL divergence performs a self-distillation from the posterior path to the prior path, ensuring the consistency of decoder outputs. We further enhance the integration of textual information into the posterior path by employing a shared Attention Residual Gaussian Distribution (ARGD), which considers the textual information in the posterior path as a residual component relative to the prior path. Extensive experiments conducted on public datasets demonstrate the effectiveness of our framework, achieving new state-of-the-art results while significantly alleviating the cross-modal representation discrepancy. The code and models are available at https://github.com/rzhao-zhsq/CV-SLT.



Paperid:2177
Authors:Ruilin Zhao, Feng Zhao, Liang Hu, Guandong Xu
Natural Language Processing and Knowledge Graph Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China Data Science and Machine Intelligence Lab, University of Technology Sydney, Sydney, Australia, Natural Language Processing and Knowledge Graph Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, College of Electronic and Information Engineering, Tongji University, Shanghai, China, Data Science and Machine Intelligence Lab, University of Technology Sydney, Sydney, Australia
Abstract:
Augmenting Language Models (LMs) with structured knowledge graphs (KGs) aims to leverage structured world knowledge to enhance the capability of LMs to complete knowledge-intensive tasks. However, existing methods are unable to effectively utilize the structured knowledge in a KG due to their inability to capture the rich relational semantics of knowledge triplets. Moreover, the modality gap between natural language text and KGs has become a challenging obstacle when aligning and fusing cross-modal information. To address these challenges, we propose a novel knowledge-augmented question answering (QA) model, namely, Graph Reasoning Transformers (GRT). Different from conventional node-level methods, the GRT treats knowledge triplets as atomic knowledge and utilizes a triplet-level graph encoder to capture triplet-level graph features. Furthermore, to alleviate the negative effect of the modality gap on joint reasoning, we propose a representation alignment pretraining to align the cross-modal representations and introduce a cross-modal information fusion module with attention bias to enable fine-grained information fusion. Extensive experiments conducted on three knowledge-intensive QA benchmarks show that the GRT outperforms the state-of-the-art KG-augmented QA systems, demonstrating the effectiveness and adaptability of our proposed model.



Paperid:2178
Authors:Tanglong Zhao, Ruifang He, Jing Xu, Bo Wang
College of Intelligence and Computing, Tianjin University, Tianjin, China Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China, College of Intelligence and Computing, Tianjin University, Tianjin, China Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China, College of Intelligence and Computing, Tianjin University, Tianjin, China Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China, College of Intelligence and Computing, Tianjin University, Tianjin, China Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China
Abstract:
Social summarization aims to provide summaries for a large number of social texts (called posts) about a single topic. To extract a summary, both the representation of posts and the summary selection method are crucial. Previous methods introduce social relations to enhance post embeddings and mitigate the sparse representation caused by their brief and informal expression. However, they ignore that there are multiple relations between posts. Besides, existing graph-based centrality calculation approaches tend to select posts from one aspect. This leads to facet bias, especially when there are multiple viewpoints. In this paper, we propose a model named MultiSum to improve social summarization. Specifically, 1) we use graph convolutional networks to fuse text content with social and semantic relations to improve post representation; 2) the similarity between the summary and all aspects is incorporated into the centrality score during the selection phase, encouraging the model to pay attention to different facets. Experimental results on English and Chinese corpora support the effectiveness of this model. Furthermore, external evaluations by human experts and large language models demonstrate the validity of MultiSum in facet coverage and redundancy reduction.



Paperid:2179
Authors:Xingqiang Zhao, Hai Wan, Kunxun Qi
School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University, School of Computer Science and Engineering, Sun Yat-sen University
Abstract:
Aspect-based sentiment analysis (ABSA) has attracted much attention due to its wide application scenarios. Most previous studies have focused solely on monolingual ABSA, posing a formidable challenge when extending ABSA applications to multilingual scenarios. In this paper, we study upgrading monolingual ABSA to cross-lingual ABSA. Existing methods usually exploit pre-trained cross-lingual language models to model cross-lingual ABSA, and enhance the model with translation data. However, low-resource languages might be under-represented during the pre-training phase, and the translation-enhanced methods heavily rely on the quality of the translation and label projection. Inspired by the observation that quantum entanglement can correlate multiple single systems, we map the monolingual expression to the quantum Hilbert space as a single quantum system, and then utilize quantum entanglement and quantum measurement to achieve cross-lingual ABSA. Specifically, we propose a novel quantum neural model named QPEN (short for quantum projection and quantum entanglement enhanced network). It is equipped with a proposed quantum projection module that projects aspects as quantum superpositions on a complex-valued Hilbert space. Furthermore, a quantum entanglement module is proposed in QPEN to share language-specific features between different languages without transmission. We conducted simulation experiments on a classical computer, and experimental results on the SemEval-2016 dataset demonstrate that our method achieves state-of-the-art performance in terms of F1-scores for five languages.



Paperid:2180
Authors:Hang Zheng, Qingsong Li, Shen Chen, Yuxuan Liang, Li Liu
School of Big Data and Software Engineering, Chongqing University, China, School of Big Data and Software Engineering, Chongqing University, China, School of Big Data and Software Engineering, Chongqing University, China, The Hong Kong University of Science and Technology (Guangzhou), China, School of Big Data and Software Engineering, Chongqing University, China
Abstract:
Recently, many works that incorporate external lexicon information into character-level Chinese named entity recognition (NER) to overcome the lack of natural word delimiters have achieved advanced performance. However, obtaining and maintaining high-quality lexicons is costly, especially in specialized domains. In addition, the entity boundary bias caused by high mention coverage in some boundary characters poses a significant challenge to the generalization of NER models but receives little attention in the existing literature. To address these issues, we propose SENCR, a Span Enhanced Two-stage Network with Counterfactual Rethinking for Chinese NER, which contains a boundary detector for boundary supervision, a convolution-based type classifier for better span representation, and a counterfactual rethinking (CR) strategy for debiased boundary detection at inference. The proposed boundary detector and type classifier are jointly trained with the same contextual encoder, and the trained boundary detector is then debiased by our proposed CR strategy without modifying any model parameters in the inference stage. Extensive experiments on four Chinese NER datasets show the effectiveness of our proposed approach.



Paperid:2181
Authors:Li Zheng, Hao Fei, Fei Li, Bobo Li, Lizi Liao, Donghong Ji, Chong Teng
Wuhan University, National University of Singapore, Wuhan University, Wuhan University, Singapore Management University, Wuhan University, Wuhan University
Abstract:
With the proliferation of dialogic data across the Internet, the Dialogue Commonsense Multi-choice Question Answering (DC-MCQ) task has emerged as a response to the challenge of comprehending user queries and intentions. Although prevailing methodologies exhibit effectiveness in addressing single-choice questions, they encounter difficulties in handling multi-choice queries due to the heightened intricacy and informational density. In this paper, inspired by the human cognitive process of progressively excluding options, we propose a three-step Reverse Exclusion Graph-of-Thought (ReX-GoT) framework, including Option Exclusion, Error Analysis, and Combine Information. Specifically, our ReX-GoT mimics human reasoning by gradually excluding irrelevant options and learning the reasons for option errors to choose the optimal path of the GoT and ultimately infer the correct answer. By progressively integrating intricate clues, our method effectively reduces the difficulty of multi-choice reasoning and provides a novel solution for DC-MCQ. Extensive experiments on the CICERO and CICERO_v2 datasets validate the significant improvement of our approach on the DC-MCQ task. In the zero-shot setting, our model outperforms the best baseline by 17.67% in terms of F1 score for the multi-choice task. Most strikingly, our GPT-3.5-based ReX-GoT framework achieves a remarkable 39.44% increase in F1 score.



Paperid:2182
Authors:Meizhen Zheng, Peng Bai, Xiaodong Shi, Xun Zhou, Yiting Yan
Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen, China Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen, China, Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen, China
Abstract:
Although singing voice synthesis (SVS) has made significant progress recently, Chinese opera synthesis, with its unique styles and various genres, requires greater attention but is rarely studied due to the lack of training data and its high expressiveness. In this work, we build a high-quality Gezi Opera (a type of Chinese opera popular in Fujian and Taiwan) audio-text alignment dataset and formulate specific data annotation methods applicable to Chinese operas. We propose FT-GAN, an acoustic model for fine-grained tune modeling in Chinese opera synthesis based on an empirical analysis of the differences between Chinese operas and pop songs. To further improve the quality of the synthesized opera, we propose a speech pre-training strategy for additional knowledge injection. The experimental results show that FT-GAN outperforms the strong baselines in SVS on the Gezi Opera synthesis task. Extensive experiments further verify that FT-GAN performs well on synthesis tasks for other operas such as Peking Opera. Audio samples, the dataset, and the codes are available at https://zhengmidon.github.io/FTGAN.github.io/.



Paperid:2183
Authors:Yafang Zheng, Lei Lin, Shuangtao Li, Yuxuan Yuan, Zhaohong Lai, Shan Liu, Biao Fu, Yidong Chen, Xiaodong Shi
Department of Artificial Intelligence, School of Informatics, Xiamen University Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Department of Artificial Intelligence, School of Informatics, Xiamen University Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China Kuaishou Technology, Beijing, China, Department of Artificial Intelligence, School of Informatics, Xiamen University Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Department of Artificial Intelligence, School of Informatics, Xiamen University Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Department of Artificial Intelligence, School of Informatics, Xiamen University Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Department of Artificial Intelligence, School of Informatics, Xiamen University Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Department of Artificial Intelligence, School of Informatics, Xiamen University Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Department of Artificial Intelligence, School of Informatics, Xiamen University Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China, Department of Artificial Intelligence, School of Informatics, Xiamen University Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
Abstract:
Existing neural models are demonstrated to struggle with compositional generalization (CG), i.e., the ability to systematically generalize to unseen compositions of seen components. A key reason for failure on CG is that the syntactic and semantic representations of sequences in the uppermost layers of both the encoder and decoder are entangled. However, previous work concentrates on separating the learning of syntax and semantics instead of exploring the reasons behind the representation entanglement (RE) problem in order to solve it. We explain why it exists by analyzing the representation evolving mechanism from the bottom to the top of the Transformer layers. We find that the "shallow" residual connections within each layer fail to fuse previous layers' information effectively, leading to information forgetting between layers and, in turn, the RE problem. Inspired by this, we propose LRF, a novel Layerwise Representation Fusion framework for CG, which learns to fuse previous layers' information back into the encoding and decoding process effectively by introducing a fuse-attention module at each encoder and decoder layer. LRF achieves promising results on two realistic benchmarks, empirically demonstrating the effectiveness of our proposal. Codes are available at https://github.com/thinkaboutzero/LRF.
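
The fuse-attention idea can be sketched as each token attending over its own representations from all earlier layers, so lower-layer information is folded back in rather than forgotten. The module below is an illustrative reading of that idea (head count and shapes are assumptions), not the LRF code:

    import torch
    import torch.nn as nn

    class FuseAttention(nn.Module):
        def __init__(self, d_model, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, current, previous_layers):
            # current: [batch, seq, d]; previous_layers: list of [batch, seq, d].
            b, s, d = current.shape
            # Per-token layer history: [batch * seq, n_layers, d].
            history = torch.stack(previous_layers + [current], dim=2).reshape(b * s, -1, d)
            query = current.reshape(b * s, 1, d)
            fused, _ = self.attn(query, history, history)  # each token attends over its layer history
            return fused.reshape(b, s, d)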



Paperid:2184
Authors:Yongqiang Zheng, Xia Li
School of Information Science and Technology Guangdong University of Foreign Studies, Guangzhou, China, School of Information Science and Technology Guangdong University of Foreign Studies, Guangzhou, China
Abstract:
Most existing aspect-based sentiment analysis (ABSA) models only predict the sentiment polarity of a single aspect at a time, focusing primarily on enhancing the representation of this single aspect based on the other contexts or aspects. This one-to-one paradigm ignores the fact that multi-aspect, multi-sentiment sentences contain not only distinct specific descriptions for distinct specific aspects, but also shared global context information for multiple aspects. To fully consider these issues, we propose a one-to-many ABSA framework, called You Only Read Once (YORO), that can simultaneously model representations of all aspects based on their specific descriptions and better fuse their relationships using globally shared contextual information in the sentence. Predicting the sentiment polarity of multiple aspects simultaneously is beneficial to improving computational and prediction efficiency. Extensive experiments are conducted on three public datasets (MAMS, Rest14, and Lap14). Experimental results demonstrate the effectiveness of YORO in handling multi-aspect, multi-sentiment scenarios and highlight the promise of one-to-many ABSA in balancing efficiency and accuracy.



Paperid:2185
Authors:Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, Yanlin Wang
Sun Yat-sen University, Sun Yat-sen University, Harbin Institute of Technology, KTH Royal Institute of Technology, Sun Yat-sen University
Abstract:
Large Language Models (LLMs) have drastically reshaped our interactions with artificial intelligence (AI) systems, showcasing impressive performance across an extensive array of tasks. Despite this, a notable hindrance remains: the deficiency of a long-term memory mechanism within these models. This shortfall becomes increasingly evident in situations demanding sustained interaction, such as personal companion systems, psychological counseling, and secretarial assistance. Recognizing the necessity for long-term memory, we propose MemoryBank, a novel memory mechanism tailored for LLMs. MemoryBank enables the models to summon relevant memories, continually evolve through continuous memory updates, and comprehend and adapt to a user's personality over time by synthesizing information from previous interactions. To mimic anthropomorphic behaviors and selectively preserve memory, MemoryBank incorporates a memory updating mechanism inspired by the Ebbinghaus Forgetting Curve theory. This mechanism permits the AI to forget and reinforce memories based on time elapsed and the relative significance of the memory, thereby offering a more human-like memory mechanism and an enriched user experience. MemoryBank is versatile in accommodating both closed-source models like ChatGPT and open-source models such as ChatGLM. To validate MemoryBank's effectiveness, we exemplify its application through the creation of an LLM-based chatbot named SiliconFriend in a long-term AI Companion scenario. Further tuned with psychological dialog data, SiliconFriend displays heightened empathy and discernment in its interactions. Experiments involve both qualitative analysis with real-world user dialogs and quantitative analysis with simulated dialogs. In the latter, ChatGPT acts as multiple users with diverse characteristics and generates long-term dialog contexts covering a wide array of topics. The results of our analysis reveal that SiliconFriend, equipped with MemoryBank, exhibits a strong capability for long-term companionship as it can provide empathetic responses, recall relevant memories, and understand user personality.
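
A toy version of an Ebbinghaus-style update rule is sketched below: each memory's retention decays exponentially with elapsed time, successful recall strengthens it, and weak memories are pruned. The decay form, doubling factor, and threshold are illustrative assumptions, not MemoryBank's actual parameters:

    import math, time

    class MemoryItem:
        def __init__(self, text, strength=1.0):
            self.text = text
            self.strength = strength          # larger strength = slower forgetting (in days)
            self.last_access = time.time()

        def retention(self, now=None):
            # Ebbinghaus-style exponential decay: r = exp(-elapsed / strength).
            elapsed_days = ((now or time.time()) - self.last_access) / 86400.0
            return math.exp(-elapsed_days / self.strength)

        def recall(self):
            # Reinforcement: each successful recall makes the memory decay more slowly.
            self.strength *= 2.0
            self.last_access = time.time()
            return self.text

    def prune(memories, threshold=0.1):
        # Forget memories whose estimated retention has dropped below the threshold.
        return [m for m in memories if m.retention() >= threshold]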



Paperid:2186
Authors:Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Jianbing Shen, Guodong Long, Can Xu, Daxin Jiang
SKL-IOTSC, CIS, University of Macau, AAII, FEIT, University of Technology Sydney, Microsoft Corporation, Microsoft Corporation, SKL-IOTSC, CIS, University of Macau, AAII, FEIT, University of Technology Sydney, Microsoft Corporation, Microsoft Corporation
Abstract:
Long document retrieval aims to fetch query-relevant documents from a large-scale collection, where knowledge distillation has become the de facto approach to improve a retriever by mimicking a heterogeneous yet powerful cross-encoder. However, in contrast to passages or sentences, retrieval on long documents suffers from the scope hypothesis that a long document may cover multiple topics. This maximizes their structural heterogeneity and poses a granular-mismatch issue, leading to inferior distillation efficacy. In this work, we propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers. While preserving the conventional dense retrieval paradigm, it first produces globally consistent representations across different fine granularities and then applies multi-granular aligned distillation merely during training. In experiments, we evaluate our framework on two long-document retrieval benchmarks, which show state-of-the-art performance.



Paperid:2187
Authors:Zhenhong Zhou, Jiuyang Xiang, Chaomeng Chen, Sen Su
Beijing University of Posts and Telecommunications, University of Michigan, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications
Abstract:
Large language models (LLMs) have been proven capable of memorizing their training data, which can be extracted through specifically designed prompts. As the scale of datasets continues to grow, privacy risks arising from memorization have attracted increasing attention. Quantifying language model memorization helps evaluate potential privacy risks. However, prior works on quantifying memorization require access to the precise original data or incur substantial computational overhead, making them difficult to apply to real-world language models. To this end, we propose a fine-grained, entity-level definition to quantify memorization with conditions and metrics closer to real-world scenarios. In addition, we also present an approach for efficiently extracting sensitive entities from autoregressive language models. We conduct extensive experiments based on the proposed definition, probing language models' ability to reconstruct sensitive entities under different settings. We find that language models have strong memorization at the entity level and are able to reproduce the training data even with partial leakage. The results demonstrate that LLMs not only memorize their training data but also understand associations between entities. These findings necessitate that trainers of LLMs exercise greater prudence regarding model memorization, adopting memorization mitigation techniques to preclude privacy violations.
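
A coarse way to probe entity-level reproduction is shown below: prompt an autoregressive LM with a prefix and check whether the greedy continuation contains the target entity verbatim. This is an illustrative proxy, not the paper's definition or metric; gpt2 is only a small stand-in model and the prefix/entity are made up:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def entity_reproduced(prefix, entity, max_new_tokens=30):
        # Greedy-decode a continuation and test for verbatim reproduction of `entity`.
        inputs = tok(prefix, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        continuation = tok.decode(out[0][inputs["input_ids"].shape[1]:])
        return entity in continuation

    print(entity_reproduced("The patient record lists the attending physician as", "John Smith"))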



Paperid:2188
Authors:Zihao Zhou, Qiufeng Wang, Mingyu Jin, Jie Yao, Jianan Ye, Wei Liu, Wei Wang, Xiaowei Huang, Kaizhu Huang
Xi'an Jiaotong-Liverpool University, Xi'an Jiaotong-Liverpool University, Northwestern University, Xi'an Jiaotong-Liverpool University, Xi'an Jiaotong-Liverpool University, ShanghaiTech University, Xi'an Jiaotong-Liverpool University, University of Liverpool, Duke Kunshan University
Abstract:
With the boom of Large Language Models (LLMs), research on solving Math Word Problems (MWPs) has recently made great progress. However, there are few studies examining the robustness of LLMs in math solving ability. Instead of attacking prompts in the use of LLMs, we propose a MathAttack model to attack MWP samples, which is closer to the essence of robustness in solving math problems. Compared to traditional text adversarial attacks, it is essential to preserve the mathematical logic of original MWPs during the attack. To this end, we propose logical entity recognition to identify logical entities, which are then frozen. Subsequently, the remaining text is attacked by adopting a word-level attacker. Furthermore, we propose a new dataset RobustMath to evaluate the robustness of LLMs in math solving ability. Extensive experiments on our RobustMath and two other math benchmark datasets, GSM8K and MultiArith, show that MathAttack can effectively attack the math solving ability of LLMs. In the experiments, we observe that (1) our adversarial samples from higher-accuracy LLMs are also effective for attacking LLMs with lower accuracy (e.g., transfer from larger to smaller-size LLMs, or from few-shot to zero-shot prompts); (2) complex MWPs (such as more solving steps, longer text, more numbers) are more vulnerable to attack; (3) we can improve the robustness of LLMs by using our adversarial samples in few-shot prompts. Finally, we hope our practice and observation can serve as an important attempt towards enhancing the robustness of LLMs in math solving ability. The code and dataset are available at: https://github.com/zhouzihao501/MathAttack.
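
As a toy sketch of the freeze-then-attack idea, the code below treats plain numbers as the logical entities to freeze and greedily tries word substitutions elsewhere; the paper's logical entity recognizer and word-level attacker are more sophisticated, and the synonym source and the solver_is_fooled callback here are placeholders.

import re

def freeze_logical_entities(words):
    """Mark tokens that carry the math logic (here simply numbers) as frozen."""
    return [bool(re.fullmatch(r"\d+(?:\.\d+)?", w)) for w in words]

def word_level_attack(words, synonyms, solver_is_fooled):
    """Try single-word substitutions on non-frozen tokens until the solver fails."""
    frozen = freeze_logical_entities(words)
    for i, w in enumerate(words):
        if frozen[i]:
            continue                                  # never perturb the mathematical logic
        for sub in synonyms.get(w, []):               # e.g. nearest neighbours in an embedding space
            trial = words[:i] + [sub] + words[i + 1:]
            if solver_is_fooled(" ".join(trial)):
                return " ".join(trial)                # adversarial MWP found
    return None                                       # no successful attack under this budget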



Paperid:2189
Authors:Hai Zhu, Qingyang Zhao, Weiwei Shang, Yuren Wu, Kai Liu
University of Science and Technology of China Ping An Technology, Xidian University, University of Science and Technology of China, Ping An Technology, Lazada
Abstract:
Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks adopt model internal information (gradients or confidence scores) to generate adversarial examples. However, this information is unavailable in the real world. Therefore, we focus on a more realistic and challenging setting, named hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms tend to initialize adversarial examples by random substitution and then utilize complex heuristic algorithms to optimize the adversarial perturbation. These methods require a lot of model queries and the attack success rate is restricted by adversary initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate word importance ranking, and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance compared with existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on large language models and some defense methods, and results indicate that adversarial examples remain a significant threat to large language models. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training.
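
The word-importance step can be sketched as follows under hard-label access: perturb the text with random word dropout, query the victim for labels only, and fit a local linear surrogate whose weights rank the words (beam search over substitutions would then follow). The sampling scheme and surrogate below are simplified assumptions, not the paper's exact procedure.

import numpy as np

def lime_word_importance(words, query_label, n_samples=200, keep_prob=0.7, seed=0):
    """Rank words by importance using only hard-label queries, LIME-style:
    sample random word-dropout masks, record whether the predicted label flips,
    and fit a local linear surrogate whose weights rank the words."""
    original = query_label(" ".join(words))
    rng = np.random.default_rng(seed)
    masks, flips = [], []
    for _ in range(n_samples):
        keep = rng.random(len(words)) < keep_prob
        kept_text = " ".join(w for w, k in zip(words, keep) if k)
        masks.append(keep.astype(float))
        flips.append(1.0 if query_label(kept_text) != original else 0.0)
    X, y = np.array(masks), np.array(flips)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)   # local surrogate weights
    return np.argsort(-np.abs(weights))               # word indices, most important first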



Paperid:2190
Authors:Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai
University of Science and Technology of China, University of Science and Technology of China (USTC), Tencent AI Lab, Nanyang Technological University, University of Science and Technology of China
Abstract:
Self-supervised speech pre-training methods have developed rapidly in recent years and have proven very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing suffers from the scarcity of labeled multichannel data and complex ambient noises. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose the multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs. First, we propose a multi-path structure to process multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the rich information in multichannel speech data. Second, based on contrastive learning, we use additional single-channel audio data, which is trained jointly to improve the performance of the multichannel multi-modal representation. Finally, we use a Chinese multichannel multi-modal dataset collected in real scenarios to validate the effectiveness of the proposed method on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks.



Paperid:2191
Authors:Zhihong Zhu, Xuxin Cheng, Yaowei Li, Hongxiang Li, Yuexian Zou
Peking University, Peking University, Peking University, Peking University, Peking University
Abstract:
Multi-intent spoken language understanding (SLU) has garnered growing attention due to its ability to handle multiple intent utterances, which closely mirrors practical scenarios. Unlike traditional SLU, each intent in multi-intent SLU corresponds to its designated scope for slots, which occurs in certain fragments within the utterance. As a result, establishing precise scope alignment to mitigate noise impact emerges as a key challenge in multi-intent SLU. More seriously, existing methods lack alignment between the predictions of the two sub-tasks due to task-independent decoding, resulting in a limitation on the overall performance. To address these challenges, we propose a novel framework termed Aligner² for multi-intent SLU, which contains an Adjustive Cross-task Aligner (ACA) and a Forced Cross-task Aligner (FCA). ACA utilizes the information conveyed by joint label embeddings to accurately align the scope of intents and corresponding slots, before the interaction of the two subtasks. FCA introduces reinforcement learning to enforce the alignment of the task-specific hidden states after the interaction, which is explicitly guided by the prediction. Extensive experiments on two public multi-intent SLU datasets demonstrate the superiority of our Aligner² over state-of-the-art methods. More encouragingly, the proposed method Aligner² can be easily integrated into existing multi-intent SLU frameworks to further boost performance.



Paperid:2192
Authors:Xianwei Zhuang, Xuxin Cheng, Yuexian Zou
School of ECE, Peking University, China, School of ECE, Peking University, China, School of ECE, Peking University, China
Abstract:
Recent joint models for multi-intent detection and slot filling have obtained promising results through modeling the unidirectional or bidirectional guidance between intent and slot. However, existing works design joint models heuristically and lack theoretical exploration, including (1) theoretical measurement of the joint-interaction quality; (2) explainability of the design and optimization methods of joint models, which may limit the performance and efficiency of designs. In this paper, we mathematically define the cross-task information gain (CIG) to measure the quality of joint processes from an information-theoretic perspective and discover an implicit optimization of CIG in previous models. Based on this, we propose a novel multi-stage iterative framework with theoretical effectiveness, explainability, and convergence, which can explicitly optimize information for cross-task interactions. Further, we devise an information-based joint model (InfoJoint) that conforms to this theoretical framework to gradually reduce the cross-task propagation of erroneous semantics through iterative CIG maximization. Extensive experimental results on two public datasets show that InfoJoint outperforms the state-of-the-art models by a large margin.



Paperid:2193
Authors:Linlin Zong, Jiahui Wan, Xianchao Zhang, Xinyue Liu, Wenxin Liang, Bo Xu
Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Dalian University of Technology
Abstract:
Video question answering involves understanding video content to generate accurate answers to questions. Recent studies have successfully modeled video features and achieved diverse multimodal interaction, yielding impressive outcomes. However, they have overlooked the fact that the video contains richer instances and events beyond the scope of the stated question. Extremely imbalanced alignment of information from both sides leads to significant instability in reasoning. To address this concern, we propose the Video-Context Aligned Transformer (V-CAT), which leverages the context to achieve semantic and content alignment between video and question. Specifically, the video and text are initially encoded into a shared semantic space. We apply contrastive learning to the global video token and the context token to enhance the semantic alignment. Then, the pooled context feature is utilized to obtain the corresponding visual content. Finally, the answer is decoded by integrating the refined video and question features. We evaluate the effectiveness of V-CAT on the MSVD-QA and MSRVTT-QA datasets, achieving state-of-the-art performance on both. Extended experiments further analyze and demonstrate the effectiveness of each proposed module.
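
The contrastive step between the global video token and the context token can be sketched with a standard symmetric InfoNCE loss, as below; the temperature and the exact formulation used in the paper may differ.

import torch
import torch.nn.functional as F

def video_context_info_nce(video_tokens, context_tokens, temperature=0.07):
    """Symmetric InfoNCE: pull each sample's global video token towards its own
    context token and away from the other samples in the batch."""
    v = F.normalize(video_tokens, dim=-1)      # (B, D) global video tokens
    c = F.normalize(context_tokens, dim=-1)    # (B, D) context tokens
    logits = v @ c.t() / temperature           # (B, B) cosine-similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))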



Paperid:2194
Authors:Allen Chang, Matthew C. Fontaine, Serena Booth, Maja J. Matarić, Stefanos Nikolaidis
University of Southern California, University of Southern California, Massachusetts Institute of Technology, University of Southern California, University of Southern California
Abstract:
Generative models can serve as surrogates for some real data sources by creating synthetic training datasets, but in doing so they may transfer biases to downstream tasks. We focus on protecting quality and diversity when generating synthetic training datasets. We propose quality-diversity generative sampling (QDGS), a framework for sampling data uniformly across a user-defined measure space, despite the data coming from a biased generator. QDGS is a model-agnostic framework that uses prompt guidance to optimize a quality objective across measures of diversity for synthetically generated data, without fine-tuning the generative model. Using balanced synthetic datasets generated by QDGS, we first debias classifiers trained on color-biased shape datasets as a proof-of-concept. By applying QDGS to facial data synthesis, we prompt for desired semantic concepts, such as skin tone and age, to create an intersectional dataset with a combined blend of visual features. Leveraging this balanced data for training classifiers improves fairness while maintaining accuracy on facial recognition benchmarks. Code available at: https://github.com/Cylumn/qd-generative-sampling.



Paperid:2195
Authors:Wenshuo Chao, Zhaopeng Qiu, Likang Wu, Zhuoning Guo, Zhi Zheng, Hengshu Zhu, Hao Liu
The Hong Kong University of Science and Technology (Guangzhou) Career Science Lab, BOSS Zhipin, Career Science Lab, BOSS Zhipin, University of Science and Technology of China Career Science Lab, BOSS Zhipin, The Hong Kong University of Science and Technology (Guangzhou), University of Science and Technology of China Career Science Lab, BOSS Zhipin, Career Science Lab, BOSS Zhipin The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology (Guangzhou)
Abstract:
The rapidly changing landscape of technology and industries leads to dynamic skill requirements, making it crucial for employees and employers to anticipate such shifts to maintain a competitive edge in the labor market. Existing efforts in this area either rely on domain-expert knowledge or regard skill evolution as a simplified time-series forecasting problem. However, both approaches overlook the sophisticated relationships among different skills and the interconnection between skill demand and supply variations. In this paper, we propose a Cross-view Hierarchical Graph learning Hypernetwork (CHGH) framework for joint skill demand-supply prediction. Specifically, CHGH is an encoder-decoder network consisting of i) a cross-view graph encoder to capture the interconnection between skill demand and supply, ii) a hierarchical graph encoder to model the co-evolution of skills from a cluster-wise perspective, and iii) a conditional hyper-decoder to jointly predict demand and supply variations by incorporating historical demand-supply gaps. Extensive experiments on three real-world datasets demonstrate the superiority of the proposed framework compared to seven baselines and the effectiveness of the three modules.



Paperid:2196
Authors:Qiuyu Duan, Zhongyun Hua, Qing Liao, Yushu Zhang, Leo Yu Zhang
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology, Shenzhen Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Nanjing University of Aeronautics and Astronautics, Griffith University
Abstract:
Deep neural network (DNN) models have been proven vulnerable to backdoor attacks. One trend of backdoor attacks is developing more invisible and dynamic triggers to make attacks stealthier. However, these invisible and dynamic triggers can be inadvertently mitigated by some widely used passive denoising operations, such as image compression, making the efforts under this trend questionable. Another trend is to exploit the full potential of backdoor attacks by proposing new triggering paradigms, such as hibernated or opportunistic backdoors. In line with these trends, our work investigates the first conditional backdoor attack, where the backdoor is activated by a specific condition rather than predefined triggers. Specifically, we take JPEG compression as our condition and jointly optimize the compression operator and the target model's loss function, which can force the target model to accurately learn the JPEG compression behavior as the triggering condition. In this case, besides the conditional triggering feature, our attack is also stealthy and robust to denoising operations. Extensive experiments on the MNIST, GTSRB, and CelebA datasets verify our attack's effectiveness, stealthiness, and resistance to existing backdoor defenses and denoising operations. As a new triggering paradigm, the conditional backdoor attack brings a new angle for assessing the vulnerability of DNN models, and conditioning on JPEG compression magnifies its threat due to the universal usage of JPEG.
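
A minimal sketch of the conditional poisoning objective: clean images keep their true labels while their JPEG-compressed counterparts are pushed towards the attacker's target class, so compression itself acts as the trigger condition. The fixed quality factor and the simple loss combination below are assumptions; the paper additionally optimizes the compression operator jointly.

import io
import torch
from PIL import Image
from torchvision import transforms

to_pil, to_tensor = transforms.ToPILImage(), transforms.ToTensor()

def jpeg_compress(img, quality=75):
    """Round-trip a single image tensor through real JPEG compression."""
    buf = io.BytesIO()
    to_pil(img.cpu()).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return to_tensor(Image.open(buf)).to(img.device)

def conditional_backdoor_loss(model, x, y, target_class, criterion):
    """Clean images keep their labels; their JPEG-compressed versions are pushed
    towards the attacker's target class, making compression the trigger condition."""
    x_jpeg = torch.stack([jpeg_compress(img) for img in x])
    y_target = torch.full_like(y, target_class)
    return criterion(model(x), y) + criterion(model(x_jpeg), y_target)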



Paperid:2197
Authors:Dashan Gao, Sheng Wan, Lixin Fan, Xin Yao, Qiang Yang
Southern University of Science and Technology, Shenzhen, China Hong Kong University of Science and Technology, Hong Kong SAR, China, Southern University of Science and Technology, Shenzhen, China Hong Kong University of Science and Technology, Hong Kong SAR, China, WeBank AI Lab, Shenzhen, China, Southern University of Science and Technology, Shenzhen, China, Hong Kong University of Science and Technology, Hong Kong SAR, China
Abstract:
Vertical Federated Learning (VFL) enables an active party with labeled data to enhance model performance (utility) by collaborating with multiple passive parties that possess auxiliary features corresponding to the same sample identifiers (IDs). Model serving in VFL is vital for real-world, delay-sensitive applications, and it faces two major challenges: 1) robustness against arbitrarily-aligned data and stragglers; and 2) privacy protection, ensuring minimal label leakage to passive parties. Existing methods fail to transfer knowledge among parties to improve robustness in a privacy-preserving way. In this paper, we introduce a privacy-preserving knowledge transfer framework, Complementary Knowledge Distillation (CKD), designed to enhance the robustness and privacy of multi-party VFL systems. Specifically, we formulate a Complementary Label Coding (CLC) objective to encode only complementary label information of the active party's local model for passive parties to learn. Then, CKD selectively transfers the CLC-encoded complementary knowledge 1) from the passive parties to the active party, and 2) among the passive parties themselves. Experimental results on four real-world datasets demonstrate that CKD outperforms existing approaches in terms of robustness against arbitrarily-aligned data, while also minimizing label privacy leakage.



Paperid:2198
Authors:Rebecca Gelles, Veronica Kinoshita, Micah Musser, James Dunham
Center for Security and Emerging Technology, Georgetown University, Center for Security and Emerging Technology, Georgetown University, Center for Security and Emerging Technology, Georgetown University, Center for Security and Emerging Technology, Georgetown University
Abstract:
Access to compute is widely viewed as a primary barrier to AI research progress. Compute resource stratification between academic and industry researchers is therefore a source of concern. Yet the experiences of researchers who might encounter resource constraints in their work have received no direct study. We addressed this gap by conducting a large survey of AI researchers that posed questions about project inputs, outcomes, and challenges. Contrary to popular narratives, responses from more than 500 participants revealed more concern about talent and data limitations than compute access. There were few differences between academic and industry researchers in this regard. The exception was researchers who already use large amounts of compute and expressed a need for more. These findings suggest that interventions to subsidize compute without addressing the limitations on talent and data availability reported by our respondents might cause or exacerbate commonly cited resource inequalities, with unknown impact on the future of equitable research.



Paperid:2199
Authors:Soumya Suvra Ghosal, Yiyou Sun, Yixuan Li
University of Wisconsin, Madison, University of Wisconsin, Madison, University of Wisconsin, Madison
Abstract:
Machine learning models deployed in the wild can be challenged by out-of-distribution (OOD) data from unknown classes. Recent advances in OOD detection rely on distance measures to distinguish samples that are relatively far away from the in-distribution (ID) data. Despite the promise, distance-based methods can suffer from the curse-of-dimensionality problem, which limits their efficacy in high-dimensional feature spaces. To combat this problem, we propose a novel framework, Subspace Nearest Neighbor (SNN), for OOD detection. In training, our method regularizes the model and its feature representation by leveraging the most relevant subset of dimensions (i.e., subspace). The subspace learning yields highly distinguishable distance measures between ID and OOD data. We provide comprehensive experiments and ablations to validate the efficacy of SNN. Compared to the current best distance-based method, SNN reduces the average FPR95 by 15.96% on the CIFAR-100 benchmark.
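
A simplified view of subspace nearest-neighbor scoring at test time is sketched below: restrict distances to a subset of feature dimensions and use the k-th nearest-neighbor distance to the ID feature bank as the OOD score. Selecting dimensions by mean ID activation is only an illustrative stand-in for the learned subspace in the paper.

import numpy as np

def subspace_knn_score(id_feature_bank, test_feature, subspace_dim=128, k=10):
    """OOD score: k-th nearest-neighbour distance to the ID feature bank,
    computed only over a subset of feature dimensions (the 'subspace')."""
    dims = np.argsort(-id_feature_bank.mean(axis=0))[:subspace_dim]   # illustrative subspace choice
    bank = id_feature_bank[:, dims]
    query = test_feature[dims]
    dists = np.linalg.norm(bank - query, axis=1)
    return np.sort(dists)[k - 1]        # larger distance => more likely OOD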



Paperid:2200
Authors:Xiaoyuan Guan, Jiankang Chen, Shenshen Bu, Yuren Zhou, Wei-Shi Zheng, Ruixuan Wang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China Key Laboratory of Machine Intelligence and Advanced Computing, MOE, Guangzhou, China Peng Cheng Laboratory, Shenzhen, China
Abstract:
Recent studies on out-of-distribution (OOD) detection focus on designing models or scoring functions that can effectively distinguish between unseen OOD data and in-distribution (ID) data. In this paper, we propose a simple yet novel approach to OOD detection by leveraging the phenomenon that the average of feature vector elements from a convolutional neural network (CNN) is typically larger for ID data than for OOD data. Specifically, the average of feature vector elements is used as part of the scoring function to further separate OOD data from ID data. We also provide mathematical analysis to explain this phenomenon. Experimental evaluations demonstrate that, when combined with a strong baseline, our method can achieve state-of-the-art performance on several OOD detection benchmarks. Furthermore, our method can be easily integrated into various CNN architectures and requires less computation. Source code address: https://github.com/SYSU-MIA-GROUP/statistical_discrepancy_ood.
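
The scoring idea is easy to sketch: combine a standard confidence score with the mean of the penultimate feature vector, which the abstract reports to be larger for ID inputs. The energy baseline and the weighting below are assumptions, not the paper's exact scoring function.

import numpy as np

def ood_score(logits, penultimate_features, alpha=1.0):
    """Combine an energy-style confidence score with the mean of the penultimate
    feature vector, which tends to be larger for ID than for OOD inputs."""
    energy = np.logaddexp.reduce(logits)           # log-sum-exp of the logits
    feature_mean = penultimate_features.mean()     # the statistic exploited here
    return energy + alpha * feature_mean           # higher score => more likely ID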



Paperid:2201
Authors:Hao Jiang, Tien Mai, Pradeep Varakantham, Huy Hoang
Singapore Management University, Singapore Management University, Singapore Management University, Singapore Management University
Abstract:
Constrained Reinforcement Learning employs trajectory-based cost constraints (such as expected cost, Value at Risk, or Conditional VaR cost) to compute safe policies. The challenge lies in handling these constraints effectively while optimizing expected reward. Existing methods convert such trajectory-based constraints into local cost constraints, but they rely on cost estimates, leading to either aggressive or conservative solutions with regard to cost. We propose an unconstrained formulation that employs reward penalties over states augmented with costs to compute safe policies. Unlike standard primal-dual methods, our approach penalizes only infeasible trajectories through state augmentation. This ensures that increasing the penalty parameter always guarantees a feasible policy, a feature lacking in primal-dual methods. Our approach exhibits strong empirical performance and theoretical properties, offering a fresh paradigm for solving complex Constrained RL problems, including rich constraints like expected cost, Value at Risk, and Conditional Value at Risk. Our experimental results demonstrate superior performance compared to leading approaches across various constraint types on multiple benchmark problems.
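
A minimal sketch of the state-augmentation idea, assuming an environment whose step() also returns a per-step cost: the accumulated cost becomes part of the observation, and a penalty is applied only when a finished trajectory exceeds the cost budget, so raising the penalty pushes the learned policy towards feasibility.

class CostAugmentedEnv:
    """Wraps an environment whose step() also returns a per-step cost, so that
    the accumulated cost becomes part of the observation and only infeasible
    trajectories (total cost above the budget) are penalised."""

    def __init__(self, env, budget, penalty):
        self.env, self.budget, self.penalty = env, budget, penalty
        self.total_cost = 0.0

    def reset(self):
        self.total_cost = 0.0
        return (self.env.reset(), self.total_cost)

    def step(self, action):
        obs, reward, cost, done = self.env.step(action)   # assumed interface, not a real library API
        self.total_cost += cost
        if done and self.total_cost > self.budget:
            reward -= self.penalty            # penalise only infeasible trajectories
        return (obs, self.total_cost), reward, done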



Paperid:2202
Authors:Junli Jiang, Pavel Naumov
Southwest University, University of Southampton
Abstract:
In many real-world situations, there is often not enough information to know that a certain strategy will succeed in achieving the goal, but there is a good reason to believe that it will. The paper introduces the term "doxastic" for such strategies. The main technical contribution is a sound and complete logical system that describes the interplay between doxastic strategy and belief modalities.



Paperid:2203
Authors:Zi Liang, Pinghui Wang, Ruofei Zhang, Nuo Xu, Shuo Zhang, Lifeng Xing, Haitao Bai, Ziyang Zhou
Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University, Xi'an Jiaotong University
Abstract:
The drastic increase in language models' parameters has led to a new trend of deploying models on cloud servers, raising growing concerns about private inference for Transformer-based models. Existing two-party privacy-preserving techniques, however, only take into account natural language understanding (NLU) scenarios. Private inference in natural language generation (NLG), crucial for applications like translation and code completion, remains underexplored. In addition, previous privacy-preserving techniques suffer from convergence issues during model training and exhibit poor inference speed when used with NLG models due to the neglect of time-consuming operations in auto-regressive generation. To address these issues, we propose MERGE, a fast private text generation framework for Transformer-based language models. MERGE reuses the output hidden state as the word embedding to bypass the embedding computation and reorganizes the linear operations in the Transformer module to accelerate the forward procedure. Extensive experiments show that MERGE achieves a 26.5x speedup over the vanilla encrypted model at sequence length 512 and reduces communication cost by 80%, with an up to 10x speedup over state-of-the-art approximated models.



Paperid:2204
Authors:Xinwei Liu, Xiaojun Jia, Jindong Gu, Yuan Xun, Siyuan Liang, Xiaochun Cao
SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, Nanyang Technological University, Singapore, University of Oxford, UK, SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences, School of Computing, National University of Singapore, Singapore, School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen, China
Abstract:
The field of few-shot learning (FSL) has shown promising results in scenarios where training data is limited, but its vulnerability to backdoor attacks remains largely unexplored. We explore this topic by first evaluating the performance of existing backdoor attack methods in few-shot learning scenarios. Unlike in standard supervised learning, existing backdoor attack methods fail to perform an effective attack in FSL due to two main issues. Firstly, the model tends to overfit to either benign features or trigger features, causing a tough trade-off between attack success rate and benign accuracy. Secondly, due to the small number of training samples, the dirty label or visible trigger in the support set can be easily detected by victims, which reduces the stealthiness of attacks. It might seem that FSL could survive backdoor attacks. However, in this paper, we propose the Few-shot Learning Backdoor Attack (FLBA) to show that FSL can still be vulnerable to backdoor attacks. Specifically, we first generate a trigger to maximize the gap between poisoned and benign features. This enables the model to learn both benign and trigger features, which solves the problem of overfitting. To make the attack more stealthy, we hide the trigger by optimizing two types of imperceptible perturbation, namely attractive and repulsive perturbation, instead of attaching the trigger directly. Once we obtain the perturbations, we can poison all samples in the benign support set into a hidden poisoned support set and fine-tune the model on it. Our method demonstrates a high Attack Success Rate (ASR) in FSL tasks with different few-shot learning paradigms while preserving clean accuracy and maintaining stealthiness. This study reveals that few-shot learning still suffers from backdoor attacks, and its security should be given attention.



Paperid:2205
Authors:Di Mi, Yanjun Zhang, Leo Yu Zhang, Shengshan Hu, Qi Zhong, Haizhuan Yuan, Shirui Pan
Xiangtan University, University of Technology Sydney, Griffith University, Huazhong University of Science and Technology, City University of Macau, Xiangtan University, Griffith University
Abstract:
Model extraction attacks (MEAs) enable an attacker to replicate the functionality of a victim deep neural network (DNN) model by only querying its API service remotely, posing a severe threat to the security and integrity of pay-per-query DNN-based services. Although the majority of current research on MEAs has primarily concentrated on neural classifiers, there is a growing prevalence of image-to-image translation (I2IT) tasks in our everyday activities. However, techniques developed for MEA of DNN classifiers cannot be directly transferred to the case of I2IT, rendering the vulnerability of I2IT models to MEA attacks often underestimated. This paper unveils the threat of MEA in I2IT tasks from a new perspective. Diverging from the traditional approach of bridging the distribution gap between attacker queries and victim training samples, we opt to mitigate the effect caused by the different distributions, known as the domain shift. This is achieved by introducing a new regularization term that penalizes high-frequency noise, and by seeking a flatter minimum to avoid overfitting to the shifted distribution. Extensive experiments on different image translation tasks, including image super-resolution and style transfer, are performed on different backbone victim models, and the new design consistently outperforms the baseline by a large margin across all metrics. A few real-life I2IT APIs are also verified to be extremely vulnerable to our attack, emphasizing the need for enhanced defenses and potentially revised API publishing policies.
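
One plausible form of the high-frequency regularizer is an FFT-based penalty on spectral energy outside a low-frequency disc, added to the attacker's extraction loss; the radius and exact formulation below are assumptions, and the paper's flat-minimum component is not shown.

import torch

def high_frequency_penalty(img, radius_frac=0.25):
    """Penalise spectral energy outside a low-frequency disc of the 2-D FFT,
    discouraging the surrogate model from fitting high-frequency noise."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    dist = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2).float().sqrt()
    high = dist > radius_frac * min(h, w)    # mask of high-frequency bins
    return spec.abs()[..., high].mean()      # add this term to the extraction loss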



Paperid:2206
Authors:Tao Qi, Huili Wang, Yongfeng Huang
Tsinghua University, Tsinghua University, Tsinghua University Zhongguancun Laboratory
Abstract:
Robustness and privacy protection are two important factors of trustworthy federated learning (FL). Existing FL works usually secure data privacy by perturbing local model gradients via the differential privacy (DP) technique, or defend against poisoning attacks by filtering out the local gradients in the outlier of the gradient distribution before aggregation. However, these two issues are often addressed independently in existing works, and how to secure federated learning in both privacy and robustness still needs further exploration. In this paper, we unveil that although DP noisy perturbation can improve the learning robustness, DP-FL frameworks are not inherently robust and are vulnerable to a carefully-designed attack method. Furthermore, we reveal that it is challenging for existing robust FL methods to defend against attacks on DP-FL. This can be attributed to the fact that the local gradients of DP-FL are perturbed by random noise, and the selected central gradients inevitably incorporate a higher proportion of poisoned gradients compared to conventional FL. To address this problem, we further propose a new defense method for DP-FL (named Robust-DPFL), which can effectively distinguish poisoned and clean local gradients in DP-FL and robustly update the global model. Experiments on three benchmark datasets demonstrate that baseline methods cannot ensure task accuracy, data privacy, and robustness simultaneously, while Robust-DPFL can effectively enhance the privacy protection and robustness of federated learning while maintaining task performance.



Paperid:2207
Authors:Qi SHI
University of Southampton
Abstract:
Two different forms of responsibility, counterfactual and seeing-to-it, have been extensively discussed in philosophy and AI in the context of a single agent or multiple agents acting simultaneously. Although the generalisation of counterfactual responsibility to a setting where multiple agents act in some order is relatively straightforward, the same cannot be said about seeing-to-it responsibility. Two versions of seeing-to-it modality applicable to such settings have been proposed in the literature. Neither of them perfectly captures the intuition of responsibility. The paper proposes a definition of seeing-to-it responsibility for such settings that amalgamates the two modalities. The paper shows that the newly proposed notion of responsibility and counterfactual responsibility are not definable through each other and studies the responsibility gap for these two forms of responsibility. It shows that although these two forms of responsibility are not enough to ascribe responsibility in each possible situation, this gap does not exist if higher-order responsibility is taken into account.



Paperid:2208
Authors:Daman Deep Singh, Amit Kumar, Abhijnan Chakraborty
Indian Institute of Technology Delhi, India, Indian Institute of Technology Delhi, India, Indian Institute of Technology Delhi, India
Abstract:
The k-SERVER problem is one of the most prominent problems in online algorithms, with several variants and extensions. However, simplifying assumptions like instantaneous server movements and zero service time have hitherto limited its applicability to real-world problems. In this paper, we introduce a realistic generalization of k-SERVER without such assumptions – the k-FOOD problem, where requests with source-destination locations and an associated pickup time window arrive in an online fashion, and each has to be served by exactly one of the available k servers. The k-FOOD problem offers the versatility to model a variety of real-world use cases such as food delivery, ride sharing, and quick commerce. Moreover, motivated by the need for fairness in online platforms, we introduce the FAIR k-FOOD problem with the max-min objective. We establish that both the k-FOOD and FAIR k-FOOD problems are strongly NP-hard and develop an optimal offline algorithm that arises naturally from a time-expanded flow network. Subsequently, we propose an online algorithm DOC4FOOD involving virtual movements of servers to the nearest request location. Experiments on a real-world food-delivery dataset, alongside synthetic datasets, establish the efficacy of the proposed algorithm against state-of-the-art fair food delivery algorithms.
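
A toy sketch of the online assignment step: serve each arriving request with the nearest currently free server on a grid. The virtual-movement bookkeeping of DOC4FOOD, pickup time windows, and the max-min fairness objective of FAIR k-FOOD are deliberately omitted here.

def greedy_assign(servers, request, now):
    """Serve the arriving request with the nearest currently free server
    (Manhattan distance on a grid); returns the chosen server or None."""
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    free = [s for s in servers if s["busy_until"] <= now]
    if not free:
        return None                                  # no server available right now
    best = min(free, key=lambda s: dist(s["pos"], request["source"]))
    travel = dist(best["pos"], request["source"]) + dist(request["source"], request["dest"])
    best["pos"], best["busy_until"] = request["dest"], now + travel
    return best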



Paperid:2209
Authors:Taylor Sorensen, Liwei Jiang, Jena D. Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, Ximing Lu, Kavel Rao, Chandra Bhagavatula, Maarten Sap, John Tasioulas, Yejin Choi
University of Washington Allen Institute for AI, University of Washington Allen Institute for AI, Allen Institute for AI, Allen Institute for AI, University of Washington Allen Institute for AI, University of Washington Allen Institute for AI, Allen Institute for AI, University of Washington Allen Institute for AI, University of Washington, Allen Institute for AI, Carnegie Mellon University Allen Institute for AI, University of Oxford, University of Washington Allen Institute for AI
Abstract:
Human values are crucial to human decision-making. Value pluralism is the view that multiple correct values may be held in tension with one another (e.g., when considering lying to a friend to protect their feelings, how does one balance honesty with friendship?). As statistical learners, AI systems fit to averages by default, washing out these potentially irreducible value conflicts. To improve AI systems to better reflect value pluralism, the first-order challenge is to explore the extent to which AI systems can model pluralistic human values, rights, and duties as well as their interaction. We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations. ValuePrism’s contextualized values are generated by GPT-4 and deemed high-quality by human annotators 91% of the time. We conduct a large-scale study with annotators across diverse social and demographic backgrounds to try to understand whose values are represented. With ValuePrism, we build Value Kaleidoscope (or Kaleido), an open, light-weight, and structured language-based multi-task model that generates, explains, and assesses the relevance and valence (i.e., support or oppose) of human values, rights, and duties within a specific context. Humans prefer the sets of values output by our system over the teacher GPT-4, finding them more accurate and with broader coverage. In addition, we demonstrate that Kaleido can help explain variability in human decision-making by outputting contrasting values. Finally, we show that Kaleido’s representations transfer to other philosophical frameworks and datasets, confirming the benefit of an explicit, modular, and interpretable approach to value pluralism. We hope that our work will serve as a step to making more explicit the implicit values behind human decision-making and to steering AI systems to make decisions that are more in accordance with them.



Paperid:2210
Authors:Jazon Szabo, Natalia Criado, Jose Such, Sanjay Modgil
King's College London, VRAIN, Universitat Politecnica de Valencia, King's College London VRAIN, Universitat Politecnica de Valencia, King's College London
Abstract:
While there is universal agreement that agents ought to act ethically, there is no agreement as to what constitutes ethical behaviour. To address this problem, recent philosophical approaches to 'moral uncertainty' propose aggregation of multiple ethical theories to guide agent behaviour. However, one of the foundational proposals for aggregation - Maximising Expected Choiceworthiness (MEC) - has been criticised as being vulnerable to fanaticism: the problem of an ethical theory dominating agent behaviour despite low credence (confidence) in said theory. Fanaticism thus undermines the 'democratic' motivation for accommodating multiple ethical perspectives. The problem of fanaticism has not yet been mathematically defined. Representing moral uncertainty as an instance of social welfare aggregation, this paper contributes to the field of moral uncertainty by 1) formalising the problem of fanaticism as a property of social welfare functionals and 2) providing non-fanatical alternatives to MEC, i.e., Highest k-trimmed Mean and Highest Median.
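
The Highest k-trimmed Mean rule is straightforward to sketch: for each option, trim the k lowest and k highest choiceworthiness scores across theories before averaging, which caps the influence any single (possibly fanatical) theory can exert. Credence weighting is omitted in this unweighted sketch.

def highest_k_trimmed_mean(choiceworthiness, k):
    """Return the option whose k-trimmed mean choiceworthiness across ethical
    theories is highest; trimming bounds the influence of any single theory."""
    def trimmed_mean(scores):
        kept = sorted(scores)[k:len(scores) - k]     # drop the k lowest and k highest
        return sum(kept) / len(kept)
    return max(choiceworthiness, key=lambda opt: trimmed_mean(choiceworthiness[opt]))

# Example: three options scored by five ethical theories, k = 1.
scores = {"A": [9, 2, 3, 3, 2], "B": [4, 4, 4, 4, 4], "C": [1, 1, 1, 10, 1]}
print(highest_k_trimmed_mean(scores, k=1))           # -> B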



Paperid:2211
Authors:Ritwik Vashistha, Arya Farahi
University of Texas at Austin, University of Texas at Austin
Abstract:
With growing concerns regarding bias and discrimination in predictive models, the AI community has increasingly focused on assessing AI system trustworthiness. Conventionally, trustworthy AI literature relies on the probabilistic framework and calibration as prerequisites for trustworthiness. In this work, we depart from this viewpoint by proposing a novel trust framework inspired by the philosophy literature on trust. We present a precise mathematical definition of trustworthiness, termed U-trustworthiness, specifically tailored for a subset of tasks aimed at maximizing a utility function. We argue that a model’s U-trustworthiness is contingent upon its ability to maximize Bayes utility within this task subset. Our first set of results challenges the probabilistic framework by demonstrating its potential to favor less trustworthy models and introduce the risk of misleading trustworthiness assessments. Within the context of U-trustworthiness, we prove that properly-ranked models are inherently U-trustworthy. Furthermore, we advocate for the adoption of the AUC metric as the preferred measure of trustworthiness. By offering both theoretical guarantees and experimental validation, AUC enables robust evaluation of trustworthiness, thereby enhancing model selection and hyperparameter tuning to yield more trustworthy outcomes.



Paperid:2212
Authors:Mengjie Wu, Jingui Ma, Run Wang, Sidan Zhang, Ziyou Liang, Boheng Li, Chenhao Lin, Liming Fang, Lina Wang
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China, Xi'an Jiaotong University, Nanjing University of Aeronautics and Astronautics, Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China Zhengzhou Xinda Institute of Advanced Technology
Abstract:
In recent years, DeepFakes have posed severe threats and concerns to both individuals and celebrities, as realistic DeepFakes facilitate the spread of disinformation. Model attribution techniques aim at attributing the adopted forgery models of DeepFakes for provenance purposes and providing explainable results to DeepFake forensics. However, the existing model attribution techniques rely on the trace left in DeepFake creation, which can become futile if such traces are disrupted. We are motivated by the observation that certain traces used for model attribution appear in both the high-frequency and low-frequency domains and play divergent roles in model attribution. In this work, for the first time, we propose a novel training-free evasion attack, TraceEvader, in the most practical non-box setting. Specifically, TraceEvader injects universal imitated traces learned from wild DeepFakes into the high-frequency component and introduces adversarial blur into the low-frequency component, where the added distortion confuses the extraction of certain traces for model attribution. The comprehensive evaluation on 4 state-of-the-art (SOTA) model attribution techniques and fake images generated by 8 generative models, including generative adversarial networks (GANs) and diffusion models (DMs), demonstrates the effectiveness of our method. Overall, our TraceEvader achieves the highest average attack success rate of 79% and is robust against image transformations and dedicated denoising techniques, where the average attack success rate is still around 75%. Our TraceEvader confirms the limitations of current model attribution techniques and calls for the attention of DeepFake researchers and practitioners to develop more robust model attribution techniques.



Paperid:2213
Authors:Yi Xie, Jie Zhang, Shiqian Zhao, Tianwei Zhang, Xiaofeng Chen
Xidian University, Nanyang Technological University, Nanyang Technological University, Nanyang Technological University, Xidian University
Abstract:
While deep learning models have shown significant performance across various domains, their deployment needs extensive resources and advanced computing infrastructure. As a solution, Machine Learning as a Service (MLaaS) has emerged, lowering the barriers for users to release or productize their deep learning models. However, previous studies have highlighted potential privacy and security concerns associated with MLaaS, and one primary threat is model extraction attacks. To address this, many defense solutions have been proposed, but they suffer from unrealistic assumptions and generalization issues, making them less practical for reliable protection. Driven by these limitations, we introduce a novel defense mechanism, SAME, based on the concept of sample reconstruction. This strategy imposes minimal prerequisites on the defender's capabilities, eliminating the need for auxiliary Out-of-Distribution (OOD) datasets, user query history, white-box model access, and additional intervention during model training. It is compatible with existing active defense methods. Our extensive experiments corroborate the superior efficacy of SAME over state-of-the-art solutions. Our code is available at https://github.com/xythink/SAME.



Paperid:2214
Authors:Zipeng Ye, Wenjian Luo, Qi Zhou, Yubo Tang
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies Peng Cheng Laboratory, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
Abstract:
Distributed learning frameworks aim to train global models by sharing gradients among clients while preserving the data privacy of each individual client. However, extensive research has demonstrated that these learning frameworks do not absolutely ensure privacy, as training data can be reconstructed from shared gradients. Nevertheless, the existing privacy-breaking attack methods have certain limitations. Some are applicable only to small models, while others can only recover images at small batch sizes and low resolutions, or with low fidelity. Furthermore, when there are some data with the same label in a training batch, existing attack methods usually perform poorly. In this work, we successfully address the limitations of existing attacks in two steps. Firstly, we model the coefficient of variation (CV) of features and design an evolutionary algorithm based on the minimum CV to accurately reconstruct the labels of all training data. After that, we propose a stepwise gradient inversion attack, which dynamically adapts the objective function, thereby effectively and rationally promoting the convergence of attack results towards an optimal solution. With these two steps, our method is able to recover high-resolution images (224*224 pixels, from ImageNet and the Web) with high fidelity in distributed learning scenarios involving complex models and larger batch sizes. Experimental results demonstrate the superiority of our approach, reveal the potential vulnerabilities of the distributed learning paradigm, and emphasize the necessity of developing more secure mechanisms. Source code is available at https://github.com/MiLab-HITSZ/2023YeHFGradInv.



Paperid:2215
Authors:Dapeng Zhi, Peixin Wang, Cheng Chen, Min Zhang
Shanghai Key Laboratory for Trustworthy Computing, East China Normal University, University of Oxford, Shanghai Key Laboratory for Trustworthy Computing, East China Normal University, Shanghai Key Laboratory for Trustworthy Computing, East China Normal University
Abstract:
Deep Reinforcement Learning (DRL) has gained prominence as an effective approach for control systems. However, its practical deployment is impeded by state perturbations that can severely impact system performance. Addressing this critical challenge requires robustness verification of system performance, which involves tackling two quantitative questions: (i) how to establish guaranteed bounds for expected cumulative rewards, and (ii) how to determine tail bounds for cumulative rewards. In this work, we present the first approach for robustness verification of DRL-based control systems by introducing reward martingales, which offer a rigorous mathematical foundation to characterize the impact of state perturbations on system performance in terms of cumulative rewards. Our verified results provide provably quantitative certificates for the two questions. We then show that reward martingales can be implemented and trained via neural networks, against different types of control policies. Experimental results demonstrate that our certified bounds tightly enclose simulation outcomes on various DRL-based control systems, indicating the effectiveness and generality of the proposed approach.



Paperid:2216
Authors:Huixin Zhong, Eamonn O'Neill, Janina A. Hoffmann
Centre for Doctoral Training in Accountable, Responsible and Transparent AI, University of Bath, Centre for Doctoral Training in Accountable, Responsible and Transparent AI, University of Bath, Centre for Doctoral Training in Accountable, Responsible and Transparent AI, University of Bath
Abstract:
Article 5 of the European Union’s Artificial Intelligence Act is intended to regulate AI use to prevent potentially harmful consequences. Nevertheless, applying this legislation practically is likely to be challenging because of ambiguously used terminologies and because it fails to specify which manipulation techniques may be invoked by AI, potentially leading to significant harm. This paper aims to bridge this gap by defining key terms and demonstrating how AI may invoke these techniques, drawing from insights in psychology and behavioural economics. First, this paper provides definitions of the terms “subliminal techniques”, “manipulative techniques” and “deceptive techniques”. Second, we identify from the literature in cognitive psychology and behavioural economics three subliminal and five manipulative techniques and exemplify how AI might implement these techniques to manipulate users in real-world case scenarios. These illustrations may serve as a practical guide for stakeholders to detect cases of AI manipulation and consequently devise preventive measures. Article 5 has also been criticised for offering inadequate protection. We critically assess the protection offered by Article 5, proposing specific revisions to paragraph 1, points (a) and (b) of Article 5 to increase its protective effectiveness.



Paperid:2217
Authors:Zhanpeng Zhou, Wen Shen, Huixin Chen, Ling Tang, Yuefeng Chen, Quanshi Zhang
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Alibaba Group, Shanghai Jiao Tong University
Abstract:
We prove that when we perform a Taylor series expansion of the loss function, the BN operation blocks the influence of the first-order term and most of the influence of the second-order term of the loss. We also find that this problem is caused by the standardization phase of the BN operation. We believe that proving the blocking of certain loss terms provides an analytic perspective on potential defects of a deep model with BN operations, although the blocking problem is not fully equivalent to significant damage in all tasks on benchmark datasets. Experiments show that the BN operation significantly affects feature representations in specific tasks.
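
For reference, the quantities the abstract refers to can be written out as follows (a minimal LaTeX rendering of batch standardization and the Taylor expansion of the loss; the paper's actual blocking argument is not reproduced here).

% Batch normalization standardizes each feature over a mini-batch B:
\[
  \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
  \mu_B = \frac{1}{|B|}\sum_{j \in B} x_j, \qquad
  \sigma_B^2 = \frac{1}{|B|}\sum_{j \in B} (x_j - \mu_B)^2 .
\]
% The loss around the current input is expanded as a Taylor series:
\[
  L(x + \Delta) = L(x) + \nabla L(x)^{\top} \Delta
  + \tfrac{1}{2}\, \Delta^{\top} H \Delta + O(\|\Delta\|^{3}),
\]
% and the abstract's claim is that the standardization above blocks the
% influence of the first-order term and most of the second-order term.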



Paperid:2218
Authors:Tsz-Chiu Au
Ulsan National Institute of Science and Technology
Abstract:
Existing works on goal recognition design (GRD) consider the underlying domain as a classical planning domain and apply modifications to the domain to minimize the worst-case distinctiveness. In this paper, we propose replacing existing modifications with blocks, which group several closely related modifications together such that a block can modify a region in a search space with respect to some design constraints. Moreover, there can be blocks within blocks, such that the design space becomes hierarchical for modifications at different levels of granularity. We present 1) a new version of pruned-reduce, a successful pruning rule for GRD, for block-level GRD, and 2) a new pruning rule for pruning some branches in both hierarchical and non-hierarchical design spaces. Our experiments show that searching in hierarchical design spaces greatly speeds up the redesign process.



Paperid:2219
Authors:Pascal Bachor, Gregor Behnke
University of Freiburg, Universiteit van Amsterdam
Abstract:
Domain learning is the task of finding an action model that can explain given observed plan executions, so-called traces. It allows us to automate the identification of actions' preconditions and effects instead of relying on hand-modeled expert knowledge. While previous research has put forth various techniques and covers multiple planning formalisms, the theoretical foundations of domain learning are still in their infancy. We investigate the most basic setting, that is, grounded classical planning without negative preconditions or conditional effects, with full observability of the state variables. The given traces are assumed to be justified in the sense that either no single action or no set of actions can be removed without violating the correctness of the plan. Furthermore, we might be given additional constraints in the form of a propositional logical formula. We show the consequences of these assumptions for the computational complexity of identifying a satisfactory planning domain.



Paperid:2220
Authors:Luigi Bonassi, Alfonso Emilio Gerevini, Enrico Scala
Università degli Studi di Brescia, Università degli Studi di Brescia, Università degli Studi di Brescia
Abstract:
This paper studies an approach to planning with PDDL3 constraints involving mixed propositional and numeric conditions, as well as metric time constraints. We show how the whole of PDDL3 with instantaneous actions can be compiled away into a numeric planning problem without PDDL3 constraints, enabling the use of any state-of-the-art numeric planner that is agnostic to the existence of PDDL3. Our solution exploits the concept of regression. In addition to a basic compilation, we present an optimized variant based on the observation that it is possible to make the compilation sensitive to the structure of the problem to solve; this can be done by reasoning on the interactions between the problem actions and the constraints. The resulting optimization substantially reduces the size of the planning task. We experimentally observe that our approach significantly outperforms existing state-of-the-art planners supporting the same class of constraints over known benchmark domains, establishing a new state-of-the-art planning system for PDDL3.



Paperid:2221
Authors:Cornelius Brand, Robert Ganian, Subrahmanyam Kalyanasundaram, Fionn Mc Inerney
Algorithms & Complexity Theory Group, Regensburg University, Germany, Algorithms and Complexity Group, TU Wien, Austria, Department of Computer Science and Engineering, IIT Hyderabad, India, Algorithms and Complexity Group, TU Wien, Austria
Abstract:
Atomic congestion games are a classic topic in network design, routing, and algorithmic game theory, and are capable of modeling congestion and flow optimization tasks in various application areas. While both the price of anarchy for such games and the computational complexity of computing their Nash equilibria are by now well-understood, the computational complexity of computing a system-optimal set of strategies - that is, a centrally planned routing that minimizes the average cost of agents - is severely understudied in the literature. We close this gap by identifying the exact boundaries of tractability for the problem through the lens of the parameterized complexity paradigm. After showing that the problem remains highly intractable even on extremely simple networks, we obtain a set of results which demonstrate that the structural parameters which control the computational (in)tractability of the problem are not vertex-separator based in nature (such as, e.g., treewidth), but rather based on edge separators. We conclude by extending our analysis towards the (even more challenging) min-max variant of the problem.



Paperid:2222
Authors:Matthew Budd, Bruno Lacerda, Nick Hawes
Oxford Robotics Institute, University of Oxford, Oxford Robotics Institute, University of Oxford, Oxford Robotics Institute, University of Oxford
Abstract:
The metareasoning framework aims to enable autonomous agents to factor in planning costs when making decisions. In this work, we develop the first non-myopic metareasoning algorithm for planning with Markov decision processes. Our method learns the behaviour of anytime probabilistic planning algorithms from performance data. Specifically, we propose a novel model for metareasoning, based on contextual performance profiles that predict the value of the planner's current solution given the time spent planning, the state of the planning algorithm's internal parameters, and the difficulty of the planning problem being solved. This model removes the need to assume that the current solution quality is always known, broadening the class of metareasoning problems that can be addressed. We then employ deep reinforcement learning to learn a policy that decides, at each timestep, whether to continue planning or start executing the current plan, and how to set hyperparameters of the planner to enhance its performance. We demonstrate our algorithm's ability to perform effective metareasoning in two domains.



Paperid:2223
Authors:Turgay Caglar, Sirine Belhaj, Tathagata Chakraborty, Michael Katz, Sarath Sreedharan
Colorado State University, Ecole Polytechnique de Tunisie, IBM Research, IBM Research, Colorado State University
Abstract:
This is the first work to look at the application of large language models (LLMs) for the purpose of model space edits in automated planning tasks. To set the stage for this union, we explore two different flavors of model space problems that have been studied in the AI planning literature and explore the effect of an LLM on those tasks. We empirically demonstrate how the performance of an LLM contrasts with combinatorial search (CS) – an approach that has been traditionally used to solve model space tasks in planning, both with the LLM in the role of a standalone model space reasoner as well as in the role of a statistical signal in concert with the CS approach as part of a two-stage process. Our experiments show promising results suggesting further forays of LLMs into the exciting world of model space reasoning for planning tasks in the future.



Paperid:2224
Authors:Matteo Cardellini, Enrico Giunchiglia, Marco Maratea
DAUIN, Politecnico di Torino, Italy DIBRIS, Universit`a di Genova, Italy, DIBRIS, Universit`a di Genova, Italy, DeMaCS, Universit`a della Calabria, Italy
Abstract:
In this paper, we propose a novel approach for solving linear numeric planning problems, called Symbolic Pattern Planning. Given a planning problem Pi, a bound n, and a pattern, defined as an arbitrary sequence of actions, we encode the problem of finding a plan for Pi with bound n as a formula with fewer variables and/or clauses than the state-of-the-art rolled-up and relaxed-relaxed-exists encodings. More importantly, we prove that for any given bound, it is never the case that the latter two encodings allow finding a valid plan while ours does not. On the experimental side, we consider 6 other planning systems, including the ones that participated in this year's International Planning Competition (IPC), and we show that our planner Patty has remarkably good comparative performance on this year's IPC problems.



Paperid:2225
Authors:Dillon Z. Chen, Sylvie Thiébaux, Felipe Trevizan
School of Computing, The Australian National University LAAS-CNRS, Université de Toulouse, School of Computing, The Australian National University LAAS-CNRS, Université de Toulouse, School of Computing, The Australian National University
Abstract:
We present three novel graph representations of planning tasks suitable for learning domain-independent heuristics using Graph Neural Networks (GNNs) to guide search. In particular, to mitigate the issues caused by large grounded GNNs we present the first method for learning domain-independent heuristics with only the lifted representation of a planning task. We also provide a theoretical analysis of the expressiveness of our models, showing that some are more powerful than STRIPS-HGN, the only other existing model for learning domain-independent heuristics. Our experiments show that our heuristics generalise to much larger problems than those in the training set, vastly surpassing STRIPS-HGN heuristics.



Paperid:2226
Authors:Kyungjin Cho, Jihun Shin, Eunjin Oh
POSTECH, POSTECH, POSTECH
Abstract:
In this paper, we present approximate distance and shortest-path oracles for fault-tolerant Euclidean spanners, motivated by the routing problem in real-world road networks. A fault-tolerant Euclidean spanner for a set of points in Euclidean space is a graph in which, despite the deletion of a small number of points, the distance between any two points in the damaged graph is an approximation of their Euclidean distance. Given a fault-tolerant Euclidean spanner and a small approximation factor, our data structure allows us to compute an approximate distance between two points in the damaged spanner in constant time when a query involves any two points and a small set of failed points. Additionally, by incorporating additional data structures, we can return a path itself in time almost linear in the length of the returned path. Both data structures require near-linear space.



Paperid:2227
Authors:Mojtaba Elahi, Jussi Rintanen
Aalto University, Aalto University
Abstract:
Most planners are based on grounding, that is, generating all instances of a parameterized action during a preprocessing phase. For some problems the number of ground actions is too high, causing a performance bottleneck. Building upon an existing approach, we present an enhanced method to split action schemas automatically during the grounding phase, to reduce the number of ground actions. First, we propose to exploit the structural knowledge of the problems to have a more informative dependency graph. Then, we suggest a better objective function to define and choose the best split. Finally, we present a more effective search to find it. We experimentally measure the impact of each of these improvements, and show that our approach significantly outperforms the state of the art.



Paperid:2228
Authors:Alfonso Emilio Gerevini, Francesco Percassi, Enrico Scala
Università degli Studi di Brescia, University of Huddersfield, Università degli Studi di Brescia
Abstract:
The paper introduces a novel polynomial compilation technique for the sound and complete removal of conditional effects in classical planning problems. Similar to Nebel's polynomial compilation of conditional effects, our solution also decomposes each action with conditional effects into several simpler actions. However, it does so more effectively by exploiting the actual structure of the given conditional effects. We characterise such a structure using a directed graph and leverage it to significantly reduce the number of additional atoms required, thereby shortening the size of valid plans. Our experimental analysis indicates that this approach enables the effective use of polynomial compilations, offering benefits in terms of modularity and reusability of existing planners. It also demonstrates that a compilation-based approach can be more efficient, either independently or in synergy with state-of-the-art optimal planners that directly support conditional effects.



Paperid:2229
Authors:Jigyasa Gupta, Shreya Sharma, Shreshth Tuli, Rohan Paul, Mausam
Indian Institute of Technology Delhi, India Samsung R&D Institute Delhi, India, Indian Institute of Technology Delhi, India, Happening Technology, UK, Indian Institute of Technology Delhi, India, Indian Institute of Technology Delhi, India
Abstract:
Our goal is to enable a robot to learn how to sequence its actions to perform high-level tasks specified as natural language instructions, given successful demonstrations from a human partner. Our novel neuro-symbolic solution GOALNET builds an iterative two-step approach that interleaves (i) inferring the next subgoal predicate implied by the language instruction, for a given world state, and (ii) synthesizing a feasible subgoal-reaching plan from that state. The agent executes the plan, and the two steps are repeated. GOALNET combines (i) learning, where dense representations are acquired for the language instruction and the world state via a neural network prediction model, enabling generalization to novel settings, and (ii) planning, where the cause-effect modeling by a classical planner eschews irrelevant predicates, facilitating multi-stage decision making in large domains. GOALNET obtains a 78% improvement in the goal-reaching rate in comparison to several state-of-the-art approaches on benchmark data with multi-stage instructions. Further, GOALNET can generalize to novel instructions for scenes with unseen objects. Source code is available at https://github.com/reail-iitd/goalnet.



Paperid:2230
Authors:Rishi Hazra, Pedro Zuidberg Dos Martires, Luc De Raedt
Örebro University, Sweden, Örebro University, Sweden, Örebro University, Sweden KU Leuven, Belgium
Abstract:
Large Language Models (LLMs) have demonstrated impressive planning abilities due to their vast "world knowledge". Yet, obtaining plans that are both feasible (grounded in affordances) and cost-effective (in plan length) remains a challenge, despite recent progress. This contrasts with heuristic planning methods that employ domain knowledge (formalized in action models such as PDDL) and heuristic search to generate feasible, optimal plans. Inspired by this, we propose to combine the power of LLMs and heuristic planning by leveraging the world knowledge of LLMs and the principles of heuristic search. Our approach, SayCanPay, employs LLMs to generate actions (Say), guided by learnable domain knowledge that evaluates the actions' feasibility (Can) and long-term reward/payoff (Pay), and uses heuristic search to select the best sequence of actions. Our contributions are (1) a novel framing of the LLM planning problem in the context of heuristic planning, (2) integrating grounding and cost-effective elements into the generated plans, and (3) using heuristic search over actions. Our extensive evaluations show that our model surpasses other LLM planning approaches.



Paperid:2231
Authors:Marcus Hoerger, Hanna Kurniawati, Dirk Kroese, Nan Ye
The University of Queensland, Australian National University, The University of Queensland, The University of Queensland
Abstract:
The Partially Observable Markov Decision Process (POMDP) provides a principled framework for decision making in stochastic partially observable environments. However, computing good solutions for problems with continuous action spaces remains challenging. To ease this challenge, we propose a simple online POMDP solver, called Lazy Cross-Entropy Search Over Policy Trees (LCEOPT). At each planning step, our method uses a novel lazy Cross-Entropy method to search the space of policy trees, which provide a simple policy representation. Specifically, we maintain a distribution on promising finite-horizon policy trees. The distribution is iteratively updated by sampling policies, evaluating them via Monte Carlo simulation, and refitting the distribution to the top-performing ones. Our method is lazy in the sense that it exploits the policy tree representation to avoid redundant computations in policy sampling, evaluation, and distribution update. This leads to computational savings of up to two orders of magnitude. Our LCEOPT is surprisingly simple as compared to existing state-of-the-art methods, yet empirically outperforms them on several continuous-action POMDP problems, particularly for problems with higher-dimensional action spaces.
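As a rough illustration of the underlying search principle, the following is a minimal cross-entropy loop over a generic real-valued policy parameterization; it is not the lazy policy-tree variant described above, and the names cross_entropy_search and evaluate, as well as the Gaussian sampling distribution, are illustrative assumptions.

import numpy as np

def cross_entropy_search(evaluate, dim, iters=50, pop=100, elite_frac=0.1, seed=0):
    # Minimal cross-entropy method: sample candidate policies, evaluate them via
    # Monte Carlo simulation, and refit the sampling distribution to the elites.
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        thetas = rng.normal(mean, std, size=(pop, dim))    # sample candidate policies
        values = np.array([evaluate(t) for t in thetas])   # estimated policy values
        elite = thetas[np.argsort(values)[-n_elite:]]      # keep the top performers
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6  # refit the distribution
    return mean

# Toy usage with a surrogate "value" whose optimum is at theta = [1, -1].
best = cross_entropy_search(lambda t: -np.sum((t - np.array([1.0, -1.0])) ** 2), dim=2)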



Paperid:2232
Authors:David Klaška, Antonín Kučera, Vojtěch Kůr, Vít Musil, Vojtěch Řehák
Masaryk University, Masaryk University, Masaryk University, Masaryk University, Masaryk University
Abstract:
Long-run average optimization problems for Markov decision processes (MDPs) require constructing policies with optimal steady-state behavior, i.e., optimal limit frequency of visits to the states. However, such policies may suffer from local instability in the sense that the frequency of states visited in a bounded time horizon along a run differs significantly from the limit frequency. In this work, we propose an efficient algorithmic solution to this problem.



Paperid:2233
Authors:Farnaz Kohankhaki, Kiarash Aghakasiri, Hongming Zhang, Ting-Han Wei, Chao Gao, Martin Müller
University of Alberta, University of Alberta Edmonton Research Center, Huawei Canada, University of Alberta, University of Alberta, Edmonton Research Center, Huawei Canada, University of Alberta
Abstract:
Monte Carlo Tree Search (MCTS) is an immensely popular search-based framework used for decision making. It is traditionally applied to domains where a perfect simulation model of the environment is available. We study and improve MCTS in the context where the environment model is given but imperfect. We show that the discrepancy between the model and the actual environment can lead to significant performance degradation with standard MCTS. We therefore develop Uncertainty Adapted MCTS (UA-MCTS), a more robust algorithm within the MCTS framework. We estimate the transition uncertainty in the given model, and direct the search towards more certain transitions in the state space. We modify all four MCTS phases to improve the search behavior by considering these estimates. We prove, in the corrupted bandit case, that adding uncertainty information to adapt UCB leads to a tighter regret bound than standard UCB. Empirically, we evaluate UA-MCTS and its individual components on the deterministic domains from the MinAtar test suite. Our results demonstrate that UA-MCTS strongly improves MCTS in the presence of model transition errors.
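To make the idea of uncertainty-aware selection concrete, the sketch below shows standard UCB1 next to a hypothetical variant that down-weights children reached through transitions the imperfect model is uncertain about; the specific (1 - uncertainty) weighting is an assumption for illustration, whereas UA-MCTS itself adapts all four MCTS phases.

import math

def ucb_score(q, n_parent, n_child, c=1.4):
    # Standard UCB1 applied to a child node's mean value q.
    return q + c * math.sqrt(math.log(n_parent) / n_child)

def uncertainty_adapted_score(q, n_parent, n_child, uncertainty, c=1.4):
    # Hypothetical adaptation: scale the score by how certain the estimated
    # model transition into this child is (uncertainty in [0, 1]).
    return (1.0 - uncertainty) * ucb_score(q, n_parent, n_child, c)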



Paperid:2234
Authors:Hai S. Le, Brendan Juba, Roni Stern
Washington University in St Louis, Washington University in St Louis, BGU
Abstract:
A common approach for solving planning problems is to model them in a formal language such as the Planning Domain Definition Language (PDDL), and then use an appropriate PDDL planner. Several algorithms for learning PDDL models from observations have been proposed, but plans created with these learned models may not be sound. We propose two algorithms for learning PDDL models that are guaranteed to be safe to use even when given observations that include partially observable states. We analyze these algorithms theoretically, characterizing the sample complexity each algorithm requires to guarantee probabilistic completeness. We also show experimentally that our algorithms are often better than FAMA, a state-of-the-art PDDL learning algorithm.



Paperid:2235
Authors:Chao Lei, Nir Lipovetzky, Krista A. Ehinger
The University of Melbourne, The University of Melbourne, The University of Melbourne
Abstract:
The Abstraction and Reasoning Corpus (ARC) is a general artificial intelligence benchmark that poses difficulties for pure machine learning methods due to its requirement for fluid intelligence with a focus on reasoning and abstraction. In this work, we introduce an ARC solver, Generalized Planning for Abstract Reasoning (GPAR). It casts an ARC problem as a generalized planning (GP) problem, where a solution is formalized as a planning program with pointers. We express each ARC problem using the standard Planning Domain Definition Language (PDDL) coupled with external functions representing object-centric abstractions. We show how to scale up GP solvers via domain knowledge specific to ARC in the form of restrictions over the action model, predicates, arguments, and the valid structure of planning programs. Our experiments demonstrate that GPAR outperforms the state-of-the-art solvers on the object-centric tasks of the ARC, showing the effectiveness of GP and the expressiveness of PDDL to model ARC problems. The challenges provided by the ARC benchmark motivate research to advance existing GP solvers and understand new relations with other planning computational models. Code is available at github.com/you68681/GPAR.



Paperid:2236
Authors:Idan Lev-Yehudi, Moran Barenboim, Vadim Indelman
Technion Autonomous Systems Program (TASP), Technion - Israel Institute of Technology, Haifa 32000, Israel, Technion Autonomous Systems Program (TASP), Technion - Israel Institute of Technology, Haifa 32000, Israel, Department of Aerospace Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel
Abstract:
Solving partially observable Markov decision processes (POMDPs) with high-dimensional and continuous observations, such as camera images, is required for many real-life robotics and planning problems. Recent research has suggested machine-learned probabilistic models as observation models, but their use is currently too computationally expensive for online deployment. We deal with the question of what would be the implication of using simplified observation models for planning, while retaining formal guarantees on the quality of the solution. Our main contribution is a novel probabilistic bound based on a statistical total variation distance of the simplified model. We show that it bounds the theoretical POMDP value w.r.t. the original model, from the empirical planned value with the simplified model, by generalizing recent results of particle-belief MDP concentration bounds. Our calculations can be separated into offline and online parts, and we arrive at formal guarantees without having to access the costly model at all during planning, which is also a novel result. Finally, we demonstrate in simulation how to integrate the bound into the routine of an existing continuous online POMDP solver.



Paperid:2237
Authors:Longkang Li, Siyuan Liang, Zihao Zhu, Chris Ding, Hongyuan Zha, Baoyuan Wu
The School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China Mohamed bin Zayed University of Artificial Intelligence, UAE, National University of Singapore, Singapore, The School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China, The School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China, The School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China, The School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
Abstract:
The permutation flow shop scheduling (PFSS) problem, aiming at finding the optimal permutation of jobs, is widely used in manufacturing systems. When solving large-scale PFSS problems, traditional optimization algorithms such as heuristics can hardly meet the demands of both solution accuracy and computational efficiency, thus learning-based methods have recently garnered more attention. Some work attempts to solve the problems by reinforcement learning methods, which suffer from slow convergence issues during training and are still not accurate enough regarding the solutions. To that end, we propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately. Moreover, in order to extract better feature representations of input jobs, we incorporate the graph structure as the encoder. The extensive experiments reveal that our proposed model achieves significant improvements and presents excellent generalizability in large-scale problems with up to 1000 jobs. Compared to the state-of-the-art reinforcement learning method, our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average. The code is available at: https://github.com/longkangli/PFSS-IL.



Paperid:2238
Authors:Ruiqi Li, Leyang Cui, Songtuan Lin, Patrik Haslum
Australian National University, Tencent AI lab, Australian National University, Australian National University
Abstract:
Domain model acquisition has been identified as a bottleneck in the application of planning technology, especially within narrative planning. Learning action models from narrative texts in an automated way is essential to overcome this barrier, but challenging because of the inherent complexities of such texts. We present an evaluation of planning domain models derived from narrative texts using our fully automated, unsupervised system, NaRuto. Our system combines structured event extraction, predictions of commonsense event relations, and textual contradictions and similarities. Evaluation results show that NaRuto generates domain models of significantly better quality than existing fully automated methods, and even sometimes on par with those created by semi-automated methods, with human assistance.



Paperid:2239
Authors:Songtuan Lin, Conny Olz, Malte Helmert, Pascal Bercher
The Australian National University, Ulm University, University of Basel, The Australian National University
Abstract:
In this paper we study the computational complexity of several reasoning tasks centered around the bounded plan existence problem. We do this for standard classical planning and hierarchical task network (HTN) planning, each for a grounded and a lifted representation. Whereas the complexity of bounded plan existence is known for classical planning, it has not yet been studied for HTN planning. For plan verification, results were available for both formalisms except for lifted HTN planning. We present lower and upper bounds on the complexity of plan verification in lifted HTN planning and provide novel insights into its grounded counterpart, in which we show that verification is NP-complete not just in the general case, but already for a severely restricted special case. Finally, we show the complexity of verifying the optimality of a given plan and discuss its connection to the bounded plan existence problem.



Paperid:2240
Authors:Christian Muise, Sheila A. McIlraith, J. Christopher Beck
Queen's University Vector Institute for Artificial Intelligence,, University of Toronto Vector Institute for Artificial Intelligence,, University of Toronto
Abstract:
Fully Observable Non-Deterministic (FOND) planning is a variant of classical symbolic planning in which actions are nondeterministic, with an action's outcome known only upon execution. It is a popular planning paradigm with applications ranging from robot planning to dialogue-agent design and reactive synthesis. Over the last 20 years, a number of approaches to FOND planning have emerged. In this work, we establish a new state of the art, following in the footsteps of some of the most powerful FOND planners to date. Our planner, PR2, decisively outperforms the four leading FOND planners, at times by a large margin, in 17 of 18 domains that represent a comprehensive benchmark suite. Ablation studies demonstrate the impact of various techniques we introduce, with the largest improvement coming from our novel FOND-aware heuristic.



Paperid:2241
Authors:Stefan Panjkovic, Andrea Micheli
Fondazione Bruno Kessler University of Trento, Fondazione Bruno Kessler
Abstract:
Given the model of a system with explicit temporal constraints, optimal temporal planning is the problem of finding a schedule of actions that achieves a certain goal while optimizing an objective function. Recent approaches for optimal planning reduce the problem to a series of queries to an Optimization Modulo Theory (OMT) solver: each query encodes a bounded version of the problem, with additional abstract actions representing an over-approximation of the plans beyond the bound. This technique suffers from performance issues, mainly due to the looseness of the over-approximation, which can include many non-executable plans. In this paper, we propose a refined abstraction for solving optimal temporal planning via OMT by introducing abstract scheduling constraints, which have a double purpose. First, they enforce a partial ordering of abstract actions based on mutual dependencies between them, which leads to a better makespan estimation and allows optimality to be proven sooner. Second, they implicitly forbid circular self-enabling of abstract actions, which is a common cause of spurious models that severely affects performance in existing approaches. We prove the soundness and completeness of the resulting approach and empirically demonstrate its superiority with respect to the state of the art.



Paperid:2242
Authors:Alberto Pozanco, Ramon Fraga Pereira, Daniel Borrajo
J.P. Morgan AI Research, University of Manchester, J.P. Morgan AI Research
Abstract:
In Environment Design, one interested party seeks to affect another agent's decisions by applying changes to the environment. Most research on planning environment (re)design assumes the interested party's objective is to facilitate the recognition of goals and plans, and searches over the space of environment modifications to find the minimal set of changes that simplify those tasks and optimise a particular metric. This search space is usually intractable, so existing approaches devise metric-dependent pruning techniques for performing search more efficiently. This results in approaches that are not able to generalise across different objectives and/or metrics. In this paper, we argue that the interested party could have objectives and metrics that are not necessarily related to recognising agents' goals or plans. Thus, to generalise the task of Planning Environment Redesign, we develop a general environment redesign approach that is metric-agnostic and leverages recent research on top-quality planning to efficiently redesign planning environments according to any interested party's objective and metric. Experiments over a set of environment redesign benchmarks show that our general approach outperforms existing approaches when using well-known metrics, such as facilitating the recognition of goals, and demonstrate its effectiveness when solving environment redesign tasks that optimise a novel set of different metrics.



Paperid:2243
Authors:Martín Pozo, Alvaro Torralba, Carlos Linares Lopez
Universidad Carlos III de Madrid, Aalborg University, Universidad Carlos III de Madrid
Abstract:
Counterexample-Guided Abstraction Refinement (CEGAR) is a prominent technique to generate Cartesian abstractions for guiding search in cost-optimal planning. The core idea is to iteratively refine the abstraction by finding a flaw of the current optimal abstract plan. All existing approaches find these flaws by executing the abstract plan using progression in the original state space. Instead, we propose to do backward refinements by using regression from the goals. This results in a new type of flaw that can identify invalid plan suffixes. The resulting abstractions are less focused on the initial state, but more informative on average, significantly improving the performance of current CEGAR-based techniques. Furthermore, they can be combined with forward refinements in several bidirectional strategies that provide the benefits of both methods.



Paperid:2244
Authors:Johannes Schmalz, Felipe Trevizan
Australian National University, Australian National University
Abstract:
Current methods for solving Stochastic Shortest Path Problems (SSPs) find states’ costs-to-go by applying Bellman backups, where state-of-the-art methods employ heuristics to select states to back up and prune. A fundamental limitation of these algorithms is their need to compute the cost-to-go for every applicable action during each state backup, leading to unnecessary computation for actions identified as sub-optimal. We present new connections between planning and operations research and, using this framework, we address this issue of unnecessary computation by introducing an efficient version of constraint generation for SSPs. This technique allows algorithms to ignore sub-optimal actions and avoid computing their costs-to-go. We also apply our novel technique to iLAO* resulting in a new algorithm, CG-iLAO*. Our experiments show that CG-iLAO* ignores up to 57% of iLAO*’s actions and it solves problems up to 8x and 3x faster than LRTDP and iLAO*.



Paperid:2245
Authors:Tom Silver, Soham Dan, Kavitha Srinivas, Joshua B. Tenenbaum, Leslie Kaelbling, Michael Katz
MIT Computer Science and Artificial Intelligence Laboratory, IBM Research, IBM Research, MIT Computer Science and Artificial Intelligence Laboratory, MIT Computer Science and Artificial Intelligence Laboratory, IBM Research
Abstract:
Recent work has considered whether large language models (LLMs) can function as planners: given a task, generate a plan. We investigate whether LLMs can serve as generalized planners: given a domain and training tasks, generate a program that efficiently produces plans for other tasks in the domain. In particular, we consider PDDL domains and use GPT-4 to synthesize Python programs. We also consider (1) Chain-of-Thought (CoT) summarization, where the LLM is prompted to summarize the domain and propose a strategy in words before synthesizing the program; and (2) automated debugging, where the program is validated with respect to the training tasks, and in case of errors, the LLM is re-prompted with four types of feedback. We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines. Overall, we find that GPT-4 is a surprisingly powerful generalized planner. We also conclude that automated debugging is very important, that CoT summarization has non-uniform impact, that GPT-4 is far superior to GPT-3.5, and that just two training tasks are often sufficient for strong generalization.
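The generate-validate-debug loop can be pictured roughly as follows; this is a hedged sketch in which llm is any text-to-text callable, validate is a hypothetical checker that runs the synthesized program on a training task, and the prompts and feedback are far simpler than those actually used with GPT-4.

def synthesize_generalized_planner(llm, domain_text, train_tasks, validate, max_rounds=4):
    # llm: text -> text; validate(program, task) -> (ok, feedback). Both are
    # placeholders standing in for the LLM API and the plan validator.
    prompt = ("Summarize the domain, propose a strategy, then write a Python "
              "function plan(task) that returns a list of actions.\n" + domain_text)
    program = llm(prompt)
    for _ in range(max_rounds):
        failures = [fb for task in train_tasks
                    for ok, fb in [validate(program, task)] if not ok]
        if not failures:
            return program  # all training tasks solved
        # Automated debugging: re-prompt the LLM with the collected error feedback.
        program = llm(prompt + "\nPrevious attempt:\n" + program +
                      "\nErrors:\n" + "\n".join(failures))
    return program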



Paperid:2246
Authors:Jiwoo Son, Minsu Kim, Sanghyeok Choi, Hyeonah Kim, Jinkyoo Park
Korea Advanced Institute of Science and Technology (KAIST) Omelet, Korea Advanced Institute of Science and Technology (KAIST), Korea Advanced Institute of Science and Technology (KAIST), Korea Advanced Institute of Science and Technology (KAIST), Korea Advanced Institute of Science and Technology (KAIST) Omelet
Abstract:
Min-max routing problems aim to minimize the maximum tour length among multiple agents as they collaboratively visit all cities, i.e., the completion time. These problems include impactful real-world applications but are known to be NP-hard. Existing methods face challenges, particularly in large-scale problems that require the coordination of numerous agents to cover thousands of cities. This paper proposes Equity-Transformer to solve large-scale min-max routing problems. First, we model min-max routing problems as sequential planning, reducing the complexity and enabling the use of a powerful Transformer architecture. Second, we propose key inductive biases that ensure equitable workload distribution among agents. The effectiveness of Equity-Transformer is demonstrated through its superior performance in two representative min-max routing tasks: the min-max multi-agent traveling salesman problem (min-max mTSP) and the min-max multi-agent pick-up and delivery problem (min-max mPDP). Notably, our method achieves significant reductions in runtime (approximately 335 times) and in cost values (about 53%) compared to a competitive heuristic (LKH3) in the case of 100 vehicles with 1,000 cities for mTSP. We provide reproducible source code: https://github.com/kaist-silab/equity-transformer.



Paperid:2247
Authors:Yubin Xiao, Di Wang, Boyang Li, Mingzhao Wang, Xuan Wu, Changliang Zhou, You Zhou
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, China, Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly, Nanyang Technological University, Singapore WeBank-NTU Joint Research Institute on Fintech, Nanyang Technological University, Singapore, School of Computer Science and Engineering, Nanyang Technological University, Singapore, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, China, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, China, School of System Design and Intelligent Manufacturing, Southern University of Science and Technology, China, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, China
Abstract:
Neural construction models have shown promising performance for Vehicle Routing Problems (VRPs) by adopting either the Autoregressive (AR) or Non-Autoregressive (NAR) learning approach. While AR models produce high-quality solutions, they generally have a high inference latency due to their sequential generation nature. Conversely, NAR models generate solutions in parallel with a low inference latency but generally exhibit inferior performance. In this paper, we propose a generic Guided Non-Autoregressive Knowledge Distillation (GNARKD) method to obtain high-performance NAR models with a low inference latency. GNARKD removes the constraint of sequential generation in AR models while preserving the learned pivotal components in the network architecture to obtain the corresponding NAR models through knowledge distillation. We evaluate GNARKD by applying it to three widely adopted AR models to obtain NAR VRP solvers for both synthesized and real-world instances. The experimental results demonstrate that GNARKD significantly reduces the inference time (4-5 times faster) with an acceptable performance drop (2-3%). To the best of our knowledge, this study is the first to obtain NAR VRP solvers from AR ones through knowledge distillation.



Paperid:2248
Authors:Haoran Ye, Jiarui Wang, Helan Liang, Zhiguang Cao, Yong Li, Fanzhang Li
Soochow University, China, Soochow University, China, Soochow University, China, Singapore Management University, Singapore, Tsinghua University, China, Soochow University, China
Abstract:
Recent end-to-end neural solvers have shown promise for small-scale routing problems but suffer from limited real-time scaling-up performance. This paper proposes GLOP (Global and Local Optimization Policies), a unified hierarchical framework that efficiently scales toward large-scale routing problems. GLOP hierarchically partitions large routing problems into Travelling Salesman Problems (TSPs) and TSPs into Shortest Hamiltonian Path Problems. For the first time, we hybridize non-autoregressive neural heuristics for coarse-grained problem partitions and autoregressive neural heuristics for fine-grained route constructions, leveraging the scalability of the former and the meticulousness of the latter. Experimental results show that GLOP achieves competitive and state-of-the-art real-time performance on large-scale routing problems, including TSP, ATSP, CVRP, and PCTSP. Our code is available at: https://github.com/henry-yeh/GLOP.



Paperid:2249
Authors:Keyuan Zhang, Zhongdong Liu, Nakjung Choi, Bo Ji
Virginia Tech, Virginia Tech, Nokia Bell Labs, Virginia Tech
Abstract:
In this paper, we study the two-level ski-rental problem, where a user needs to fulfill a sequence of demands for multiple items by choosing one of the three payment options: paying for the on-demand usage (i.e., rent), buying individual items (i.e., single purchase), and buying all the items (i.e., combo purchase). Without knowing future demands, the user aims to minimize the total cost (i.e., the sum of the rental, single purchase, and combo purchase costs) by balancing the trade-off between the expensive upfront costs (for purchase) and the potential future expenses (for rent). We first design a robust online algorithm (RDTSR) that offers a worst-case performance guarantee. While online algorithms are robust against the worst-case scenarios, they are often overly cautious and thus suffer a poor average performance in typical scenarios. On the other hand, Machine Learning (ML) algorithms typically show promising average performance in various applications but lack worst-case performance guarantees. To harness the benefits of both methods, we develop a learning-augmented algorithm (LADTSR) by integrating ML predictions into the robust online algorithm, which outperforms the robust online algorithm under accurate predictions while ensuring worst-case performance guarantees even when predictions are inaccurate. Finally, we conduct numerical experiments on both synthetic and real-world trace data to corroborate the effectiveness of our approach.
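For intuition, the textbook single-item ski-rental rule that this setting generalizes can be sketched as follows; the function name and the break-even policy shown are only the classical 2-competitive special case, not the paper's two-level algorithm with single and combo purchases.

def break_even_rent_or_buy(num_demands, rent_cost, buy_cost):
    # Classic ski-rental rule: keep renting until the accumulated rent would
    # reach the purchase price, then buy. Returns the total cost paid online.
    total, bought = 0.0, False
    for _ in range(num_demands):
        if bought:
            continue                      # item already owned, no further cost
        if total + rent_cost >= buy_cost:
            total += buy_cost             # buying now is no worse than renting on
            bought = True
        else:
            total += rent_cost            # renting is still cheaper overall
    return total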



Paperid:2250
Authors:Amir Mohammad Abouei, Ehsan Mokhtarian, Negar Kiyavash
EPFL, EPFL, EPFL
Abstract:
Causal inference in a sub-population involves identifying the causal effect of an intervention on a specific subgroup, which is distinguished from the whole population through the influence of systematic biases in the sampling process. However, ignoring the subtleties introduced by sub-populations can either lead to erroneous inference or limit the applicability of existing methods. We introduce and advocate for a causal inference problem in sub-populations (henceforth called s-ID), in which we merely have access to observational data of the targeted sub-population (as opposed to the entire population). Existing inference problems in sub-populations operate on the premise that the given data distributions originate from the entire population and thus cannot tackle the s-ID problem. To address this gap, we provide necessary and sufficient conditions that must hold in the causal graph for a causal effect in a sub-population to be identifiable from the observational distribution of that sub-population. Given these conditions, we present a sound and complete algorithm for the s-ID problem.



Paperid:2251
Authors:Ziqiao Ao, Jinglai Li
University of Birmingham, University of Birmingham
Abstract:
Bayesian Experimental Design (BED), which aims to find the optimal experimental conditions for Bayesian inference, is usually posed as optimizing the expected information gain (EIG). The gradient information is often needed for efficient EIG optimization, and as a result the ability to estimate the gradient of EIG is essential for BED problems. The primary goal of this work is to develop methods for estimating the gradient of EIG, which, combined with stochastic gradient descent algorithms, result in efficient optimization of EIG. Specifically, we first introduce a posterior expected representation of the EIG gradient with respect to the design variables. Based on this, we propose two methods for estimating the EIG gradient: UEEG-MCMC, which leverages posterior samples generated through Markov Chain Monte Carlo (MCMC) to estimate the EIG gradient, and BEEG-AP, which focuses on achieving high simulation efficiency by repeatedly using parameter samples. Theoretical analysis and numerical studies illustrate that UEEG-MCMC is robust against the actual EIG value, while BEEG-AP is more efficient when the EIG value to be optimized is small. Moreover, both methods show superior performance compared to several popular benchmarks in our numerical experiments.



Paperid:2252
Authors:Christel Baier, Roxane van den Bossche, Sascha Klüppelholz, Johannes Lehmann, Jakob Piribauer
TU Dresden, Germany Centre for Tactile Internet with Human-in-the-Loop (CeTI), Université Paris-Saclay, ENS Paris-Saclay, France, TU Dresden, Germany, TU Dresden, Germany Centre for Tactile Internet with Human-in-the-Loop (CeTI), TU Dresden, Germany
Abstract:
To improve reliability and the understanding of AI systems, there is increasing interest in the use of formal methods, e.g. model checking. Model checking tools produce a counterexample when a model does not satisfy a property. Understanding these counterexamples is critical for efficient debugging, as it allows the developer to focus on the parts of the program that caused the issue. To this end, we present a new technique that ascribes a responsibility value to each state in a transition system that does not satisfy a given safety property. The value is higher if the nondeterministic choices in a state have more power to change the outcome, given the behaviour observed in the counterexample. For this, we employ a concept from cooperative game theory – namely general power indices, such as the Shapley value – to compute the responsibility of the states. We present an optimistic and pessimistic version of responsibility that differ in how they treat the states that do not lie on the counterexample. We give a characterisation of optimistic responsibility that leads to an efficient algorithm for it and show computational hardness of the pessimistic version. We also present a tool to compute responsibility and show how a stochastic algorithm can be used to approximate responsibility in larger models. These methods can be deployed in the design phase, at runtime and at inspection time to gain insights on causal relations within the behavior of AI systems.
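Since the responsibility values build on general power indices, the following is a small self-contained sketch of the Shapley value computed by subset enumeration; the mapping from coalitions of states to payoffs (value) is a hypothetical placeholder, and the optimistic/pessimistic variants and the stochastic approximation described above are not shown.

from itertools import combinations
from math import factorial

def shapley_values(players, value):
    # Standard Shapley value by enumeration: value(frozenset of players) -> payoff.
    # In the responsibility setting, players would correspond to states of the
    # transition system; here the characteristic function is left abstract.
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi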



Paperid:2253
Authors:Amitay Bar, Rotem Mulayoff, Tomer Michaeli, Ronen Talmon
Technion, Technion, Technion, Technion
Abstract:
Langevin dynamics (LD) is widely used for sampling from distributions and for optimization. In this work, we derive a closed-form expression for the expected loss of preconditioned LD near stationary points of the objective function. We use the fact that at the vicinity of such points, LD reduces to an Ornstein–Uhlenbeck process, which is amenable to convenient mathematical treatment. Our analysis reveals that when the preconditioning matrix satisfies a particular relation with respect to the noise covariance, LD's expected loss becomes proportional to the rank of the objective's Hessian. We illustrate the applicability of this result in the context of neural networks, where the Hessian rank has been shown to capture the complexity of the predictor function but is usually computationally hard to probe. Finally, we use our analysis to compare SGD-like and Adam-like preconditioners and identify the regimes under which each of them leads to a lower expected loss.



Paperid:2254
Authors:Ben Berger, Tomer Ezra, Michal Feldman, Federico Fusco
Offchain Labs, Inc., Simons Laufer Mathematical Sciences Institute, Tel Aviv University Microsoft ILDC, Sapienza University of Rome
Abstract:
Pandora’s problem is a fundamental model that studies optimal search under costly inspection. In the classic version, there are n boxes, each associated with a known cost and a known distribution over values. A strategy inspects the boxes sequentially and obtains a utility that equals the difference between the maximum value of an inspected box and the total inspection cost. Weitzman (1979) presented a surprisingly simple strategy that obtains the optimal expected utility. In this work we introduce a new variant of Pandora’s problem in which every box is also associated with a publicly known deadline, indicating the final round by which its value may be chosen. This model captures many real-life scenarios where alternatives admit deadlines, such as candidate interviews and college admissions. Our main result is an efficient threshold-based strategy that achieves a constant approximation relative to the performance of the optimal strategy for the deadlines setting.



Paperid:2255
Authors:Soumia Boucherouite, Grigory Malinovsky, Peter Richtárik, El Houcine Bergou
College of Computing, Mohammed VI Polytechnic University, King Abdullah University of Science and Technology, King Abdullah University of Science and Technology, College of Computing, Mohammed VI Polytechnic University
Abstract:
We present a new zero-order optimization method called Minibatch Stochastic Three Points (MiSTP), specifically designed to solve stochastic unconstrained minimization problems when only an approximate evaluation of the objective function is possible. MiSTP is an extension of the Stochastic Three Point Method (STP). The key innovation of MiSTP is that it selects the next point solely based on the objective function approximation, without relying on its exact evaluation. At each iteration, MiSTP generates a random search direction and compares the approximations of the objective function at the current point and at points along the randomly generated direction and its opposite. The best of these three points is chosen as the next iterate. We analyze the worst-case complexity of MiSTP in the convex and non-convex cases and demonstrate that it matches the most accurate complexity bounds known in the literature for zero-order optimization methods. We perform extensive numerical evaluations to assess the computational efficiency of MiSTP and compare its performance to other state-of-the-art methods by testing it on several machine learning tasks. The results show that MiSTP outperforms or matches state-of-the-art methods, indicating its potential for a wide range of practical applications.
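A single iteration of the three-point rule can be sketched as follows; f_approx stands for the minibatch approximation of the objective, and the normalized Gaussian direction is an assumption made for illustration.

import numpy as np

def mistp_step(f_approx, x, alpha, rng):
    # One (minibatch) stochastic three-points step: compare the approximate
    # objective at x, x + alpha*d and x - alpha*d for a random direction d,
    # and keep whichever of the three points looks best.
    d = rng.standard_normal(x.shape)
    d /= np.linalg.norm(d)
    candidates = [x, x + alpha * d, x - alpha * d]
    return min(candidates, key=f_approx)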



Paperid:2256
Authors:Wei Chen, Zhiyi Huang, Ruichu Cai, Zhifeng Hao, Kun Zhang
School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China Peng Cheng Laboratory, Shenzhen, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China College of Science, Shantou University, Shantou, China, Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA, United States Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Abstract:
Causal discovery with latent variables is a crucial but challenging task. Despite the emergence of numerous methods aimed at addressing this challenge, they cannot fully identify the structure in which two observed variables are influenced by one latent variable and there might also be a directed edge between them. Interestingly, we notice that this structure can be identified through the utilization of higher-order cumulants. By leveraging the higher-order cumulants of non-Gaussian data, we provide an analytical solution for estimating the causal coefficients or their ratios. With the estimated (ratios of) causal coefficients, we propose a novel approach to identify the existence of a causal edge between two observed variables subject to latent variable influence. In case such a causal edge exists, we introduce an asymmetry criterion to determine the causal direction. The experimental results demonstrate the effectiveness of our proposed method.



Paperid:2257
Authors:Adam D. Cobb, Brian Matejek, Daniel Elenius, Anirban Roy, Susmit Jha
SRI International, SRI International, SRI International, SRI International, SRI International
Abstract:
We introduce a new amortized likelihood ratio estimator for likelihood-free simulation-based inference (SBI). Our estimator is simple to train and estimates the likelihood ratio using a single forward pass of the neural estimator. Our approach directly computes the likelihood ratio between two competing parameter sets, which is different from the previous approach of comparing two neural network output values. We refer to our model as the direct neural ratio estimator (DNRE). As part of introducing the DNRE, we derive a corresponding Monte Carlo estimate of the posterior. We benchmark our new ratio estimator and compare to previous ratio estimators in the literature. We show that our new ratio estimator often outperforms these previous approaches. As a further contribution, we introduce a new derivative estimator for likelihood ratio estimators that enables us to compare likelihood-free Hamiltonian Monte Carlo (HMC) with random-walk Metropolis-Hastings (MH). We show that HMC is equally competitive, which has not been previously shown. Finally, we include a novel real-world application of SBI by using our neural ratio estimator to design a quadcopter. Code is available at https://github.com/SRI-CSL/dnre.



Paperid:2258
Authors:Longchao Da, Porter Jenkins, Trevor Schwantes, Jeffrey Dotson, Hua Wei
Arizona State University, Brigham Young University, Brigham Young University, Brigham Young University, Arizona State University
Abstract:
In practice, it is essential to compare and rank candidate policies offline before real-world deployment for safety and reliability. Prior work seeks to solve this offline policy ranking (OPR) problem through value-based methods, such as off-policy evaluation (OPE). However, they fail to analyze special case performance (e.g., worst or best cases), due to the lack of holistic characterization of policies’ performance. It is even more difficult to estimate precise policy values when the reward is not fully accessible under sparse settings. In this paper, we present Probabilistic Offline Policy Ranking (POPR), a framework to address OPR problems by leveraging expert data to characterize the probability of a candidate policy behaving like experts, and approximating its entire performance posterior distribution to help with ranking. POPR does not rely on value estimation, and the derived performance posterior can be used to distinguish candidates in worst-, best-, and average-cases. To estimate the posterior, we propose POPR-EABC, an Energy-based Approximate Bayesian Computation (ABC) method conducting likelihood-free inference. POPR-EABC reduces the heuristic nature of ABC by a smooth energy function, and improves the sampling efficiency by a pseudo-likelihood. We empirically demonstrate that POPR-EABC is adequate for evaluating policies in both discrete and continuous action spaces across various experiment environments, and facilitates probabilistic comparisons of candidate policies before deployment.



Paperid:2259
Authors:Julien Fageot, Sadegh Farhadkhani, Lê-Nguyên Hoang, Oscar Villemaud
Tournesol Association EPFL, EPFL, Tournesol Association Calicarpa, Tournesol Association EPFL
Abstract:
Many applications, e.g. in content recommendation, sports, or recruitment, leverage the comparisons of alternatives to score those alternatives. The classical Bradley-Terry model and its variants have been widely used to do so. The historical model considers binary comparisons (victory/defeat) between alternatives, while more recent developments allow finer comparisons to be taken into account. In this article, we introduce a probabilistic model encompassing a broad variety of paired comparisons that can take discrete or continuous values. We do so by considering a well-behaved subset of the exponential family, which we call the family of generalized Bradley-Terry (GBT) models, as it includes the classical Bradley-Terry model and many of its variants. Remarkably, we prove that all GBT models are guaranteed to yield a strictly convex negative log-likelihood. Moreover, assuming a Gaussian prior on alternatives' scores, we prove that the maximum a posteriori (MAP) of GBT models, whose existence, uniqueness and fast computation are thus guaranteed, varies monotonically with respect to comparisons (the more A beats B, the better the score of A) and is Lipschitz-resilient with respect to each new comparison (a single new comparison can only have a bounded effect on all the estimated scores). These desirable properties make GBT models appealing for practical use. We illustrate some features of GBT models on simulations.
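As a point of reference for the classical special case, the sketch below fits MAP scores for the binary Bradley-Terry model with an independent Gaussian prior by gradient ascent on the log-posterior; the learning rate, prior variance and the plain full-batch loop are illustrative assumptions, and the GBT family covers far more general comparison values.

import numpy as np

def bradley_terry_map(n_items, comparisons, prior_var=1.0, lr=0.1, iters=500):
    # comparisons: list of (winner, loser) index pairs. Gradient ascent on the
    # log-posterior of the classical Bradley-Terry model with a N(0, prior_var) prior.
    theta = np.zeros(n_items)
    for _ in range(iters):
        grad = -theta / prior_var                            # Gaussian prior term
        for w, l in comparisons:
            p_w = 1.0 / (1.0 + np.exp(theta[l] - theta[w]))  # P(w beats l)
            grad[w] += 1.0 - p_w
            grad[l] -= 1.0 - p_w
        theta += lr * grad
    return theta

scores = bradley_terry_map(3, [(0, 1), (0, 2), (1, 2), (0, 1)])  # toy usage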



Paperid:2260
Authors:Simon Ferreira, Charles K. Assaad
EasyVista ENS de Lyon, EasyVista
Abstract:
Dynamic structural causal models (SCMs) are a powerful framework for reasoning in dynamic systems about direct effects, which measure how a change in one variable affects another variable while holding all other variables constant. The causal relations in a dynamic structural causal model can be qualitatively represented with an acyclic full-time causal graph. Assuming linearity and no hidden confounding, and given the full-time causal graph, the direct causal effect is always identifiable. However, in many applications such a graph is not available for various reasons, but experts nevertheless have access to the summary causal graph of the full-time causal graph, which represents causal relations between time series while omitting temporal information and allowing cycles. This paper presents a complete identifiability result which characterizes all cases for which the direct effect is graphically identifiable from a summary causal graph and gives two sound finite adjustment sets that can be used to estimate the direct effect whenever it is identifiable.



Paperid:2261
Authors:Andreas Goral, Joachim Giesen, Mark Blacher, Christoph Staudt, Julien Klaus
Friedrich Schiller University Jena, Friedrich Schiller University Jena, Friedrich Schiller University Jena, Friedrich Schiller University Jena, Friedrich Schiller University Jena
Abstract:
Many decision and optimization problems have natural extensions as counting problems. The best known example is the Boolean satisfiability problem (SAT), where we want to count the satisfying assignments of truth values to the variables, which is known as the #SAT problem. Likewise, for discrete optimization problems, we want to count the states on which the objective function attains the optimal value. Both SAT and discrete optimization can be formulated as selective marginalize-a-product-function (MPF) queries. Here, we show how general selective MPF queries can be extended for model counting. MPF queries are encoded as tensor hypernetworks over suitable semirings that can be solved by generic tensor hypernetwork contraction algorithms. Our model counting extension is again an MPF query, on an extended semiring, that can be solved by the same contraction algorithms. Model counting is required for uniform model sampling. We show how the counting extension can be further extended for model sampling by constructing yet another semiring. We have implemented the model counting and sampling extensions. Experiments show that our generic approach is competitive with the state of the art in model counting and model sampling.
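To make the semiring view concrete, the following toy example counts the models of (x1 OR x2) AND (NOT x1 OR x3) by contracting 0/1 clause tensors in the ordinary sum-product (counting) semiring; the formula, the dense encoding and the use of numpy.einsum are illustrative assumptions, whereas the paper works with generic tensor hypernetwork contraction over extended semirings.

import numpy as np

# One 0/1 tensor per clause: entry is 1 iff the clause is satisfied.
c1 = np.ones((2, 2)); c1[0, 0] = 0   # clause (x1 OR x2) fails only for x1=0, x2=0
c2 = np.ones((2, 2)); c2[1, 0] = 0   # clause (NOT x1 OR x3) fails only for x1=1, x3=0

# Contracting over all variable indices in the sum-product semiring yields #SAT.
count = np.einsum('ab,ac->', c1, c2)  # indices: a=x1, b=x2, c=x3
assert count == 4                     # the formula has exactly 4 models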



Paperid:2262
Authors:Aaryan Gupta, Markus Bläser
Indian Institute of Technology Bombay, Saarland University
Abstract:
Linear structural causal models (SCMs) are used to express and analyze the relationships between random variables. Direct causal effects are represented as directed edges and confounding factors as bidirected edges. Identifying the causal parameters from correlations between the nodes is an open problem in artificial intelligence. In this paper, we study SCMs whose directed component forms a tree. Van der Zander et al. give a PSPACE algorithm for the identification problem in this case, which is a significant improvement over the general Gröbner basis approach, which has doubly-exponential time complexity in the number of structural parameters. However, they do not show that their algorithm is complete. In this work, we present a randomized polynomial-time algorithm, which solves the identification problem for tree-shaped SCMs. For every structural parameter, our algorithm decides whether it is generically identifiable, generically 2-identifiable, or generically unidentifiable. (No other cases can occur.) In the first two cases, it provides one or two fractional affine square root terms of polynomials (FASTPs) for the corresponding parameter, respectively. In particular, our algorithm is not only polynomial time, but also complete for tree-shaped SCMs.



Paperid:2263
Authors:Margot Herin, Patrice Perny, Nataliya Sokolovska
LIP6 Sorbonne University, Sorbonne University, Sorbonne University
Abstract:
We propose an approach to learn a multi-attribute utility function to model, explain or predict the value system of a Decision Maker. The main challenge of the modelling task is to describe human values and preferences in the presence of interacting attributes while keeping the utility function as simple as possible. We focus on the generalized additive decomposable utility model which allows interactions between attributes while preserving some additive decomposability of the evaluation model. We present a learning approach able to identify the factors of interacting attributes and to learn the utility functions defined on these factors. This approach relies on the determination of a sparse representation of the ANOVA decomposition of the multi-attribute utility function using multiple kernel learning. It applies to both continuous and discrete attributes. Numerical tests are performed to demonstrate the practical efficiency of the learning approach.



Paperid:2264
Authors:Shunsuke Horii, Yoichi Chikahara
Waseda University, NTT
Abstract:
Estimating heterogeneous treatment effects across individuals has attracted growing attention as a statistical tool for performing critical decision-making. We propose a Bayesian inference framework that quantifies the uncertainty in treatment effect estimation to support decision-making in a relatively small sample size setting. Our proposed model places Gaussian process priors on the nonparametric components of a semiparametric model called a partially linear model. This model formulation has three advantages. First, we can analytically compute the posterior distribution of a treatment effect without relying on computationally demanding posterior approximation. Second, we can guarantee that the posterior distribution concentrates around the true one as the sample size goes to infinity. Third, we can incorporate prior knowledge about a treatment effect into the prior distribution, improving the estimation efficiency. Our experimental results show that even in the small sample size setting, our method can accurately estimate the heterogeneous treatment effects and effectively quantify their estimation uncertainty.



Paperid:2265
Authors:Hao Huang, Qian Yan, Keqi Han, Ting Gan, Jiawei Jiang, Quanqing Xu, Chuanhui Yang
Wuhan University, Wuhan University, Wuhan University, Wuhan University, Wuhan University, OceanBase, OceanBase
Abstract:
To infer a diffusion network based on observations from historical diffusion processes, existing approaches assume that observation data contain exact occurrence time of each node infection, or at least the eventual infection statuses of nodes in each diffusion process. They determine potential influence relationships between nodes by identifying frequent sequences, or statistical correlations, among node infections. In some real-world settings, such as the spread of epidemics, tracing exact infection times is often infeasible due to a high cost; even obtaining precise infection statuses of nodes is a challenging task, since observable symptoms such as headache only partially reveal a node’s true status. In this work, we investigate how to effectively infer a diffusion network from observation data with uncertainty. Provided with only probabilistic information about node infection statuses, we formulate the problem of diffusion network inference as a constrained nonlinear regression w.r.t. the probabilistic data. An alternating maximization method is designed to solve this regression problem iteratively, and the improvement of solution quality in each iteration can be theoretically guaranteed. Empirical studies are conducted on both synthetic and real-world networks, and the results verify the effectiveness and efficiency of our approach.



Paperid:2266
Authors:Wen Huang, Xintao Wu
University of Arkansas, University of Arkansas
Abstract:
This paper studies bandit problems where an agent has access to offline data that might be utilized to potentially improve the estimation of each arm’s reward distribution. A major obstacle in this setting is the existence of compound biases from the observational data. Ignoring these biases and blindly fitting a model with the biased data could even negatively affect the online learning phase. In this work, we formulate this problem from a causal perspective. First, we categorize the biases into confounding bias and selection bias based on the causal structure they imply. Next, we extract the causal bound for each arm that is robust towards compound biases from biased observational data. The derived bounds contain the ground truth mean reward and can effectively guide the bandit agent to learn a nearly optimal decision policy. We also conduct regret analysis in both contextual and non-contextual bandit settings and show that prior causal bounds could help consistently reduce the asymptotic regret.



Paperid:2267
Authors:Dongyan (Lucy) Huo, Yudong Chen, Qiaomin Xie
Cornell University, University of Wisconsin-Madison, University of Wisconsin-Madison
Abstract:
In this paper, we study the effectiveness of using a constant stepsize in statistical inference via linear stochastic approximation (LSA) algorithms with Markovian data. After establishing a Central Limit Theorem (CLT), we outline an inference procedure that uses averaged LSA iterates to construct confidence intervals (CIs). Our procedure leverages the fast mixing property of constant-stepsize LSA for better covariance estimation and employs Richardson-Romberg (RR) extrapolation to reduce the bias induced by constant stepsize and Markovian data. We develop theoretical results for guiding stepsize selection in RR extrapolation, and identify several important settings where the bias provably vanishes even without extrapolation. We conduct extensive numerical experiments and compare against classical inference approaches. Our results show that using a constant stepsize enjoys easy hyperparameter tuning, fast convergence, and consistently better CI coverage, especially when data is limited.
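
As a rough illustration of the Richardson-Romberg idea described above, the following sketch runs tail-averaged scalar LSA driven by Markovian (AR(1)) noise at two constant stepsizes and combines the averages to cancel the leading stepsize-induced bias. It is only a toy version of the paper's setting: the recursion, stepsizes, noise scales, and iteration counts are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
a_bar, b_bar = 2.0, 1.0
theta_star = b_bar / a_bar        # target of the LSA recursion a_bar * theta = b_bar
rho = 0.9                         # autocorrelation of the Markovian noise

def lsa_average(alpha, n_steps=1_000_000):
    """Scalar LSA theta_{t+1} = theta_t - alpha*(a_t*theta_t - b_t) driven by an
    AR(1) noise sequence; returns the tail-averaged iterate, whose bias w.r.t.
    theta_star grows with the stepsize alpha under correlated data."""
    theta, z, acc = 0.0, 0.0, 0.0
    burn = n_steps // 2
    for t in range(n_steps):
        z = rho * z + np.sqrt(1 - rho ** 2) * rng.normal()   # Markov chain state
        a_t = a_bar + 0.8 * z
        b_t = b_bar + 0.8 * z
        theta -= alpha * (a_t * theta - b_t)
        if t >= burn:
            acc += theta
    return acc / (n_steps - burn)

alpha = 0.04
t_a, t_half = lsa_average(alpha), lsa_average(alpha / 2)
t_rr = 2 * t_half - t_a           # Richardson-Romberg extrapolation
for name, val in [("alpha", t_a), ("alpha/2", t_half), ("RR", t_rr)]:
    print(f"{name:8s} estimate {val:.4f}   |bias| {abs(val - theta_star):.4f}")
```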



Paperid:2268
Authors:Thomas Krapf, Michael Hagn, Paul Miethaner, Alexander Schiller, Lucas Luttner, Bernd Heinrich
Faculty for Computer Science and Data Science, University of Regensburg, Faculty for Computer Science and Data Science, University of Regensburg, Faculty for Computer Science and Data Science, University of Regensburg, Faculty for Computer Science and Data Science, University of Regensburg, Faculty for Computer Science and Data Science, University of Regensburg, Faculty for Computer Science and Data Science, University of Regensburg
Abstract:
Real-world data typically exhibit aleatoric uncertainty which has to be considered during data-driven decision-making to assess the confidence of the decision provided by machine learning models. To propagate aleatoric uncertainty represented by probability distributions (PDs) through neural networks (NNs), both sampling-based and function approximation-based methods have been proposed. However, these methods suffer from significant approximation errors and are not able to accurately represent predictive uncertainty in the NN output. In this paper, we present a novel method, Piecewise Linear Transformation (PLT), for propagating PDs through NNs with piecewise linear activation functions (e.g., ReLU NNs). PLT does not require sampling or specific assumptions about the PDs. Instead, it harnesses the piecewise linear structure of such NNs to determine the propagated PD in the output space. In this way, PLT supports the accurate quantification of predictive uncertainty based on the criterion exactness of the propagated PD. We assess this exactness in theory by showing error bounds for our propagated PD. Further, our experimental evaluation validates that PLT outperforms competing methods on publicly available real-world classification and regression datasets regarding exactness. Thus, the PDs propagated by PLT allow one to assess the uncertainty of the provided decisions, offering valuable support.



Paperid:2269
Authors:Ang Li, Judea Pearl
Florida State University, University of California, Los Angeles
Abstract:
Probabilities of causation are proven to be critical in modern decision-making. This paper deals with the problem of estimating the probabilities of causation when treatment and effect are not binary. Pearl defined the binary probabilities of causation, such as the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN). Tian and Pearl then derived sharp bounds for these probabilities of causation using experimental and observational data. In this paper, we define and provide theoretical bounds for all types of probabilities of causation with multivalued treatments and effects. We further discuss examples where our bounds guide practical decisions and use simulation studies to evaluate how informative the bounds are for various data combinations.
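
For reference, the binary-treatment, binary-outcome case that this work generalizes admits the classic Tian-Pearl bounds on PNS, which combine experimental quantities P(y|do(x)), P(y|do(x')) with the observational joint P(X, Y). The sketch below evaluates those binary bounds on hypothetical numbers; it is not the multi-valued machinery proposed in the paper.

```python
def pns_bounds(p_y_do_x, p_y_do_xp, p_xy, p_xyp, p_xpy, p_xpyp):
    """Tian-Pearl bounds on the probability of necessity and sufficiency (PNS)
    for binary treatment X and outcome Y; p_xy = P(X=x, Y=y), etc."""
    p_y = p_xy + p_xpy                      # observational marginal P(y)
    lower = max(0.0,
                p_y_do_x - p_y_do_xp,
                p_y - p_y_do_xp,
                p_y_do_x - p_y)
    upper = min(p_y_do_x,
                1.0 - p_y_do_xp,            # P(y' | do(x'))
                p_xy + p_xpyp,
                p_y_do_x - p_y_do_xp + p_xyp + p_xpy)
    return lower, upper

# hypothetical numbers: a trial with mild confounding in the observational data
lo, hi = pns_bounds(p_y_do_x=0.7, p_y_do_xp=0.3,
                    p_xy=0.35, p_xyp=0.15, p_xpy=0.10, p_xpyp=0.40)
print(f"PNS lies in [{lo:.2f}, {hi:.2f}]")
```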



Paperid:2270
Authors:Ang Li, Judea Pearl
Florida State University, University of California, Los Angeles
Abstract:
The unit selection problem aims to identify a set of individuals who are most likely to exhibit a desired mode of behavior or to evaluate the percentage of such individuals in a given population, for example, selecting individuals who would respond one way if encouraged and a different way if not encouraged. Using a combination of experimental and observational data, Li and Pearl solved the binary unit selection problem (binary treatment and effect) by deriving tight bounds on the "benefit function," which is the payoff/cost associated with selecting an individual with given characteristics. This paper extends the benefit function to the general form such that the treatment and effect are not restricted to binary. We then propose an algorithm to test the identifiability of the nonbinary benefit function and an algorithm to compute the bounds of the nonbinary benefit function using experimental and observational data.



Paperid:2271
Authors:Jinzhao Li, Nan Jiang, Yexiang Xue
Purdue University, Purdue University, Purdue University
Abstract:
Satisfiability Modulo Counting (SMC) encompasses problems that require both symbolic decision-making and statistical reasoning. Its general formulation captures many real-world problems at the intersection of symbolic and statistical AI. SMC searches for policy interventions to control probabilistic outcomes. Solving SMC is challenging because of its highly intractable nature (NP^PP-complete), incorporating statistical inference and symbolic reasoning. Previous research on SMC solving lacks provable guarantees and/or suffers from suboptimal empirical performance, especially when combinatorial constraints are present. We propose XOR-SMC, a polynomial algorithm with access to NP-oracles, to solve highly intractable SMC problems with constant approximation guarantees. XOR-SMC transforms the highly intractable SMC into satisfiability problems by replacing the model counting in SMC with SAT formulae subject to randomized XOR constraints. Experiments on solving important SMC problems in AI for social good demonstrate that XOR-SMC outperforms several baselines both in solution quality and running time.
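
The hashing trick behind XOR-based counting (replacing a model count by satisfiability under random parity constraints) can be illustrated at a toy scale. The brute-force sketch below enumerates the assignments of a small hypothetical CNF and estimates its model count as roughly 2^i, where i is the largest number of random XOR constraints the formula typically survives; an actual solver such as XOR-SMC would query a SAT oracle rather than enumerate.

```python
import itertools, random

def sat_cnf(bits, clauses):
    # clause = list of ints; literal i > 0 means variable i is True (1-indexed)
    return all(any((lit > 0) == bits[abs(lit) - 1] for lit in cl) for cl in clauses)

def sat_xors(bits, xors):
    # xor constraint = (list of 0-indexed variables, required parity)
    return all(sum(bits[v] for v in vs) % 2 == p for vs, p in xors)

def survives(n, clauses, i, rng, trials=9):
    """Does the CNF typically stay satisfiable after adding i random XORs?
    Each random parity constraint halves the solution set in expectation."""
    hits = 0
    for _ in range(trials):
        xors = [([v for v in range(n) if rng.random() < 0.5], rng.randint(0, 1))
                for _ in range(i)]
        hits += any(sat_cnf(b, clauses) and sat_xors(b, xors)
                    for b in itertools.product([0, 1], repeat=n))
    return hits > trials // 2

rng = random.Random(0)
n = 10
clauses = [[1, 2], [-1, 3], [4, -5, 6]]        # small hypothetical CNF
true_count = sum(sat_cnf(b, clauses) for b in itertools.product([0, 1], repeat=n))
i = 0
while survives(n, clauses, i + 1, rng):
    i += 1
print(f"true model count = {true_count}, XOR-hash estimate ~ 2^{i} = {2 ** i}")
```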



Paperid:2272
Authors:Yuequn Liu, Ruichu Cai, Wei Chen, Jie Qiao, Yuguang Yan, Zijian Li, Keli Zhang, Zhifeng Hao
School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China Peng Cheng Laboratory, Shenzhen, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates, Huawei Noah’s Ark Lab, Huawei, Paris, France, College of Science, Shantou University, Shantou, China
Abstract:
Learning Granger causality from event sequences is a challenging but essential task across various applications. Most existing methods rely on the assumption that event sequences are independent and identically distributed (i.i.d.). However, this i.i.d. assumption is often violated due to the inherent dependencies among the event sequences. Fortunately, in practice, we find these dependencies can be modeled by a topological network, suggesting a potential solution to the non-i.i.d. problem by introducing the prior topological network into Granger causal discovery. This observation prompts us to tackle two ensuing challenges: 1) how to model the event sequences while incorporating both the prior topological network and the latent Granger causal structure, and 2) how to learn the Granger causal structure. To this end, we devise a unified topological neural Poisson auto-regressive model with two processes. In the generation process, we employ a variant of the neural Poisson process to model the event sequences, considering influences from both the topological network and the Granger causal structure. In the inference process, we formulate an amortized inference algorithm to infer the latent Granger causal structure. We encapsulate these two processes within a unified likelihood function, providing an end-to-end framework for this task. Experiments on simulated and real-world data demonstrate the effectiveness of our approach.



Paperid:2273
Authors:Malte Luttermann, Tanya Braun, Ralf Möller, Marcel Gehrke
German Research Center for Artificial Intelligence (DFKI) University of Lübeck, University of Münster, German Research Center for Artificial Intelligence (DFKI) University of Lübeck, University of Lübeck
Abstract:
Lifted probabilistic inference exploits symmetries in a probabilistic model to allow for tractable probabilistic inference with respect to domain sizes. To apply lifted inference, a lifted representation has to be obtained, and to do so, the so-called colour passing algorithm is the state of the art. The colour passing algorithm, however, is bound to a specific inference algorithm and we found that it ignores commutativity of factors while constructing a lifted representation. We contribute a modified version of the colour passing algorithm that uses logical variables to construct a lifted representation independent of a specific inference algorithm while at the same time exploiting commutativity of factors during an offline-step. Our proposed algorithm efficiently detects more symmetries than the state of the art and thereby drastically increases compression, yielding significantly faster online query times for probabilistic inference when the resulting model is applied.



Paperid:2274
Authors:Phuoc Nguyen, Truyen Tran, Sunil Gupta, Thin Nguyen, Svetha Venkatesh
Deakin University, Deakin University, Deakin University, Australia, Deakin University, Deakin University
Abstract:
Identifying root causes of anomalies in causal processes is vital across disciplines. Once identified, one can isolate the root causes and implement necessary measures to restore the normal operation. Causal processes are often modelled as graphs with entities being nodes and their paths/interconnections as edges. Existing works only consider the contribution of nodes in the generative process and thus cannot attribute the outlier score to the edges of the mechanism if the anomaly occurs in the connections. In this paper, we consider both individual edges and nodes of each mechanism when identifying the root causes. We introduce a noisy functional causal model to account for this purpose. Then, we employ Bayesian learning and inference methods to infer the noises of the nodes and edges. We then represent the functional form of a target outlier leaf as a function of the node and edge noises. Finally, we propose an efficient gradient-based attribution method to compute the anomaly attribution scores which scales linearly with the number of nodes and edges. Experiments on simulated datasets and two real-world scenario datasets show better anomaly attribution performance of the proposed method compared to the baselines. Our method scales to larger graphs with more nodes and edges.



Paperid:2275
Authors:Jie Qiao, Zhengming Chen, Jianhua Yu, Ruichu Cai, Zhifeng Hao
School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China, School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE, School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China, School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China Peng Cheng Laboratory, Shenzhen 518066, China, College of Science, Shantou University, Shantou 515063, China
Abstract:
Missing data are an unavoidable complication frequently encountered in many causal discovery tasks. When a missing process depends on the missing values themselves (known as self-masking missingness), the recovery of the joint distribution becomes unattainable, and detecting the presence of such self-masking missingness remains a perplexing challenge. Consequently, due to the inability to reconstruct the original distribution and to discern the underlying missingness mechanism, simply applying existing causal discovery methods would lead to wrong conclusions. In this work, we found that recent advances in additive noise models have the potential for learning causal structure under the existence of the self-masking missingness. With this observation, we aim to investigate the identification problem of learning causal structure from missing data under an additive noise model with different missingness mechanisms, where the `no self-masking missingness' assumption can be eliminated appropriately. Specifically, we first elegantly extend the scope of identifiability of causal skeleton to the case with weak self-masking missingness (i.e., no other variable could be the cause of self-masking indicators except itself). We further provide the sufficient and necessary identification conditions of the causal direction under additive noise model and show that the causal structure can be identified up to an IN-equivalent pattern. We finally propose a practical algorithm based on the above theoretical results on learning the causal skeleton and causal direction. Extensive experiments on synthetic and real data demonstrate the efficiency and effectiveness of the proposed algorithms.



Paperid:2276
Authors:Jie Qiao, Yu Xiang, Zhengming Chen, Ruichu Cai, Zhifeng Hao
School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China, School of Computer Science, Guangdong University of Technology, Guangzhou, China Peng Cheng Laboratory, Shenzhen, China, College of Science, Shantou University, Shantou, China
Abstract:
Count data naturally arise in many fields, such as finance, neuroscience, and epidemiology, and discovering causal structure among count data is a crucial task in various scientific and industrial scenarios. One of the most common characteristics of count data is the inherent branching structure described by a binomial thinning operator and an independent Poisson distribution that captures both branching and noise. For instance, in a population count scenario, mortality and immigration contribute to the count, where survival follows a Bernoulli distribution, and immigration follows a Poisson distribution. However, causal discovery from such data is challenging due to the non-identifiability issue: a single causal pair is Markov equivalent, i.e., X->Y and Y->X are equivalent in distribution. Fortunately, in this work, we found that the causal order from X to its child Y is identifiable if X is a root vertex and has at least two directed paths to Y, or the ancestor of X with the most directed paths to X has a directed path to Y without passing through X. Specifically, we propose a Poisson Branching Structure Causal Model (PB-SCM) and perform a path analysis on PB-SCM using high-order cumulants. Theoretical results establish the connection between the path and cumulant and demonstrate that the path information can be obtained from the cumulant. With the path information, causal order is identifiable under some graphical conditions. A practical algorithm for learning causal structure under PB-SCM is proposed and the experiments demonstrate and verify the effectiveness of the proposed method.



Paperid:2277
Authors:Vidya Sagar Sharma
Tata Institute of Fundamental Research, Mumbai
Abstract:
Causal DAGs (also known as Bayesian networks) are a popular tool for encoding conditional dependencies between random variables. In a causal DAG, the random variables are modeled as vertices in the DAG, and it is stipulated that every random variable is independent of its non-descendants conditioned on its parents. It is possible, however, for two different causal DAGs on the same set of random variables to encode exactly the same set of conditional dependencies. Such causal DAGs are said to be Markov equivalent, and equivalence classes of Markov equivalent DAGs are known as Markov Equivalent Classes (MECs). Beautiful combinatorial characterizations of MECs have been developed in the past few decades, and it is known, in particular, that all DAGs in the same MEC must have the same skeleton (underlying undirected graph) and v-structures (induced subgraph of the form a->b<-c). These combinatorial characterizations also suggest several natural algorithmic questions. One of these is: given an undirected graph G as input, how many distinct Markov equivalence classes have the skeleton G? Much work has been devoted in the last few years to this and other closely related problems. However, to the best of our knowledge, a polynomial-time algorithm for the problem remains unknown. In this paper, we make progress towards this goal by giving a fixed parameter tractable algorithm for the above problem, with the parameters being the treewidth and the maximum degree of the input graph G. The main technical ingredient in our work is a construction we refer to as shadow, which lets us create a local description of long-range constraints imposed by the combinatorial characterizations of MECs.



Paperid:2278
Authors:Shouta Sugahara, Koya Kato, Maomi Ueno
The University of Electro-Communications, The University of Electro-Communications, The University of Electro-Communications
Abstract:
This study proposes and evaluates a new Bayesian network classifier (BNC) having an I-map structure with the fewest class variable parameters among all structures for which the class variable has no parent. Moreover, a new learning algorithm to learn our proposed model is presented. The proposed method is guaranteed to obtain the true classification probability asymptotically. Moreover, the method has lower computational costs than those of exact learning BNC using marginal likelihood. Comparison experiments have demonstrated the superior performance of the proposed method.



Paperid:2279
Authors:Armin Toroghi, Scott Sanner
Department of Mechanical and Industrial Engineering, University of Toronto, Department of Mechanical and Industrial Engineering, University of Toronto Vector Institute of Artificial Intelligence, Toronto
Abstract:
Knowledge Graphs (KGs) provide a widely used format for representing entities and their relationships and have found use in diverse applications including question answering and recommendation. A majority of current research on KG inference has focused on reasoning with atomic facts (triples) and has disregarded the possibility of making complex evidential observations involving logical operators (negation, conjunction, disjunction) and quantifiers (existential, universal). Further, while the application of complex evidence has been explored in KG-based query answering (KGQA) research, in many practical online settings, observations are made sequentially. For example, in KGQA, additional context may be incrementally suggested to narrow down the answer. Or in interactive recommendation, user critiques may be expressed sequentially in order to narrow down a set of preferred items. Both settings are indicative of information filtering or tracking tasks that are reminiscent of belief tracking in Bayesian inference. In fact, in this paper, we precisely cast the problem of belief tracking over unknown KG entities given incremental complex KG evidence as a Bayesian filtering problem. Specifically, we leverage Knowledge-based Model Construction (KBMC) over the logical KG evidence to instantiate a Markov Random Field (MRF) likelihood representation to perform closed-form Bayesian inference with complex KG evidence (BIKG). We experimentally evaluate BIKG in incremental KGQA and interactive recommendation tasks demonstrating that it outperforms non-incremental methodologies and leads to better incorporation of conjunctive evidence vs. existing complex KGQA methods like CQD that leverage fuzzy T-norm operators. Overall, this work demonstrates a novel, efficient, and unified perspective of logic, KGs, and online inference through the lens of closed-form BIKG.



Paperid:2280
Authors:Russell Tsuchida, Cheng Soon Ong, Dino Sejdinovic
Data61-CSIRO, Data61-CSIRO Australian National University, University of Adelaide
Abstract:
We introduce squared neural Poisson point processes (SNEPPPs) by parameterising the intensity function by the squared norm of a two layer neural network. When the hidden layer is fixed and the second layer has a single neuron, our approach resembles previous uses of squared Gaussian process or kernel methods, but allowing the hidden layer to be learnt allows for additional flexibility. In many cases of interest, the integrated intensity function admits a closed form and can be computed in quadratic time in the number of hidden neurons. We enumerate a far more extensive number of such cases than has previously been discussed. Our approach is more memory and time efficient than naive implementations of squared or exponentiated kernel methods or Gaussian processes. Maximum likelihood and maximum a posteriori estimates in a reparameterisation of the final layer of the intensity function can be obtained by solving a (strongly) convex optimisation problem using projected gradient descent. We demonstrate SNEPPPs on real and synthetic benchmarks, and provide a software implementation.
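
To make the parameterisation concrete, the sketch below models a one-dimensional intensity as the squared norm of a two-layer network and fits it by maximum likelihood with a numerically integrated compensator; the closed-form integrals and the convex reparameterisation discussed in the abstract are not used here, and the event data, network width, and optimiser are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
T = 10.0
events = np.sort(rng.uniform(0, T, size=60))   # hypothetical event times on [0, T]
H = 8                                          # hidden width of the two-layer network
grid = np.linspace(0, T, 400)                  # quadrature grid for the integral term
dx = grid[1] - grid[0]

def unpack(theta):
    w1, b1 = theta[:H], theta[H:2 * H]
    w2 = theta[2 * H:].reshape(1, H)
    return w1, b1, w2

def intensity(x, theta):
    """Squared-norm intensity lambda(x) = || w2 tanh(w1*x + b1) ||^2  (>= 0)."""
    w1, b1, w2 = unpack(theta)
    h = np.tanh(np.outer(x, w1) + b1)          # shape (n, H)
    out = h @ w2.T                             # shape (n, 1)
    return (out ** 2).sum(axis=1) + 1e-8

def neg_log_lik(theta):
    # Poisson process log-likelihood: sum_i log lambda(x_i) - integral of lambda
    integral = intensity(grid, theta).sum() * dx
    return integral - np.log(intensity(events, theta)).sum()

theta0 = rng.normal(scale=0.5, size=3 * H)
res = minimize(neg_log_lik, theta0, method="L-BFGS-B")
print("fitted expected number of events:", intensity(grid, res.x).sum() * dx)
```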



Paperid:2281
Authors:Gabriele Venturato, Vincent Derkinderen, Pedro Zuidberg Dos Martires, Luc De Raedt
KU Leuven, KU Leuven, Örebro University, KU Leuven Örebro University
Abstract:
Decision making under uncertainty in dynamic environments is a fundamental AI problem in which agents need to determine which decisions (or actions) to make at each time step to maximise their expected utility. Dynamic decision networks (DDNs) are an extension of dynamic Bayesian networks with decisions and utilities. DDNs can be used to compactly represent Markov decision processes (MDPs). We propose a novel algorithm called mapl-cirup that leverages knowledge compilation techniques developed for (dynamic) Bayesian networks to perform inference and gradient-based learning in DDNs. Specifically, we knowledge-compile the Bellman update present in DDNs into dynamic decision circuits and evaluate them within an (algebraic) model counting framework. In contrast to other exact symbolic MDP approaches, we obtain differentiable circuits that enable gradient-based parameter learning.



Paperid:2282
Authors:Marcel Wienöbst, Benito van der Zander, Maciej Liśkiewicz
University of Lübeck, University of Lübeck, University of Lübeck
Abstract:
Causal effect estimation from observational data is a fundamental task in empirical sciences. It becomes particularly challenging when unobserved confounders are involved in a system. This paper focuses on front-door adjustment – a classic technique which, using observed mediators, allows identifying causal effects even in the presence of unobserved confounding. While the statistical properties of the front-door estimation are quite well understood, its algorithmic aspects remained unexplored for a long time. In 2022, Jeong, Tian, and Bareinboim presented the first polynomial-time algorithm for finding sets satisfying the front-door criterion in a given directed acyclic graph (DAG), with an O(n³(n+m)) run time, where n denotes the number of variables and m the number of edges of the causal graph. In our work, we give the first linear-time, i.e., O(n+m), algorithm for this task, which thus reaches the asymptotically optimal time complexity. This result implies an O(n(n+m)) delay enumeration algorithm of all front-door adjustment sets, again improving previous work by a factor of n³. Moreover, we provide the first linear-time algorithm for finding a minimal front-door adjustment set. We offer implementations of our algorithms in multiple programming languages to facilitate practical usage and empirically validate their feasibility, even for large graphs.
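
Once an admissible mediator set has been found, the estimator that the front-door criterion licenses is the standard adjustment formula P(y|do(x)) = Σ_m P(m|x) Σ_{x'} P(y|x',m) P(x'). The sketch below applies it to data simulated from a hypothetical confounded model with structure X -> M -> Y and an unobserved U affecting X and Y, and contrasts it with the naive conditional estimate; it illustrates the formula only, not the paper's graph algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.binomial(1, 0.5, n)                     # unobserved confounder
x = rng.binomial(1, 0.2 + 0.6 * u)              # treatment, confounded by U
m = rng.binomial(1, 0.1 + 0.7 * x)              # mediator on the front-door path
y = rng.binomial(1, 0.1 + 0.5 * m + 0.3 * u)    # outcome, confounded by U

def p(mask):
    return mask.mean()

def front_door(x_val):
    """P(y=1 | do(X=x_val)) via the front-door adjustment formula."""
    total = 0.0
    for m_val in (0, 1):
        p_m_given_x = p(m[x == x_val] == m_val)
        inner = 0.0
        for xp in (0, 1):
            sel = (x == xp) & (m == m_val)
            inner += p(y[sel] == 1) * p(x == xp)
        total += p_m_given_x * inner
    return total

print("front-door effect  :", front_door(1) - front_door(0))
print("naive observational:", p(y[x == 1] == 1) - p(y[x == 0] == 1))
```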



Paperid:2283
Authors:Kevin Xia, Elias Bareinboim
Columbia University, Columbia University
Abstract:
The ability of humans to understand the world in terms of cause and effect relationships, as well as their ability to compress information into abstract concepts, are two hallmark features of human intelligence. These two topics have been studied in tandem under the theory of causal abstractions, but it is an open problem how to best leverage abstraction theory in real-world causal inference tasks, where the true model is not known, and limited data is available in most practical settings. In this paper, we focus on a family of causal abstractions constructed by clustering variables and their domains, redefining abstractions to be amenable to individual causal distributions. We show that such abstractions can be learned in practice using Neural Causal Models, allowing us to utilize the deep learning toolkit to solve causal tasks (identification, estimation, sampling) at different levels of abstraction granularity. Finally, we show how representation learning can be used to learn abstractions, which we apply in our experiments to scale causal inferences to high dimensional settings such as with image data.



Paperid:2284
Authors:Hantao Yang, Xutong Liu, Zhiyong Wang, Hong Xie, John C. S. Lui, Defu Lian, Enhong Chen
University of Science and Technology of China, The Chinese University of Hong Kong, The Chinese University of Hong Kong, University of Science and Technology of China, The Chinese University of Hong Kong, University of Science and Technology of China, University of Science and Technology of China
Abstract:
We study the problem of federated contextual combinatorial cascading bandits, where agents collaborate under the coordination of a central server to provide tailored recommendations to users. Existing works consider either a synchronous framework, necessitating full agent participation and global synchronization, or assume user homogeneity with identical behaviors. We overcome these limitations by considering (1) federated agents operating in an asynchronous communication paradigm, where no mandatory synchronization is required and all agents communicate independently with the server, (2) heterogeneous user behaviors, where users can be stratified into latent user clusters, each exhibiting distinct preferences. For this setting, we propose a UCB-type algorithm with delicate communication protocols. Through theoretical analysis, we give sub-linear regret bounds on par with those achieved in the synchronous framework, while incurring only logarithmic communication costs. Empirical evaluation on synthetic and real-world datasets validates our algorithm's superior performance in terms of regrets and communication costs.



Paperid:2285
Authors:Shenbao Yu, Yifeng Zeng, Fan Yang, Yinghui Pan
Xiamen University, China, Northumbria University, UK, Xiamen University, China, Shenzhen University, China
Abstract:
Knowing a prerequisite structure among skills in a subject domain effectively enables several educational applications, including intelligent tutoring systems and curriculum planning. Traditionally, educators or domain experts use intuition to determine the skills' prerequisite relationships, which is time-consuming and prone to fall into the trap of blind spots. In this paper, we focus on inferring the prerequisite structure given access to students' performance on exercises in a subject. Nevertheless, it is challenging since students' mastery of skills cannot be directly observed, but can only be estimated, i.e., it is latent in nature. To tackle this problem, we propose a causal-driven skill prerequisite structure discovery (CSPS) method in a two-stage learning framework. In the first stage, we learn the skills' correlation relationships presented in the covariance matrix from the student performance data while, through the predicted covariance matrix in the second stage, we consider a heuristic method based on conditional independence tests and standardized partial variance to discover the prerequisite structure. We demonstrate the performance of the new approach with both simulated and real-world data. The experimental results show the effectiveness of the proposed model for identifying the skills' prerequisite structure.



Paperid:2286
Authors:Weijia Zhang, Chun Kai Ling, Xuanhui Zhang
The University of Newcastle, Carnegie Mellon University, Nanjing University
Abstract:
Censoring is the central problem in survival analysis where either the time-to-event (for instance, death) or the time-to-censoring (such as loss of follow-up) is observed for each sample. The majority of existing machine learning-based survival analysis methods assume that survival is conditionally independent of censoring given a set of covariates; an assumption that cannot be verified since only marginal distributions are available from the data. The existence of dependent censoring, along with the inherent bias in current estimators, has been demonstrated in a variety of applications, accentuating the need for a more nuanced approach. However, existing methods that adjust for dependent censoring require practitioners to specify the ground truth copula. This requirement poses a significant challenge for practical applications, as model misspecification can lead to substantial bias. In this work, we propose a flexible deep learning-based survival analysis method that simultaneously accommodates dependent censoring and eliminates the requirement for specifying the ground truth copula. We theoretically prove the identifiability of our model under a broad family of copulas and survival distributions. Experimental results from a wide range of datasets demonstrate that our approach successfully discerns the underlying dependency structure and significantly reduces survival estimation bias when compared to existing methods.



Paperid:2287
Authors:Ahmed Abbas, Paul Swoboda
Max Planck Institute for Informatics, Saarland Informatics Campus, Max Planck Institute for Informatics, Saarland Informatics Campus University of Mannheim Heinrich-Heine University Dusseldorf
Abstract:
We present a fast, scalable, data-driven approach for solving relaxations of 0-1 integer linear programs. We use a combination of graph neural networks (GNN) and a Lagrange decomposition based algorithm. We make the latter differentiable for end-to-end training and use GNNs to predict its algorithmic parameters. This allows us to retain the algorithm's theoretical properties including dual feasibility and guaranteed non-decrease in the lower bound while improving it via training. We overcome suboptimal fixed points of the basic solver by additional non-parametric GNN update steps maintaining dual feasibility. For training we use an unsupervised loss. We train on smaller problems and test on larger ones showing strong generalization performance with a GNN comprising only around 10k parameters. Our solver achieves significantly faster performance and better dual objectives than its non-learned version, achieving close to optimal objective values of LP relaxations of very large structured prediction problems and on selected combinatorial ones. In particular, we achieve better objective values than specialized approximate solvers for specific problem classes while retaining their efficiency. Our solver has better any-time performance over a large time period compared to a commercial solver.



Paperid:2288
Authors:Florent Avellaneda, Roger Villemaire
Université du Québec à Montréal Centre de Recherche de l’Institut Universitaire de Gériatrie de Montréal, Université du Québec à Montréal
Abstract:
The Boolean Matrix Factorization (BMF) problem aims to represent a n×m Boolean matrix as the Boolean product of two matrices of small rank k, where the product is computed using Boolean algebra operations. However, finding a BMF of minimum rank is known to be NP-hard, posing challenges for heuristic algorithms and exact approaches in terms of rank found and computation time, particularly as matrix size or the number of entries equal to 1 grows. In this paper, we present a new approach to simplifying the matrix to be factorized by reducing the number of 1-entries, which allows us to directly recover a Boolean factorization of the original matrix from its simplified version. We introduce two types of simplification: one that performs numerous simplifications without preserving the original rank and another that performs fewer simplifications but guarantees that an optimal BMF on the simplified matrix yields an optimal BMF on the original matrix. Furthermore, our experiments show that our approach outperforms existing exact BMF algorithms.



Paperid:2289
Authors:Lulu Cao, Yufei Liu, Zhenzhong Wang, Dejun Xu, Kai Ye, Kay Chen Tan, Min Jiang
School of Informatics, Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen University Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, School of Informatics, Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen University Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Department of Computing, The Hong Kong Polytechnic University, School of Informatics, Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen University Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, School of Informatics, Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen University Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Department of Computing, The Hong Kong Polytechnic University, School of Informatics, Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen University Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Abstract:
In recent years, machine learning algorithms, especially deep learning, have shown promising prospects in solving Partial Differential Equations (PDEs). However, as the dimension increases, the relationship and interaction between variables become more complex, and existing methods are difficult to provide fast and interpretable solutions for high-dimensional PDEs. To address this issue, we propose a genetic programming symbolic regression algorithm based on transfer learning and automatic differentiation to solve PDEs. This method uses genetic programming to search for a mathematically understandable expression and combines automatic differentiation to determine whether the search result satisfies the PDE and boundary conditions to be solved. To overcome the problem of slow solution speed caused by large search space, we propose a transfer learning mechanism that transfers the structure of one-dimensional PDE analytical solution to the form of high-dimensional PDE solution. We tested three representative types of PDEs, and the results showed that our proposed method can obtain reliable and human-understandable real solutions or algebraic equivalent solutions of PDEs, and the convergence speed is better than the compared methods. Code of this project is at https://github.com/grassdeerdeer/HD-TLGP.



Paperid:2290
Authors:Qingyun Chen, Sungjin Im, Benjamin Moseley, Chenyang Xu, Ruilong Zhang
Electrical Engineering and Computer Science, University of California at Merced, Electrical Engineering and Computer Science, University of California at Merced, Tepper School of Business, Carnegie Mellon University, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Department of Computer Science and Engineering, University at Buffalo
Abstract:
The feedback arc set problem is one of the most fundamental and well-studied ranking problems where n objects are to be ordered based on their pairwise comparison. The problem enjoys several efficient approximation algorithms in the offline setting. Unfortunately, in the online setting, there are strong lower bounds on the competitive ratio establishing that no algorithm can perform well in the worst case. This paper introduces a new beyond-worst-case model for online feedback arc set. In the model, a sample of the input is given to the algorithm offline before the remaining instance is revealed online. This models the case in practice where yesterday's data is available and is similar to today's online instance. This sample is drawn from a known distribution which may not be uniform. We design an online algorithm with strong theoretical guarantees. The algorithm has a small constant competitive ratio when the sample is uniform---if not, we show we can recover the same result by adding a provably minimal sample. Empirical results validate the theory and show that such algorithms can be used on temporal data to obtain strong results.



Paperid:2291
Authors:Samantha Chen, Puoya Tabaghi, Yusu Wang
University of California, San Diego, University of California, San Diego, University of California, San Diego
Abstract:
Optimal transport provides a metric which quantifies the dissimilarity between probability measures. For measures supported in discrete metric spaces, finding the optimal transport distance has cubic time complexity in the size of the space. However, measures supported on trees admit a closed-form optimal transport that can be computed in linear time. In this paper, we aim to find an optimal tree structure for a given discrete metric space so that the tree-Wasserstein distance approximates the optimal transport distance in the original space. One of our key ideas is to cast the problem in ultrametric spaces. This helps us optimize over the space of ultrametric trees --- a mixed-discrete and continuous optimization problem --- via projected gradient descent over the space of ultrametric matrices. During optimization, we project the parameters to the ultrametric space via a hierarchical minimum spanning tree algorithm, equivalent to the closest projection to ultrametrics under the supremum norm. Experimental results on real datasets show that our approach outperforms previous approaches (e.g. Flowtree, Quadtree) in approximating optimal transport distances. Finally, experiments on synthetic data generated on ground truth trees show that our algorithm can accurately uncover the underlying trees.
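
The projection step mentioned above can be reproduced with standard tools: single-linkage cophenetic distances equal the minimax (MST-path) distances, i.e. the subdominant ultrametric, which is the largest ultrametric lying below the input metric, and a classical observation is that shifting it up by half its maximum gap to the input yields a supremum-norm-optimal ultrametric fit. The sketch below demonstrates only this projection on a hypothetical point cloud; it is not the authors' full tree-optimisation pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 5))              # hypothetical point cloud
d = pdist(points)                              # condensed pairwise distances

# Single-linkage cophenetic distances = minimax path distances on the MST,
# i.e. the subdominant ultrametric u <= d (entrywise).
Z = linkage(d, method="single")
u = cophenet(Z)

gap = np.max(d - u)                            # how far u sits below d at worst
u_fit = u + gap / 2                            # shifted ultrametric, l_inf error <= gap/2

print("max_{ij} (d - u)            :", gap)
print("l_inf error of shifted fit  :", np.max(np.abs(d - u_fit)))
```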



Paperid:2292
Authors:Xianrun Chen, Dachuan Xu, Yicheng Xu, Yong Zhang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China University of Chinese Academy of Sciences, Beijing, China, Beijing University of Technology, Beijing, China, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China University of Chinese Academy of Sciences, Beijing, China, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China University of Chinese Academy of Sciences, Beijing, China
Abstract:
Clustering is one of the most fundamental tools in artificial intelligence, machine learning, and data mining. In this paper, we follow one of the recent mainstream topics of clustering, Sum of Radii (SoR), which naturally arises as a balance between the folklore k-center and k-median. SoR aims to determine a set of k balls, each centered at a point in a given dataset, such that their union covers the entire dataset while minimizing the sum of radii of the k balls. We propose a general technical framework to overcome the challenge posed by varying radii in SoR, which yields fixed-parameter tractable (fpt) algorithms with respect to k (i.e., whose running time is f(k) poly(n) for some f). Our framework is versatile and obtains fpt approximation algorithms with constant approximation ratios for SoR as well as its variants in general metrics, such as Fair SoR and Matroid SoR, which significantly improve the previous results.



Paperid:2293
Authors:Zhe Chen, Daniel Harabor, Jiaoyang Li, Peter J. Stuckey
Monash University, Monash University, Carnegie Mellon University, Monash University OPTIMA Australian Research Council ITTC
Abstract:
Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics that asks us to compute collision-free paths for a team of agents, all moving across a shared map. Although many works appear on this topic, all current algorithms struggle as the number of agents grows. The principal reason is that existing approaches typically plan free-flow optimal paths, which creates congestion. To tackle this issue, we propose a new approach for MAPF where agents are guided to their destination by following congestion-avoiding paths. We evaluate the idea in two large-scale settings: one-shot MAPF, where each agent has a single destination, and lifelong MAPF, where agents are continuously assigned new destinations. Empirically, we report large improvements in solution quality for one-shot MAPF and in overall throughput for lifelong MAPF.



Paperid:2294
Authors:Benjamin Doerr, Aymen Echarghaoui, Mohammed Jamal, Martin S. Krejca
Laboratoire d’Informatique (LIX), CNRS, École Polytechnique, Institut Polytechnique de Paris, École Polytechnique, Institut Polytechnique de Paris, École Polytechnique, Institut Polytechnique de Paris, Laboratoire d’Informatique (LIX), CNRS, École Polytechnique, Institut Polytechnique de Paris
Abstract:
Most evolutionary algorithms used in practice heavily employ crossover. In contrast, the rigorous understanding of how crossover is beneficial is largely lagging behind. In this work, we make a considerable step forward by analyzing the population dynamics of the (µ+1) genetic algorithm when optimizing the Jump benchmark. We observe (and prove via mathematical means) that once the population contains two different individuals on the local optimum, the diversity in the population increases in expectation. From this drift towards more diverse states, we show that a diversity suitable for crossover to be effective is reached quickly and, more importantly, then persists for a time that is at least exponential in the population size µ. This drastically improves over the previously best known guarantee, which is only quadratic in µ. Our new understanding of the population dynamics easily gives stronger performance guarantees. In particular, we derive that population sizes logarithmic in the problem size n suffice to gain an Ω(n)-factor runtime improvement from crossover (previous works achieved comparable bounds only with µ = Θ(n) or a non-standard mutation rate).
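
For readers who want to experiment with the analysed setting, the sketch below implements a bare-bones (µ+1) genetic algorithm with uniform crossover and standard bit mutation on the Jump_k benchmark. The problem size, jump parameter, population size, and crossover probability are illustrative choices, not the constants from the analysis.

```python
import random

def jump(x, k):
    """Jump_k fitness: k + |x|_1 on the easy region and the optimum, else n - |x|_1."""
    ones, n = sum(x), len(x)
    return k + ones if ones <= n - k or ones == n else n - ones

def mu_plus_one_ga(n=20, k=3, mu=8, p_crossover=0.9, max_evals=50_000, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    for evals in range(max_evals):
        if rng.random() < p_crossover:
            p1, p2 = rng.sample(pop, 2)
            child = [rng.choice((a, b)) for a, b in zip(p1, p2)]   # uniform crossover
        else:
            child = list(rng.choice(pop))
        child = [bit ^ (rng.random() < 1 / n) for bit in child]    # bit mutation, rate 1/n
        pop.append(child)
        pop.remove(min(pop, key=lambda x: jump(x, k)))             # drop a worst individual
        if any(sum(x) == n for x in pop):
            return evals + 1
    return None

print("evaluations until the all-ones optimum:", mu_plus_one_ga())
```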



Paperid:2295
Authors:Simon Dold, Malte Helmert
University of Basel, University of Basel
Abstract:
Classical planning considers a given task and searches for a plan to solve it. Some tasks are harder to solve than others. We can measure the 'hardness' of a task with the novelty width and the correlation complexity. In this work, we compare these measures. Additionally, we introduce the river measure, a new measure that is based on potential heuristics and therefore similar to the correlation complexity but also comparable to the novelty width. We show that the river measure is upper bounded by the correlation complexity and by the novelty width +1. Furthermore, we show that we can convert a planning task with a polynomial blowup of the task size to ensure that a heuristic of dimension 2 exists that gives rise to backtrack-free search.



Paperid:2296
Authors:Kaan Gokcesu, Hakan Gökcesu
MIT Regrify, Bilkent University, Turkey Turkcell Technology, Turkey
Abstract:
We study the problem of global optimization, where we analyze the performance of the Piyavskii-Shubert algorithm and its variants. For any given time duration T, instead of the extensively studied simple regret (which is the difference of the losses between the best estimate up to T and the global minimum), we study the cumulative regret up to time T. For L-Lipschitz continuous functions, we show that the cumulative regret is O(L log T). For H-Lipschitz smooth functions, we show that the cumulative regret is O(H). We analytically extend our results for functions with Hölder continuous derivatives, which cover both the Lipschitz continuous and the Lipschitz smooth functions, individually. We further show that a simpler variant of the Piyavskii-Shubert algorithm performs just as well as the traditional variants for the Lipschitz continuous or the Lipschitz smooth functions. We further extend our results to broader classes of functions, and show that our algorithm efficiently determines its queries and achieves nearly minimax optimal (up to log factors) cumulative regret, for general convex or even concave regularity conditions on the extrema of the objective (which encompasses many preceding regularities). We consider further extensions by investigating the performance of the Piyavskii-Shubert variants in the scenarios with unknown regularity, noisy evaluation and multivariate domain.
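
The Piyavskii-Shubert algorithm itself is short: each iteration evaluates the minimiser of the piecewise-linear lower envelope built from past queries and the Lipschitz constant. The sketch below is a minimal minimisation variant on a hypothetical objective; it does not include the unknown-regularity, noisy-evaluation, or multivariate extensions studied in the paper.

```python
import numpy as np

def piyavskii_shubert(f, a, b, L, n_iters=50):
    """Minimize an L-Lipschitz f on [a, b] by repeatedly querying the point that
    minimizes the piecewise-linear lower bound formed by the cones y_i - L|x - x_i|."""
    xs, ys = [a, b], [f(a), f(b)]
    for _ in range(n_iters):
        order = np.argsort(xs)
        xs = [xs[i] for i in order]
        ys = [ys[i] for i in order]
        best_lb, best_x = np.inf, None
        for (x1, y1), (x2, y2) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
            # intersection of the two lower-bound cones on [x1, x2]
            x_new = 0.5 * (x1 + x2) + (y1 - y2) / (2 * L)
            lb = 0.5 * (y1 + y2) - 0.5 * L * (x2 - x1)
            if lb < best_lb:
                best_lb, best_x = lb, x_new
        xs.append(best_x)
        ys.append(f(best_x))
    i = int(np.argmin(ys))
    return xs[i], ys[i]

f = lambda x: np.sin(3 * x) + 0.5 * np.abs(x - 1.0)   # hypothetical objective, L <= 3.5
x_best, y_best = piyavskii_shubert(f, a=-2.0, b=2.0, L=4.0)
print(f"approximate minimum {y_best:.4f} at x = {x_best:.4f}")
```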



Paperid:2297
Authors:Longkun Guo, Chaoqi Jia, Kewen Liao, Zhigang Lu, Minhui Xue
School of Mathematics and Statistics, Fuzhou University, Fuzhou 350116, China Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250316, China, Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250316, China, HilstLab, Peter Faber Business School, Australian Catholic University, Sydney 2060, Australia, College of Science and Engineering, James Cook University, Townsville 4810, Australia, CSIRO's Data61, Sydney 2015, Australia
Abstract:
Center-based clustering has attracted significant research interest from both theory and practice. In many practical applications, input data often contain background knowledge that can be used to improve clustering results. In this work, we build on widely adopted k-center clustering and model its input background knowledge as must-link (ML) and cannot-link (CL) constraint sets. However, most clustering problems including k-center are inherently NP-hard, while the more complex constrained variants are known to suffer severer approximation and computation barriers that significantly limit their applicability. By employing a suite of techniques including reverse dominating sets, linear programming (LP) integral polyhedron, and LP duality, we arrive at the first efficient approximation algorithm for constrained k-center with the best possible ratio of 2. We also construct competitive baseline algorithms and empirically evaluate our approximation algorithm against them on a variety of real datasets. The results validate our theoretical findings and demonstrate the great advantages of our algorithm in terms of clustering cost, clustering quality, and running time.
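
As a point of reference for the constrained setting, the classic unconstrained k-center baseline is the greedy farthest-point (Gonzalez) heuristic, which already attains the ratio of 2 that the paper matches under ML/CL constraints. A minimal sketch on hypothetical data, not the authors' LP-based algorithm:

```python
import numpy as np

def gonzalez_k_center(points, k, rng):
    """Greedy farthest-point heuristic: a classic 2-approximation for
    unconstrained k-center over points in Euclidean space."""
    centers = [int(rng.integers(len(points)))]
    dist = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))            # farthest point from current centers
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers, float(dist.max())         # covering radius = max distance to a center

rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 2))
centers, radius = gonzalez_k_center(pts, k=5, rng=rng)
print("chosen centers:", centers, " covering radius:", round(radius, 3))
```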



Paperid:2298
Authors:Mingyu Guo, Jialiang Li, Aneta Neumann, Frank Neumann, Hung Nguyen
The University of Adelaide, The University of Adelaide, The University of Adelaide, The University of Adelaide, The University of Adelaide
Abstract:
We propose a combinatorial optimisation model called Limited Query Graph Connectivity Test. We consider a graph whose edges have two possible states (On/Off). The edges' states are hidden initially. We could query an edge to reveal its state. Given a source s and a destination t, we aim to test s−t connectivity by identifying either a path (consisting of only On edges) or a cut (consisting of only Off edges). We are limited to B queries, after which we stop regardless of whether graph connectivity is established. We aim to design a query policy that minimizes the expected number of queries. Our model is mainly motivated by a cyber security use case where we need to establish whether attack paths exist in a given network, between a source (i.e., a compromised user node) and a destination (i.e., a high-privilege admin node). Edge query is resolved by manual effort from the IT admin, which is the motivation behind query minimization. Our model is highly related to Stochastic Boolean Function Evaluation (SBFE). There are two existing exact algorithms for SBFE that are prohibitively expensive. We propose a significantly more scalable exact algorithm. While previous exact algorithms only scale for trivial graphs (i.e., past works experimented on at most 20 edges), we empirically demonstrate that our algorithm is scalable for a wide range of much larger practical graphs (i.e., graphs representing Windows domain networks with tens of thousands of edges). We also propose three heuristics. Our best-performing heuristic is via limiting the planning horizon of the exact algorithm. The other two are via reinforcement learning (RL) and Monte Carlo tree search (MCTS). We also derive an algorithm for computing the performance lower bound. Experimentally, we show that all our heuristics are near optimal. The heuristic building on the exact algorithm outperforms all other heuristics, surpassing RL, MCTS and eight existing heuristics ported from SBFE and related literature.



Paperid:2299
Authors:Takashi Horiyama, Yasuaki Kobayashi, Hirotaka Ono, Kazuhisa Seto, Ryu Suzuki
Hokkaido University, Hokkaido University, Nagoya University, Hokkaido University, Hokkaido University
Abstract:
The uniqueness of an optimal solution to a combinatorial optimization problem attracts many fields of researchers' attention because it has a wide range of applications, it is related to important classes in computational complexity, and the existence of only one solution is often critical for algorithm designs in theory. However, to the best of the authors' knowledge, there is no major benchmark set consisting of only instances with unique solutions, and no algorithm generating instances with unique solutions is known; a systematic approach to obtaining a problem instance guaranteed to have a unique solution would be helpful. A possible approach is as follows: Given a problem instance, we specify a small part of a solution in advance so that only one optimal solution meets the specification. This paper formulates such a ``preassignment'' approach for the vertex cover problem as a typical combinatorial optimization problem and discusses its computational complexity. First, we show that the problem is ΣP2-complete in general, while the problem becomes NP-complete when an input graph is bipartite. We then present an O(2.1996^n)-time algorithm for general graphs and an O(1.9181^n)-time algorithm for bipartite graphs, where n is the number of vertices. The latter is based on an FPT algorithm with O*(3.6791^τ) time for vertex cover number τ. Furthermore, we show that the problem for trees can be solved in O(1.4143^n) time.



Paperid:2300
Authors:Mingming Jin, Jiongzhi Zheng, Kun He
Huazhong University of Science and Technology, Huazhong University of Science and Technology, Huazhong University of Science and Technology
Abstract:
The Maximum k-Defective Clique Problem (MDCP) aims to find a maximum k-defective clique in a given graph, where a k-defective clique is a relaxed clique missing at most k edges. MDCP is NP-hard and finds many real-world applications in analyzing dense but not necessarily complete subgraphs. Exact algorithms for MDCP mainly follow the Branch-and-bound (BnB) framework, whose performance heavily depends on the quality of the upper bound on the cardinality of a maximum k-defective clique. The state-of-the-art BnB MDCP algorithms calculate the upper bound quickly but conservatively as they ignore many possible missing edges. In this paper, we propose a novel CoLoring-based Upper Bound (CLUB) that uses graph coloring techniques to detect independent sets so as to detect missing edges ignored by the previous methods. We then develop a new BnB algorithm for MDCP, called KD-Club, using CLUB in both the preprocessing stage for graph reduction and the BnB searching process for branch pruning. Extensive experiments show that KD-Club significantly outperforms state-of-the-art BnB MDCP algorithms on the number of solved instances within the cut-off time, with much smaller search trees and shorter solving times on various benchmarks.
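
The colouring intuition behind such bounds can be seen in the unrelaxed case: a proper colouring partitions the candidate vertices into independent sets, and an exact clique can use at most one vertex per colour class, so the number of colours upper-bounds the clique size; a k-defective clique may additionally spend its budget of k missing edges inside colour classes, which is what a refined bound like CLUB accounts for. The sketch below shows only the basic greedy-colouring bound on a small hypothetical graph, not the CLUB bound itself.

```python
def greedy_coloring_bound(adj):
    """Greedy colouring of a graph given as {vertex: set(neighbours)}.
    The number of colours used upper-bounds the maximum clique size."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)   # largest degree first
    colour = {}
    for v in order:
        used = {colour[u] for u in adj[v] if u in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return max(colour.values()) + 1

# small hypothetical graph: a 4-clique {0,1,2,3} plus a pendant path 3-4-5
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5}, 5: {4},
}
print("colouring upper bound on the maximum clique size:", greedy_coloring_bound(adj))
```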



Paperid:2301
Authors:Ryo Kuroiwa, J. Christopher Beck
University of Toronto, University of Toronto
Abstract:
Domain-independent dynamic programming (DIDP), a model-based paradigm based on dynamic programming, has shown promising performance on multiple combinatorial optimization problems compared with mixed integer programming (MIP) and constraint programming (CP). The current DIDP solvers are based on heuristic search, and the state-of-the-art solver, complete anytime beam search (CABS), uses beam search. However, the current DIDP solvers cannot utilize multiple threads, unlike state-of-the-art MIP and CP solvers. In this paper, we propose three parallel beam search algorithms and develop multi-thread implementations of CABS. With 32 threads, our multi-thread DIDP solvers achieve 9 to 39 times speedup on average and significant performance improvement over the sequential solver, finding the new best solutions for two instances of the traveling salesperson problem with time windows. In addition, our solvers outperform multi-thread MIP and CP solvers in four of the six combinatorial optimization problems evaluated.



Paperid:2302
Authors:Sofia Lemons, Wheeler Ruml, Rob Holte, Carlos Linares Lopez
Earlham College University of New Hampshire, University of New Hampshire, University of Alberta, Universidad Carlos III de Madrid
Abstract:
Anytime heuristic search algorithms try to find a (potentially suboptimal) solution as quickly as possible and then work to find better and better solutions until an optimal solution is obtained or time is exhausted. The most widely known anytime search algorithms are based on best-first search. In this paper, we propose a new algorithm, rectangle search, that is instead based on beam search, a variant of breadth-first search. It repeatedly explores alternatives at all depth levels and is thus best-suited to problems featuring deep local minima. Experiments using a variety of popular search benchmarks suggest that rectangle search is competitive with fixed-width beam search and often performs better than the previous best anytime search algorithms.
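
For contrast with rectangle search, the sketch below shows plain fixed-width beam search, the breadth-first variant that the paper builds on: every level is expanded in full and only the `width` heuristically best successors are kept. The grid domain, heuristic, and width are illustrative placeholders.

```python
import heapq

def beam_search(start, goal, neighbours, h, width=3, max_depth=50):
    """Breadth-first beam search: expand the current level, keep only the
    `width` most promising successors by heuristic h, and repeat.
    Returns a (possibly suboptimal) path or None if the beam dies out."""
    level = [(start, [start])]
    seen = {start}
    for _ in range(max_depth):
        successors = []
        for node, path in level:
            if node == goal:
                return path
            for nxt in neighbours(node):
                if nxt not in seen:
                    seen.add(nxt)
                    successors.append((nxt, path + [nxt]))
        if not successors:
            return None
        level = heapq.nsmallest(width, successors, key=lambda s: h(s[0]))
    return None

# toy domain: 4-connected moves inside a 10x10 grid, goal at (9, 9)
goal = (9, 9)
def neighbours(p):
    x, y = p
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < 10 and 0 <= y + dy < 10]
h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # Manhattan-distance heuristic

path = beam_search((0, 0), goal, neighbours, h, width=2)
print("path length:", len(path) - 1)
```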



Paperid:2303
Authors:Haotian Ling, Zhihai Wang, Jie Wang
University of Science and Technology of China, University of Science and Technology of China, University of Science and Technology of China
Abstract:
Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), as they significantly tighten the dual bounds and improve the solving performance. A key problem for cuts is when to stop cuts generation, which is important for the efficiency of solving MILPs. However, many modern MILP solvers employ hard-coded heuristics to tackle this problem, which tends to neglect underlying patterns among MILPs from certain applications. To address this challenge, we formulate the cuts generation stopping problem as a reinforcement learning problem and propose a novel hybrid graph representation model (HYGRO) to learn effective stopping strategies. An appealing feature of HYGRO is that it can effectively capture both the dynamic and static features of MILPs, enabling dynamic decision-making for the stopping strategies. To the best of our knowledge, HYGRO is the first data-driven method to tackle the cuts generation stopping problem. By integrating our approach with modern solvers, experiments demonstrate that HYGRO significantly improves the efficiency of solving MILPs compared to competitive baselines, achieving up to 31% improvement.



Paperid:2304
Authors:Lu Liu, Mingyu Xiao, Yi Zhou
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
The maximum vertex-weighted clique problem (MVWCP) and the maximum edge-weighted clique problem (MEWCP) are two natural extensions of the fundamental maximum clique problem. In this paper, we systematically study MEWCP and make the following major contributions: (1) We show that MEWCP is NP-hard even when the minimum degree of the graph is n-2, in contrast to MVWCP, which is polynomial-time solvable when the minimum degree of the graph is at least n-3. This result distinguishes the complexity of the two problems for the first time. (2) To address MEWCP, we develop an efficient branch-and-bound algorithm called MEWCat with both practical and theoretical performance guarantees. In practice, MEWCat utilizes a new upper bound tighter than existing ones, which allows for more efficient pruning of branches. In theory, we prove a running-time bound of O*(1.4423^n) for MEWCat, which breaks the trivial bound of O*(2^n) in the line of practical exact MEWCP solvers for the first time. (3) Empirically, we evaluate the performance of MEWCat on various benchmark instances. The experiments demonstrate that MEWCat significantly outperforms state-of-the-art exact solvers. For instance, on 16 DIMACS graphs that the state-of-the-art solver BBEWC fails to solve within 7200 seconds, MEWCat solves all of them with an average time of less than 1000 seconds. On real-world graphs, MEWCat achieves an average speedup of over 36x.



Paperid:2305
Authors:Tianhao Lu, Chao Bian, Chao Qian
Nanjing University, Nanjing University, Nanjing University
Abstract:
Evolutionary algorithms (EAs) are widely used for multi-objective optimization due to their population-based nature. Traditional multi-objective EAs (MOEAs) generate a large set of solutions to approximate the Pareto front, leaving a decision maker (DM) with the task of selecting a preferred solution. However, this process can be inefficient and time-consuming, especially when there are many objectives or the DM has subjective preferences. To address this issue, interactive MOEAs (iMOEAs) incorporate decision making into the optimization process, i.e., they update the population with the help of the DM. In contrast to their wide application, only two theoretical works on iMOEAs exist, and they consider only interactive variants of the two simple single-objective algorithms RLS and (1+1)-EA. This paper provides the first running time analysis (the essential theoretical aspect of EAs) for practical iMOEAs. Specifically, we prove that the expected running time of the well-developed interactive NSGA-II (called R-NSGA-II) for solving the OneMinMax and OneJumpZeroJump problems is asymptotically faster than that of the traditional NSGA-II. Meanwhile, we present a variant of OneMinMax and prove that R-NSGA-II can be exponentially slower than NSGA-II on it. These results provide theoretical justification for the effectiveness of iMOEAs while identifying situations where they may fail. Experiments are also conducted to validate the theoretical results.



Paperid:2306
Authors:Kyle Mana, Fernando Acero, Stephen Mak, Parisa Zehtabi, Michael Cashmore, Daniele Magazzeni, Manuela Veloso
J.P. Morgan AI Research, J.P. Morgan AI Research University College London, University of Cambridge, J.P. Morgan AI Research, J.P. Morgan AI Research, J.P. Morgan AI Research, J.P. Morgan AI Research
Abstract:
Discrete optimization belongs to the set of NP-hard problems, spanning fields such as mixed-integer programming and combinatorial optimization. A current standard approach to solving convex discrete optimization problems is the use of cutting-plane algorithms, which reach optimal solutions by iteratively adding inequalities known as cuts to refine a feasible set. Despite the existence of a number of general-purpose cut-generating algorithms, large-scale discrete optimization problems continue to suffer from intractability. In this work, we propose a method for accelerating cutting-plane algorithms via reinforcement learning. Our approach uses learned policies as surrogates for NP-hard elements of the cut-generating procedure in a way that (i) accelerates convergence and (ii) retains guarantees of optimality. We apply our method to two types of problems where cutting-plane algorithms are commonly used: stochastic optimization and mixed-integer quadratic programming. We observe the benefits of our method when applied to Benders decomposition (stochastic optimization) and iterative loss approximation (quadratic programming), achieving up to 45% faster average convergence compared to modern alternative algorithms.



Paperid:2307
Authors:Konstantin Sidorov, Gonçalo Homem de Almeida Correia, Mathijs de Weerdt, Emir Demirović
Delft University of Technology, Delft University of Technology, Delft University of Technology, Delft University of Technology
Abstract:
People want to rely on optimization algorithms for complex decisions, but verifying the optimality of the solutions can then become a valid concern, particularly for critical decisions taken by non-experts in optimization. One example is the shortest-path problem on a network, which occurs in many contexts, from transportation to logistics to telecommunications. While the standard shortest-path problem is both solvable in polynomial time and certifiable by duality, introducing side constraints makes solving and certifying the solutions much harder. We propose a proof system for constrained shortest-path problems, which gives a set of logical rules to derive new facts about feasible solutions. The key trait of the proposed proof system is that it specifically includes high-level graph concepts within its reasoning steps (such as connectivity or path structure), in contrast to, e.g., using linear combinations of model constraints. Thus, using our proof system, we can provide a step-by-step, human-auditable explanation showing that the path given by an external solver cannot be improved. Additionally, to maximize the advantages of this setup, we propose a proof search procedure that specifically aims to find small proofs of this form using a procedure similar to A* search. We evaluate our proof system on constrained shortest-path instances generated from real-world road networks and experimentally show that we can indeed derive more interpretable proofs than an integer programming approach, in some cases leading to much smaller proofs.



Paperid:2308
Authors:Rui Sun, Zhi Zheng, Zhenkun Wang
Southern University of Science and Technology, Southern University of Science and Technology, Southern University of Science and Technology
Abstract:
Deep-reinforcement-learning (DRL) based neural combinatorial optimization (NCO) methods have demonstrated efficiency without relying on the guidance of optimal solutions. As the most mainstream among them, the learning constructive heuristic (LCH) achieves high-quality solutions through a rapid autoregressive solution construction process. However, these LCH-based methods are deficient in convergence, and there is still a performance gap compared to the optimum. Intuitively, learning to regret some steps in the solution construction process is helpful to the training efficiency and network representations. This article proposes a novel regret-based mechanism for an advanced solution construction process. Our method can be applied as a plug-in to any existing LCH-based DRL-NCO method. Experimental results demonstrate the capability of our work to enhance the performance of various NCO models. Results also show that the proposed LCH-Regret outperforms previous modification methods on several typical combinatorial optimization problems. The code and Supplementary File are available at https://github.com/SunnyR7/LCH-Regret.



Paperid:2309
Authors:Hao Tian, Sourav Medya, Wei Ye
Tongji University, UIC, Tongji University
Abstract:
Combinatorial Optimization (CO) problems over graphs appear routinely in many applications, such as optimizing traffic, viral marketing in social networks, and matching for job allocation. Due to their combinatorial nature, these problems are often NP-hard. Existing approximation algorithms and heuristics rely on searching the solution space and become time-consuming when this space is large. In this paper, we design a neural method called COMBHelper to reduce this space and thus improve the efficiency of traditional CO algorithms based on node selection. Specifically, it employs a Graph Neural Network (GNN) to identify promising nodes for the solution set. This pruned search space is then fed to the traditional CO algorithms. COMBHelper also uses a Knowledge Distillation (KD) module and a problem-specific boosting module to bring further efficiency and efficacy. Our extensive experiments show that traditional CO algorithms with COMBHelper are at least 2 times faster than their original versions.
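A minimal sketch of the pruning idea on greedy minimum vertex cover (gnn_scores stands for the output of an already-trained GNN and is a hypothetical input; the KD and boosting modules are omitted): keep only the top-scoring nodes and restrict the classic greedy routine to that reduced search space.

import networkx as nx

def prune_and_solve_vertex_cover(G: nx.Graph, gnn_scores: dict, keep_ratio: float = 0.3):
    k = max(1, int(keep_ratio * G.number_of_nodes()))
    promising = set(sorted(gnn_scores, key=gnn_scores.get, reverse=True)[:k])

    H = G.copy()
    cover = set()
    while H.number_of_edges() > 0:
        # Prefer the highest-degree node inside the pruned search space;
        # fall back to the full node set if no promising node covers an edge.
        nodes = [n for n in promising if n in H and H.degree(n) > 0] or list(H.nodes())
        best = max(nodes, key=H.degree)
        cover.add(best)
        H.remove_node(best)
    return cover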



Paperid:2310
Authors:Hao Wang
Zhejiang University
Abstract:
Enhancing the generalization performance of neural networks given limited data availability remains a formidable challenge, due to the model-selection trade-off between training error and generalization gap. To handle this challenge, we formulate a posterior optimization problem, specifically designed to reduce the generalization error of trained neural networks. To operationalize this concept, we propose a Doubly-Robust Boosting machine (DRBoost) which consists of a statistical learner and a zero-order optimizer. The statistical learner reduces the model capacity and thus the generalization gap; the zero-order optimizer minimizes the training error in a gradient-free manner. The two components cooperate to reduce the generalization error of a fully trained neural network in a doubly robust manner. Furthermore, the statistical learner alleviates the multicollinearity in the discriminative layer and enhances the generalization performance. The zero-order optimizer eliminates the reliance on gradient calculation and offers more flexibility in learning objective selection. Experiments demonstrate that DRBoost effectively improves the generalization performance of various prevalent neural network backbones.



Paperid:2311
Authors:Xiaofan Wang, Zhiyuan Deng, Changle Wang, Jinjia Wang
Yanshan University, Yanshan University, Yanshan University, Yanshan University
Abstract:
Light field microscopy is a high-speed 3D imaging technique that records the light field from multiple angles via a microlens array (MLA), thus allowing us to obtain information about the light source from a single image. For the fundamental problem of neuron localization, we improve the method of combining a depth-dependent dictionary with sparse coding in this paper. To obtain higher localization accuracy and good noise immunity, we propose an inertial proximal gradient acceleration algorithm with dry friction, Fast-IPGDF. By preventing falling into a local minimum, our algorithm achieves better convergence and converges quite fast, which improves the speed and accuracy of localizing the light source based on the matching depth of epipolar plane images (EPI). We demonstrate the effectiveness of the algorithm for localizing non-scattered fluorescent beads in both noisy and noise-free environments. The experimental results show that our method can achieve simultaneous localization of multiple point sources and effective localization in noisy environments. Compared to existing studies, our method shows significant improvements in both localization accuracy and speed.
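A minimal sketch of an inertial proximal gradient step for the underlying LASSO-type sparse-coding subproblem min_x 0.5*||Dx - y||^2 + lam*||x||_1 (the dry-friction term and the specific acceleration of Fast-IPGDF are omitted; beta is a fixed illustrative momentum parameter).

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def inertial_proximal_gradient(D, y, lam, n_iter=500, beta=0.9):
    # Step size 1/L, where L = ||D||_2^2 is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    x = x_prev = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x + beta * (x - x_prev)            # inertial extrapolation
        grad = D.T @ (D @ z - y)               # gradient of the smooth part at z
        x_prev, x = x, soft_threshold(z - step * grad, step * lam)
    return x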



Paperid:2312
Authors:Xiaoyang Xu, Hu Ding
University of Science and Technology of China, University of Science and Technology of China
Abstract:
Optimal transport is a fundamental topic that has attracted a great amount of attention from the optimization community in the past decades. In this paper, we consider an interesting discrete dynamic optimal transport problem: can we efficiently update the optimal transport plan when the weights or the locations of the data points change? This problem is naturally motivated by several applications in machine learning. For example, we often need to compute the optimal transport cost between two different data sets; if some changes happen to a few data points, should we re-compute the high-complexity cost function or update the cost by some efficient dynamic data structure? Several dynamic maximum flow algorithms have been proposed before; however, research on the dynamic minimum cost flow problem is still quite limited, to the best of our knowledge. We propose a novel 2D Skip Orthogonal List together with some dynamic tree techniques. Although our algorithm is based on the conventional simplex method, it can efficiently find the variable to pivot within expected O(1) time and complete each pivoting operation within expected O(|V|) time, where V is the set of all supply and demand nodes. Since dynamic modifications typically do not introduce significant changes, our algorithm requires only a few simplex iterations in practice. Thus our algorithm is more efficient than re-computing the optimal transport cost, which needs at least one traversal over all |E|=O(|V|^2) variables, where |E| denotes the number of edges in the network. Our experiments demonstrate that our algorithm significantly outperforms existing algorithms in dynamic scenarios.



Paperid:2313
Authors:Yaoguang Zhai, Zhizhen Qin, Sicun Gao
University of California, San Diego, University of California, San Diego, University of California, San Diego
Abstract:
Standard approaches for global optimization of non-convex functions, such as branch-and-bound, maintain partition trees to systematically prune the domain. The tree size grows exponentially in the number of dimensions. We propose new sampling-based methods for non-convex optimization that adapt Monte Carlo Tree Search (MCTS) to improve efficiency. Instead of the standard use of visitation counts in Upper Confidence Bounds, we utilize numerical overapproximations of the objective as an uncertainty metric and also take into account sampled estimates of first-order and second-order information. The Monte Carlo tree in our approach avoids the usual fixed combinatorial patterns in growing the tree and aggressively zooms into promising regions, while still balancing exploration and exploitation. We evaluate the proposed algorithms on high-dimensional non-convex optimization benchmarks against competitive baselines and analyze the effects of the hyperparameters.



Paperid:2314
Authors:Qingyun Zhang, Yuming Du, Zhouxing Su, Chu-Min Li, Junzhou Xu, Zhihuai Chen, Zhipeng Lü
School of Computer Science and Technology, Huazhong University of Science and Technology, China, School of Computer Science and Technology, Huazhong University of Science and Technology, China, School of Computer Science and Technology, Huazhong University of Science and Technology, China, MIS, University of Picardie Jules Verne, France, TCS Lab, Huawei Technologies Co., Ltd., China, TCS Lab, Huawei Technologies Co., Ltd., China, School of Computer Science and Technology, Huazhong University of Science and Technology, China
Abstract:
As a classical NP-hard problem and the topic of the PACE 2022 competition, the directed feedback vertex set problem (DFVSP) aims to find a minimum subset of vertices such that, when the vertices in the subset and all their adjacent edges are removed from the directed graph, the remaining graph is acyclic. In this paper, we propose a threshold-based responsive simulated annealing algorithm called TRSA for solving DFVSP. First, we simplify the problem instances with two new reduction rules proposed in this paper and eight reduction rules from the literature. Then, based on a new solution representation, TRSA solves DFVSP with a fast local search procedure featuring a swap-based neighborhood structure and three neighborhood acceleration strategies. Finally, all these strategies are incorporated into a threshold-based responsive simulated annealing framework. Computational experiments on 140 benchmark instances show that TRSA is highly competitive compared to the state-of-the-art methods. Specifically, TRSA improves the best known results for 53 instances while matching the best known results for 79 others. Furthermore, some important features of TRSA are analyzed to identify its success factors.
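A minimal sketch of a deterministic threshold-acceptance loop of the kind such local searches build on (TRSA's responsive threshold control, swap-based neighborhood, and acceleration strategies are not shown; cost and random_neighbor are placeholders).

def threshold_accepting(init_sol, cost, random_neighbor,
                        threshold0=10.0, decay=0.999, n_iter=100_000):
    # A worsening move is accepted as long as the cost degradation stays below
    # a slowly shrinking threshold; improving moves are always accepted.
    current, current_cost = init_sol, cost(init_sol)
    best, best_cost = current, current_cost
    threshold = threshold0
    for _ in range(n_iter):
        cand = random_neighbor(current)
        cand_cost = cost(cand)
        if cand_cost - current_cost < threshold:
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        threshold *= decay
    return best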



Paperid:2315
Authors:Xuan Zhang, Gabriel Mancino-Ball, Necdet Serhat Aybat, Yangyang Xu
Pennsylvania State University, Rensselaer Polytechnic Institute, Pennsylvania State University, Rensselaer Polytechnic Institute
Abstract:
We propose a novel single-loop decentralized algorithm, DGDA-VR, for solving stochastic nonconvex strongly-concave minimax problems over a connected network of agents, which are equipped with stochastic first-order oracles to estimate their local gradients. DGDA-VR, incorporating variance reduction, achieves O(ε^−3) oracle complexity and O(ε^−2) communication complexity without resorting to multiple communication rounds; both are optimal, i.e., they match the lower bounds for this class of problems. Since DGDA-VR does not require multiple communication rounds, it is applicable to a broader range of decentralized computational environments. To the best of our knowledge, this is the first distributed method using a single communication round in each iteration to jointly optimize the oracle and communication complexities for the problem considered here.



Paperid:2316
Authors:Weijie Zheng, Benjamin Doerr
Harbin Institute of Technology, Shenzhen, École Polytechnique
Abstract:
The widely used multi-objective optimizer NSGA-II was recently proven to have considerable difficulties in many-objective optimization. In contrast, experimental results in the literature show a good performance of the SMS-EMOA, which can be seen as a steady-state NSGA-II that uses the hypervolume contribution instead of the crowding distance as the second selection criterion. This paper conducts the first rigorous runtime analysis of the SMS-EMOA for many-objective optimization. To this aim, we first propose a many-objective counterpart, the m-objective mOJZJ problem, of the bi-objective OJZJ benchmark, which is the first many-objective multimodal benchmark used in a mathematical runtime analysis. We prove that SMS-EMOA computes the full Pareto front of this benchmark in an expected number of O(M^2 n^k) iterations, where n denotes the problem size (length of the bit-string representation), k the gap size (a difficulty parameter of the problem), and M=(2n/m-2k+3)^(m/2) the size of the Pareto front. This result, together with the existing negative result on the original NSGA-II, shows that in principle the general approach of the NSGA-II is suitable for many-objective optimization, but the crowding distance as tie-breaker has deficiencies. We obtain three additional insights on the SMS-EMOA. Different from a recent result for the bi-objective OJZJ benchmark, the stochastic population update often does not help for mOJZJ. It results in a 1/Θ(min(Mk^(1/2)/2^(k/2),1)) speed-up, which is Θ(1) for large m such as m>k. On the positive side, we prove that heavy-tailed mutation still results in a speed-up of order k^(0.5+k-β). Finally, we conduct the first runtime analyses of the SMS-EMOA on the bi-objective OneMinMax and LOTZ benchmarks and show that it has a performance comparable to the GSEMO and the NSGA-II.



Paperid:2317
Authors:Weijie Zheng, Mingfeng Li, Renzhong Deng, Benjamin Doerr
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, École Polytechnique
Abstract:
The Metropolis algorithm can cope with local optima by accepting inferior solutions with suitably small probability. That this can work well was not only observed in empirical research, but also via mathematical runtime analyses on single-objective benchmarks. This paper takes several steps towards understanding, again via theoretical means, whether such advantages can also be obtained in multi-objective optimization. The original Metropolis algorithm has two components, one-bit mutation and the acceptance strategy, which allows accepting inferior solutions. When the acceptance strategy is adjusted to multi-objective optimization so that an accepted inferior solution replaces its parent, the Metropolis algorithm is not very efficient on our multi-objective version of the multimodal DLB benchmark, called DLTB. With one-bit mutation, this multi-objective Metropolis algorithm cannot optimize the DLTB problem; with standard bit-wise mutation it needs at least Ω(n^5) time to cover the full Pareto front. In contrast, we show that many other multi-objective optimizers, namely the GSEMO, SMS-EMOA, and NSGA-II, only need time O(n^4). When the parent is kept when an inferior point is accepted, the multi-objective Metropolis algorithm with either one-bit or standard bit-wise mutation solves the DLTB problem efficiently, with one-bit mutation experimentally leading to better results than several other algorithms. Overall, our work suggests that the general mechanism of the Metropolis algorithm can be interesting in multi-objective optimization, but that the implementation details can have a huge impact on the performance.
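A minimal sketch contrasting the two acceptance variants discussed above for a multi-objective maximization problem (the exact quantity used to penalize inferior offspring is not specified in the abstract, so the dominance gap below is an illustrative choice; mutate and f are placeholders).

import math
import random

def dominates(a, b):
    # Pareto dominance for maximization of objective vectors a and b.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def metropolis_step(parent, f, mutate, temp, keep_parent=True):
    # A non-dominated offspring is always accepted; a dominated one is accepted
    # with probability exp(-gap/temp). With keep_parent=True the parent also
    # survives, otherwise the accepted offspring replaces it.
    child = mutate(parent)
    fp, fc = f(parent), f(child)
    if dominates(fp, fc):
        gap = sum(max(0.0, p - c) for p, c in zip(fp, fc))
        accepted = random.random() < math.exp(-gap / temp)
    else:
        accepted = True
    if not accepted:
        return [parent]
    return [parent, child] if keep_parent else [child]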



Paperid:2318
Authors:Qingling Zhu, Xiaoqiang Wu, Qiuzhen Lin, Wei-Neng Chen
Shenzhen University, Shenzhen University, Shenzhen University, South China University of Technology
Abstract:
The integration of Evolutionary Algorithms (EA) and Reinforcement Learning (RL) has emerged as a promising approach for tackling some challenges in RL, such as sparse rewards, lack of exploration, and brittle convergence properties. However, existing methods often employ actor networks as the individuals of the EA, which may constrain their exploratory capabilities, as the entire actor population stops evolving when the critic network in RL falls into a local optimum. To alleviate this issue, this paper introduces a Two-stage Evolutionary Reinforcement Learning (TERL) framework that maintains a population containing both actor and critic networks. TERL divides the learning process into two stages. In the initial stage, individuals independently learn actor-critic networks, which are optimized alternately by RL and Particle Swarm Optimization (PSO). This dual optimization fosters greater exploration, curbing susceptibility to local optima. Shared information from a common replay buffer and the PSO algorithm substantially mitigates the computational load of training multiple agents. In the subsequent stage, TERL shifts to a refined exploitation phase. Here, only the best individual undergoes further refinement, while the remaining individuals continue PSO-based optimization. This allocates more computational resources to the best individual to yield superior performance. Empirical assessments, conducted across a range of continuous control problems, validate the efficacy of the proposed TERL paradigm.



Paperid:2319
Authors:Eslam Abdelrahman, Pengzhan Sun, Li Erran Li, Mohamed Elhoseiny
King Abdullah University of Science and Technology (KAUST), National University of Singapore, AWS AI, Amazon, King Abdullah University of Science and Technology (KAUST)
Abstract:
Most pre-trained learning systems are known to suffer from bias, which typically emerges from the data, the model, or both. Measuring and quantifying bias and its sources is a challenging task and has been extensively studied in image captioning. Despite the significant effort in this direction, we observe that existing metrics lack consistency in the inclusion of the visual signal. In this paper, we introduce a new bias assessment metric, dubbed ImageCaptioner2, for image captioning. Instead of measuring the absolute bias in the model or the data, ImageCaptioner2 pays more attention to the bias introduced by the model w.r.t. the data bias, termed bias amplification. Unlike existing methods, which evaluate image captioning algorithms based only on the generated captions, ImageCaptioner2 incorporates the image while measuring the bias. In addition, we design a formulation for measuring the bias of generated captions as prompt-based image captioning instead of using language classifiers. Finally, we apply our ImageCaptioner2 metric across 11 different image captioning architectures on three different datasets, i.e., the MS-COCO caption dataset, Artemis V1, and Artemis V2, and on three different protected attributes, i.e., gender, race, and emotions. Consequently, we verify the effectiveness of our ImageCaptioner2 metric by proposing Anonymous-Bench, which is a novel human evaluation paradigm for bias metrics. Our metric shows significant superiority over the recent bias metric LIC in terms of human alignment, with correlation scores of 80% and 54% for our metric and LIC, respectively. The code and more details are available at https://eslambakr.github.io/imagecaptioner2.github.io/.



Paperid:2320
Authors:Kevin-Martin Aigner, Marc Goerigk, Michael Hartisch, Frauke Liers, Arthur Miehlich
Friedrich-Alexander Universität Erlangen-Nürnberg, University of Passau, University of Siegen, Friedrich-Alexander Universität Erlangen-Nürnberg, Friedrich-Alexander Universität Erlangen-Nürnberg
Abstract:
Advancements in mathematical programming have made it possible to efficiently tackle large-scale real-world problems that were deemed intractable just a few decades ago. However, provably optimal solutions may not be accepted due to the perception of optimization software as a black box. Although well understood by scientists, these solutions are not easily accessible to practitioners. Hence, we advocate for introducing the explainability of a solution as another evaluation criterion, next to its objective value, which enables us to find trade-off solutions between these two criteria. Explainability is attained by comparing against (not necessarily optimal) solutions that were implemented in similar situations in the past. Thus, solutions are preferred that exhibit similar features. Although we prove that already in simple cases the explainable model is NP-hard, we characterize relevant polynomially solvable cases such as the explainable shortest-path problem. Our numerical experiments on both artificial as well as real-world road networks show the resulting Pareto front. It turns out that the cost of enforcing explainability can be very small.



Paperid:2321
Authors:Kasun Amarasinghe, Kit T. Rodolfa, Sérgio Jesus, Valerie Chen, Vladimir Balayan, Pedro Saleiro, Pedro Bizarro, Ameet Talwalkar, Rayid Ghani
Carnegie Mellon University, Pittsburgh, PA, Stanford University, Palo Alto, CA, Feedzai, Lisboa, Portugal, Carnegie Mellon University, Pittsburgh, PA, Feedzai, Lisboa, Portugal, Feedzai, Lisboa, Portugal, Feedzai, Lisboa, Portugal, Carnegie Mellon University, Pittsburgh, PA, Carnegie Mellon University, Pittsburgh, PA
Abstract:
Most existing evaluations of explainable machine learning (ML) methods rely on simplifying assumptions or proxies that do not reflect real-world use cases; the handful of more robust evaluations in real-world settings have shortcomings in their design, generally leading to overestimation of methods' real-world utility. In this work, we seek to address this by conducting a study that evaluates post-hoc explainable ML methods in a setting consistent with the application context and provide a template for future evaluation studies. We modify and improve a prior study on e-commerce fraud detection by relaxing the original work's simplifying assumptions that departed from the deployment context. Our study finds no evidence for the utility of the tested explainable ML methods in this context, which is a drastically different conclusion from the earlier work. This highlights how seemingly trivial experimental design choices can yield misleading conclusions about method utility. In addition, our work carries lessons about the necessity of not only evaluating explainable ML methods using tasks, data, users, and metrics grounded in the intended application context but also developing methods tailored to specific applications, moving beyond general-purpose explainable ML methods.



Paperid:2322
Authors:Jose A. Ayala-Romero, Andres Garcia-Saavedra, Xavier Costa-Perez
NEC Laboratories Europe, NEC Laboratories Europe, ICREA and i2CAT NEC Laboratories Europe
Abstract:
Recent advances in learning techniques have garnered attention for their applicability to a diverse range of real-world sequential decision-making problems. Yet, many practical applications have critical constraints for operation in real environments. Most learning solutions often neglect the risk of failing to meet these constraints, hindering their implementation in real-world contexts. In this paper, we propose a risk-aware decision-making framework for contextual bandit problems, accommodating constraints and continuous action spaces. Our approach employs an actor multi-critic architecture, with each critic characterizing the distribution of performance and constraint metrics. Our framework is designed to cater to various risk levels, effectively balancing constraint satisfaction against performance. To demonstrate the effectiveness of our approach, we first compare it against state-of-the-art baseline methods in a synthetic environment, highlighting the impact of intrinsic environmental noise across different risk configurations. Finally, we evaluate our framework in a real-world use case involving a 5G mobile network, where only our approach consistently satisfies the system constraint (a signal processing reliability target) with a small performance toll (an 8.5% increase in power consumption).



Paperid:2323
Authors:Daniel Bethell, Simos Gerasimou, Radu Calinescu
University of York, University of York, University of York
Abstract:
Deploying deep learning models in safety-critical applications remains a very challenging task, mandating the provision of assurances for the dependable operation of these models. Uncertainty quantification (UQ) methods estimate the model's confidence per prediction, informing decision-making by considering the effect of randomness and model misspecification. Despite the advances of state-of-the-art UQ methods, they are computationally expensive or produce conservative prediction sets/intervals. We introduce MC-CP, a novel hybrid UQ method that combines a new adaptive Monte Carlo (MC) dropout method with conformal prediction (CP). MC-CP adaptively modulates the traditional MC dropout at runtime to save memory and computation resources, enabling predictions to be consumed by CP and yielding robust prediction sets/intervals. Through comprehensive experiments, we show that MC-CP delivers significant improvements over comparable UQ methods, like MC dropout, RAPS and CQR, in both classification and regression benchmarks. MC-CP can be easily added to existing models, making its deployment simple. The MC-CP code and replication package are available at https://github.com/team-daniel/MC-CP.
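A minimal sketch of the two ingredients MC-CP combines, shown here in their plain (non-adaptive) form: averaging stochastic forward passes with dropout active, then forming split-conformal prediction sets from a held-out calibration set. model is any PyTorch classifier containing dropout layers.

import torch

def mc_dropout_probs(model, x, n_samples=30):
    # Keep dropout active at inference time; a careful implementation would
    # switch only the dropout layers to train mode (this also affects batch norm).
    model.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(dim=0)

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Split conformal: quantile of the score 1 - p(true class) on calibration data.
    scores = 1.0 - cal_probs[torch.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    return torch.quantile(scores, min(1.0, (n + 1) * (1 - alpha) / n))

def prediction_sets(test_probs, qhat):
    # All classes whose score stays below the calibrated threshold.
    return [torch.nonzero(1.0 - p <= qhat).flatten().tolist() for p in test_probs]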



Paperid:2324
Authors:Chengtai Cao, Xinhong Chen, Jianping Wang, Qun Song, Rui Tan, Yung-Hui Li
City University of Hong Kong, City University of Hong Kong, City University of Hong Kong, Delft University of Technology, Nanyang Technological University, Hon Hai Research Institute
Abstract:
Autonomous driving systems rely on precise trajectory prediction for safe and efficient motion planning. Despite considerable efforts to enhance prediction accuracy, inherent uncertainties persist due to data noise and incomplete observations. Many strategies entail formalizing prediction outcomes into distributions and utilizing variance to represent uncertainty. However, our experimental investigation reveals that existing trajectory prediction models yield unreliable uncertainty estimates, necessitating additional customized calibration processes. On the other hand, directly applying current calibration techniques to prediction outputs may yield suboptimal results due to using a universal scaler for all predictions and neglecting informative data cues. In this paper, we propose Customized Calibration Temperature with Regularizer (CCTR), a generic framework that calibrates the output distribution. Specifically, CCTR 1) employs a calibration-based regularizer to align output variance with the discrepancy between prediction and ground truth and 2) generates a tailor-made temperature scaler for each prediction using a post-processing network guided by context and historical information. Extensive evaluation involving multiple prediction and planning methods demonstrates the superiority of CCTR over existing calibration algorithms and uncertainty-aware methods, with significant improvements of 11%-22% in calibration quality and 17%-46% in motion planning.
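For context, a minimal sketch of plain global temperature scaling of a Gaussian predictor's variance, which is the kind of universal scaler CCTR moves away from; the per-sample temperature network and the calibration-based regularizer are not shown, and the Gaussian-NLL formulation is an illustrative assumption.

import torch

def fit_variance_temperature(mu, sigma, target, n_steps=200, lr=0.01):
    # Fit a single scalar T so that the rescaled std T*sigma minimizes the
    # Gaussian negative log-likelihood on a validation set.
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        scaled = sigma * log_t.exp()
        nll = (0.5 * ((target - mu) / scaled) ** 2 + torch.log(scaled)).mean()
        nll.backward()
        opt.step()
    return log_t.exp().item()

# usage: T = fit_variance_temperature(val_mu, val_sigma, val_gt)
#        calibrated_sigma = T * test_sigma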



Paperid:2325
Authors:Haotian Chen, Lingwei Zhang, Yiran Liu, Yang Yu
Fudan University, Johns Hopkins University, Tsinghua University, Tsinghua University
Abstract:
While large language models (LLMs) exhibit impressive performance on a wide range of NLP tasks, most of them fail to learn causality from correlation, which prevents them from learning rationales for prediction. Rethinking the whole development process of LLMs is of great urgency as they are adopted in various critical tasks that need rationales, including legal text prediction (e.g., legal judgment prediction). In this paper, we first explain the underlying theoretical mechanism of their failure and argue that both the data imbalance and the omission of causality in model design and selection render the current training-testing paradigm unable to select the unique causality-based model from correlation-based models. Second, we take the legal text prediction task as the testbed and reconstruct the development process of LLMs by simultaneously infusing causality into model architectures and organizing causality-based adversarial attacks for evaluation. Specifically, we base our reconstruction on our theoretical analysis and propose a causality-aware self-attention mechanism (CASAM), which prevents LLMs from entangling causal and non-causal information by restricting the interaction between causal and non-causal words. Meanwhile, we propose eight kinds of legal-specific attacks to form causality-based model selection. Our extensive experimental results demonstrate that our proposed CASAM achieves state-of-the-art (SOTA) performance and the strongest robustness on three commonly used legal text prediction benchmarks. We make our code publicly available at https://github.com/Carrot-Red/Rethink-LLM-development.



Paperid:2326
Authors:Zhongzhi Chen, Xingwu Sun, Xianfeng Jiao, Fengzong Lian, Zhanhui Kang, Di Wang, Chengzhong Xu
Beihang University Tencent Inc., Tencent Inc. University of Macau, Tencent Inc., Tencent Inc., Tencent Inc., Tencent Inc., University of Macau
Abstract:
Despite the great success of large language models (LLMs) in various tasks, they suffer from generating hallucinations. We introduce Truth Forest, a method that enhances truthfulness in LLMs by uncovering hidden truth representations using multidimensional orthogonal probes. Specifically, it creates multiple orthogonal bases for modeling truth by incorporating orthogonal constraints into the probes. Moreover, we introduce Random Peek, a systematic technique considering an extended range of positions within the sequence, reducing the gap between discerning and generating truth features in LLMs. By employing this approach, we improved the truthfulness of Llama-2-7B from 40.8% to 74.5% on TruthfulQA. Likewise, significant improvements are observed in fine-tuned models. We conducted a thorough analysis of truth features using probes. Our visualization results show that orthogonal probes capture complementary truth-related features, forming well-defined clusters that reveal the inherent structure of the dataset.



Paperid:2327
Authors:Minjae Cho, Chuangchuang Sun
Mississippi State University, Mississippi State University
Abstract:
Despite remarkable achievements in artificial intelligence, the deployability of learning-enabled systems in high-stakes real-world environments still faces persistent challenges. For example, in safety-critical domains like autonomous driving, robotic manipulation, and healthcare, it is crucial not only to achieve high performance but also to comply with given constraints. Furthermore, adaptability becomes paramount in non-stationary domains, where environmental parameters are subject to change. While safety and adaptability are recognized as key qualities for the new generation of AI, current approaches have not demonstrated effective adaptable performance in constrained settings. Hence, this paper breaks new ground by studying the unique challenges of ensuring safety in non-stationary environments by solving constrained problems through the lens of the meta-learning approach (learning to learn). While unconstrained meta-learning already encounters complexities in end-to-end differentiation of the loss due to its bi-level nature, its constrained counterpart introduces an additional layer of difficulty, since the constraints imposed on task-level updates complicate the differentiation process. To address the issue, we first employ successive convex-constrained policy updates across multiple tasks with differentiable convex programming, which allows meta-learning in constrained scenarios by enabling end-to-end differentiation. This approach empowers the agent to rapidly adapt to new tasks under non-stationarity while ensuring compliance with safety constraints. We also provide a theoretical analysis demonstrating guaranteed monotonic improvement of our approach, justifying our algorithmic designs. Extensive simulations across diverse environments provide empirical validation with significant improvement over established benchmarks.



Paperid:2328
Authors:Matthew Cleaveland, Insup Lee, George J. Pappas, Lars Lindemann
University of Pennsylvania, University of Pennsylvania, University of Pennsylvania, University of Southern California
Abstract:
Conformal prediction is a statistical tool for producing prediction regions of machine learning models that are valid with high probability. However, applying conformal prediction to time series data leads to conservative prediction regions. In fact, to obtain prediction regions over T time steps with confidence 1-δ, previous works require that each individual prediction region is valid with confidence 1-δ/T. We propose an optimization-based method for reducing this conservatism to enable long-horizon planning and verification when using learning-enabled time series predictors. Instead of considering prediction errors individually at each time step, we consider a parameterized prediction error over multiple time steps. By optimizing the parameters over an additional dataset, we find prediction regions that are not conservative. We show that this problem can be cast as a mixed-integer linear complementarity program (MILCP), which we then relax into a linear complementarity program (LCP). Additionally, we prove that the relaxed LCP has the same optimal cost as the original MILCP. Finally, we demonstrate the efficacy of our method on case studies using pedestrian trajectory predictors and F-16 fighter jet altitude predictors.



Paperid:2329
Authors:Seffi Cohen, Ofir Arbili, Yisroel Mirsky, Lior Rokach
Ben-Gurion University of the Negev, Ben-Gurion University of the Negev, Ben-Gurion University of the Negev, Ben-Gurion University of the Negev
Abstract:
Decision trees are widely used for addressing learning tasks involving tabular data. Yet, they are susceptible to adversarial attacks. In this paper, we present Tree Test Time Simulation (TTTS), a novel inference-time methodology that incorporates Monte Carlo simulations into decision trees to enhance their robustness. TTTS introduces a probabilistic modification to the decision path, without altering the underlying tree structure. Our comprehensive empirical analysis of 50 datasets yields promising results. Without the presence of any attacks, TTTS improves model performance from an AUC of 0.714 to 0.773. Under the challenging conditions of white-box attacks, TTTS demonstrates its robustness by boosting performance from an AUC of 0.337 to 0.680. Even when subjected to black-box attacks, TTTS maintains high accuracy and enhances the model's performance from an AUC of 0.628 to 0.719. Compared to defenses such as Feature Squeezing, TTTS proves to be much more effective. We also found that TTTS exhibits similar robustness in decision forest settings across different attacks.
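A minimal sketch of the inference-time idea on a fitted scikit-learn tree (the flip probability and simulation count are illustrative, not the authors' exact scheme): each split decision is occasionally followed in the opposite direction, and the visited leaf distributions are averaged.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simulated_predict_proba(clf: DecisionTreeClassifier, x, n_sim=50, flip_p=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    tree = clf.tree_
    probs = np.zeros(clf.n_classes_)
    for _ in range(n_sim):
        node = 0
        while tree.children_left[node] != -1:            # -1 marks a leaf
            go_left = x[tree.feature[node]] <= tree.threshold[node]
            if rng.random() < flip_p:
                go_left = not go_left                    # probabilistic perturbation
            node = tree.children_left[node] if go_left else tree.children_right[node]
        leaf_value = tree.value[node][0]
        probs += leaf_value / leaf_value.sum()
    return probs / n_sim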



Paperid:2330
Authors:Carl De Sousa Trias, Mihai Petru Mitrea, Attilio Fiandrotti, Marco Cagnazzo, Sumanta Chaudhuri, Enzo Tartaglione
Télécom SudParis, Institut Polytechnique de Paris, France, Télécom SudParis, Institut Polytechnique de Paris, France, University of Turin, Italy, University of Padua, Italy, LTCI, Télécom Paris, Institut Polytechnique de Paris, France, LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Abstract:
Deep neural networks are characterized by multiple symmetrical, equi-loss solutions that are redundant. Thus, the order of neurons in a layer and of feature maps can be given arbitrary permutations without affecting (or minimally affecting) their output. If we shuffle these neurons, or apply perturbations to them (such as fine-tuning), can we put them back in the original order, i.e., re-synchronize them? Is there a possible corruption threat? Answering these questions is important for applications like neural network white-box watermarking for ownership tracking and integrity verification. We advance a method to re-synchronize the order of permuted neurons. Our method is also effective if the neurons are further altered by parameter pruning, quantization, and fine-tuning, showing robustness to integrity attacks. Additionally, we provide theoretical and practical evidence regarding the usual means of corrupting the integrity of the model, resulting in a solution to counter it. We test our approach on popular computer vision datasets and models, and we illustrate the threat and our countermeasure on a popular white-box watermarking method.



Paperid:2331
Authors:Virginie Debauche, Alec Edwards, Raphaël M. Jungers, Alessandro Abate
ICTEAM Institute, UCLouvain, Louvain-la-Neuve, Belgium, Department of Computer Science, University of Oxford, United Kingdom, ICTEAM Institute, UCLouvain, Louvain-la-Neuve, Belgium, Department of Computer Science, University of Oxford, United Kingdom
Abstract:
Neural-based, data-driven analysis and control of dynamical systems have been recently investigated and have shown great promise, e.g., for safety verification or stability analysis. Indeed, not only do neural networks allow for an entirely model-free, data-driven approach, but they also handle arbitrarily complex functions via their power of representation (as opposed to, e.g., algebraic optimization techniques that are restricted to polynomial functions). Whilst classical Lyapunov techniques allow one to provide a formal and robust guarantee of stability of a switched dynamical system, very little is yet known about correctness guarantees for Neural Lyapunov functions, or about their performance (amount of data needed for a certain accuracy). We formally introduce Neural Lyapunov functions for the stability analysis of switched linear systems: we benchmark them on this paradigmatic problem, which is notoriously difficult (and in general Turing-undecidable), but which admits existing recently-developed technologies and theoretical results. Inspired by switched systems theory, we provide theoretical guarantees on the representative power of neural networks, leveraging recent results from the ML community. We additionally experimentally demonstrate how Neural Lyapunov functions compete with state-of-the-art results and techniques, while admitting a wide range of improvement, both in theory and in practice. This study intends to improve our understanding of the opportunities and current limitations of neural-based data-driven analysis and control of complex dynamical systems.



Paperid:2332
Authors:Laurens Devos, Lorenzo Cascioli, Jesse Davis
KU Leuven, KU Leuven, KU Leuven
Abstract:
Tree ensembles are one of the most widely used model classes. However, these models are susceptible to adversarial examples, which are slightly perturbed examples that elicit a misprediction. There has been significant research on designing approaches to verify the robustness of tree ensembles to such attacks. However, existing verification algorithms for tree ensembles are only able to analyze binary classifiers and hence address multiclass problems by reducing them to binary ones using a one-versus-other strategy. In this paper, we show that naively applying this strategy can yield incorrect results in certain situations. We address this shortcoming by proposing a novel approximate heuristic approach to verification for multiclass tree ensembles. Our approach is based on a novel generalization of the verification task, which we show admits other relevant verification queries.



Paperid:2333
Authors:Sumanta Dey, Pallab Dasgupta, Soumyajit Dey
Indian Institute of Technology Kharagpur, Synopsys, Indian Institute of Technology Kharagpur
Abstract:
Safe Reinforcement Learning (SRL) algorithms aim to learn a policy that maximizes the reward while satisfying the safety constraints. One of the challenges in SRL is that it is often difficult to balance the two objectives of reward maximization and safety constraint satisfaction. Existing algorithms utilize constrained optimization techniques such as penalty-based, barrier-penalty-based, and Lagrangian-based dual or primal policy optimization methods. However, they suffer from training oscillations and approximation errors, which impact the overall learning objectives. This paper proposes the Permeable Penalty Barrier-based Policy Optimization (P2BPO) algorithm, which addresses this issue by allowing a small fraction of penalty beyond the penalty barrier, with a parameter controlling this permeability. In addition, an adaptive penalty parameter is used instead of a constant one; it is initialized with a low value and increased gradually as the agent violates the safety constraints. We also provide a theoretical proof of the proposed method's performance guarantee bound, which ensures that P2BPO can learn a policy satisfying the safety constraints with high probability while achieving a higher expected reward. Furthermore, we compare P2BPO with other SRL algorithms on various SRL tasks and demonstrate that it achieves better rewards while adhering to the constraints.



Paperid:2334
Authors:Mischa Dombrowski, Hadrien Reynaud, Johanna P. Müller, Matthew Baugh, Bernhard Kainz
Friedrich-Alexander-Universität Erlangen-Nürnberg, DE, Imperial College London, UK, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE, Imperial College London, UK, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Imperial College London, UK
Abstract:
Recent advancements in diffusion models have significantly impacted the trajectory of generative machine learning research, with many adopting the strategy of fine-tuning pre-trained models using domain-specific text-to-image datasets. Notably, this method has been readily employed for medical applications, such as X-ray image synthesis, leveraging the plethora of associated radiology reports. Yet, a prevailing concern is the lack of assurance on whether these models genuinely comprehend their generated content. With the evolution of text conditional image generation, these models have grown potent enough to facilitate object localization scrutiny. Our research underscores this advancement in the critical realm of medical imaging, emphasizing the crucial role of interpretability. We further unravel a consequential trade-off between image fidelity – as gauged by conventional metrics – and model interpretability in generative diffusion models. Specifically, the adoption of learnable text encoders when fine-tuning results in diminished interpretability. Our in-depth exploration uncovers the underlying factors responsible for this divergence. Consequently, we present a set of design principles for the development of truly interpretable generative models. Code is available at https://github.com/MischaD/chest-distillation.



Paperid:2335
Authors:Maximilian Dreyer, Frederik Pahde, Christopher J. Anders, Wojciech Samek, Sebastian Lapuschkin
Fraunhofer Heinrich Hertz Institute, Fraunhofer Heinrich Hertz Institute, Technical University of Berlin, Fraunhofer Heinrich Hertz Institute Technical University of Berlin BIFOLD – Berlin Institute for the Foundations of Learning and Data, Fraunhofer Heinrich Hertz Institute
Abstract:
Deep Neural Networks are prone to learning spurious correlations embedded in the training data, leading to potentially biased predictions. This poses risks when deploying these models for high-stakes decision-making, such as in medical applications. Current methods for post-hoc model correction either require input-level annotations, which are only possible for spatially localized biases, or augment the latent feature space, thereby hoping to enforce the right reasons. We present a novel method for model correction on the concept level that explicitly reduces model sensitivity towards biases via gradient penalization. When modeling biases via Concept Activation Vectors, we highlight the importance of choosing robust directions, as traditional regression-based approaches such as Support Vector Machines tend to result in diverging directions. We effectively mitigate biases in controlled and real-world settings on the ISIC, Bone Age, ImageNet and CelebA datasets using VGG, ResNet and EfficientNet architectures. Code and Appendix are available at https://github.com/frederikpahde/rrclarc.



Paperid:2336
Authors:Hasan Ferit Eniser, Valentin Wüstholz, Maria Christakis
MPI-SWS, Germany, ConsenSys, Austria, TU Wien, Austria
Abstract:
Large language models are becoming increasingly practical for translating code across programming languages, a process known as transpiling. Even though automated transpilation significantly boosts developer productivity, a key concern is whether the generated code is correct. Existing work initially used manually crafted test suites to test the translations of a small corpus of programs; these test suites were later automated. In contrast, we devise the first approach for automated, functional, property-based testing of code translation models. Our general, user-provided specifications about the transpiled code capture a range of properties, from purely syntactic to purely semantic ones. As shown by our experiments, this approach is very effective in detecting property violations in popular code translation models, and therefore, in evaluating model quality with respect to given properties. We also go a step further and explore the usage scenario where a user simply aims to obtain a correct translation of some code with respect to certain properties without necessarily being concerned about the overall quality of the model. To this purpose, we develop the first property-guided search procedure for code translation models, where a model is repeatedly queried with slightly different parameters to produce alternative and potentially more correct translations. Our results show that this search procedure helps to obtain significantly better code translations.
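A minimal sketch of the property-based idea using the Hypothesis library (illustrative only; the user-provided specifications in the paper are more general than plain input-output equivalence, and transpiled_fn stands in for the model-generated translation wrapped as a callable).

from hypothesis import given, strategies as st

def source_fn(xs):
    # Reference implementation in the source language (here simply Python).
    return sorted(set(xs))

def transpiled_fn(xs):
    # Stand-in for the model-generated translation.
    out = []
    for x in xs:
        if x not in out:
            out.append(x)
    return sorted(out)

@given(st.lists(st.integers(), max_size=50))
def test_translation_preserves_semantics(xs):
    # The property: source and translation agree on every generated input.
    assert transpiled_fn(xs) == source_fn(xs)

# run with: pytest -q this_file.py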



Paperid:2337
Authors:Sofiane Ennadir, Yassine Abbahaddou, Johannes F. Lutzeyer, Michalis Vazirgiannis, Henrik Boström
KTH Royal Institute of Technology, Ecole Polytechnique, Ecole Polytechnique, Ecole Polytechnique KTH Royal Institute of Technology, KTH Royal Institute of Technology
Abstract:
Graph Neural Networks (GNNs) have emerged as the dominant approach for machine learning on graph-structured data. However, concerns have arisen regarding the vulnerability of GNNs to small adversarial perturbations. Existing defense methods against such perturbations suffer from high time complexity and can negatively impact the model's performance on clean graphs. To address these challenges, this paper introduces NoisyGNN, a novel defense method that incorporates noise into the underlying model's architecture. We establish a theoretical connection between noise injection and the enhancement of GNN robustness, highlighting the effectiveness of our approach. We further conduct extensive empirical evaluations on the node classification task to validate our theoretical findings, focusing on two popular GNNs: the GCN and GIN. The results demonstrate that NoisyGNN achieves superior or comparable defense performance to existing methods while minimizing added time complexity. The NoisyGNN approach is model-agnostic, allowing it to be integrated with different GNN architectures. Successful combinations of our NoisyGNN approach with existing defense techniques demonstrate even further improved adversarial defense results. Our code is publicly available at: https://github.com/Sennadir/NoisyGNN.
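A minimal sketch of the noise-injection idea on top of a standard two-layer GCN using PyTorch Geometric (the noise scale and placement are illustrative, not the paper's tuned configuration).

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class NoisyGCN(torch.nn.Module):
    # Gaussian noise added to the hidden representation during training is
    # intended to blunt small adversarial perturbations of the input graph.
    def __init__(self, in_dim, hidden_dim, n_classes, sigma=0.1):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, n_classes)
        self.sigma = sigma

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        if self.training and self.sigma > 0:
            h = h + self.sigma * torch.randn_like(h)   # noise injection
        return self.conv2(h, edge_index)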



Paperid:2338
Authors:Linkun Fan, Fazhi He, Tongzhen Si, Wei Tang, Bing Li
School of Computer Science, Wuhan University, School of Computer Science, Wuhan University, School of Information Science and Engineering, University of Ninan, School of Computer Science, Wuhan University, School of Computer Science, Wuhan University Hubei Luojia Laboratory
Abstract:
3D point clouds have been widely used in security-critical domains, such as self-driving and 3D face recognition. A backdoor attack is a serious threat that usually corrupts Deep Neural Networks (DNNs) during the training stage. Though a few 3D backdoor attacks are designed to achieve guaranteed attack efficiency, the deformation they introduce can alert human inspection. To obtain invisible backdoored point clouds, this paper proposes a novel 3D backdoor attack, named IBAPC, which generates the backdoor trigger in the graph spectral domain. Its effectiveness is grounded in the advantage of graph spectral signals that both the global structure and local points can be made responsible for the resulting deformation in the spatial domain. In detail, a new backdoor implanting function is proposed, which transforms the point cloud into a graph spectral signal in order to embed the backdoor trigger. Then, we design a backdoor training procedure that alternately updates the parameters of the backdoor implanting function and the victim 3D DNN. Finally, the backdoored 3D DNN and its associated backdoor implanting function are obtained by finishing the backdoor training procedure. Experiment results suggest that IBAPC achieves SOTA attack stealthiness from three aspects, including objective distance measurement, subjective human evaluation, and graph spectral signal residual. At the same time, it obtains competitive attack efficiency. The code is available at https://github.com/f-lk/IBAPC.



Paperid:2339
Authors:Shuai Feng, Pengsheng Jin, Chongjun Wang
Nanjing University, Nanjing University, Nanjing University
Abstract:
Detecting out-of-distribution (OOD) inputs is critical for reliable machine learning, but deep neural networks often make overconfident predictions, even for OOD inputs that deviate from the distribution of the training data. Prior methods relied on the widely used softmax cross-entropy (CE) loss, which is adequate for classifying in-distribution (ID) samples but not optimally designed for OOD detection. To address this issue, we propose CASE, a simple and effective OOD detection method that explicitly improves the intra-class Compactness And inter-class Separability of feature Embeddings. To enhance the separation between ID and OOD samples, CASE uses a dual-loss framework, which includes a separability loss that maximizes the inter-class Euclidean distance to promote separability among different class centers, along with a compactness loss that minimizes the intra-class Euclidean distance to encourage samples to be close to their class centers. In particular, the class centers are defined as free optimization parameters of the model and updated by gradient descent, which is simple and further enhances OOD detection performance. Extensive experiments demonstrate the superiority of CASE, which reduces the average FPR95 by 37.11% and improves the average AUROC by 15.89% compared to the baseline method using a softmax confidence score on the more challenging CIFAR-100 model.
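A minimal sketch of a dual loss with learnable class centers in PyTorch (the margin and weighting are illustrative, not the paper's exact formulation): the compactness term pulls embeddings toward their class center, while the separability term pushes distinct centers apart.

import torch
import torch.nn as nn

class CompactSeparateLoss(nn.Module):
    def __init__(self, n_classes, feat_dim, margin=10.0, lam=0.1):
        super().__init__()
        # Class centers are free parameters updated by gradient descent.
        self.centers = nn.Parameter(torch.randn(n_classes, feat_dim))
        self.margin, self.lam = margin, lam

    def forward(self, embeddings, labels):
        # Compactness: squared distance of each embedding to its class center.
        compact = ((embeddings - self.centers[labels]) ** 2).sum(dim=1).mean()
        # Separability: hinge-penalize pairs of centers that are too close.
        dists = torch.cdist(self.centers, self.centers)
        off_diag = dists[~torch.eye(len(self.centers), dtype=torch.bool)]
        separate = torch.clamp(self.margin - off_diag, min=0).mean()
        return compact + self.lam * separate

# usage: loss = cross_entropy(logits, y) + aux_weight * aux_loss(features, y)
#        (the centers must be registered with the optimizer alongside the model)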



Paperid:2340
Authors:Uri Gadot, Esther Derman, Navdeep Kumar, Maxence Mohamed Elfatihi, Kfir Levy, Shie Mannor
Technion - Israel Institute of Technology, MILA, Université de Montréal, Technion - Israel Institute of Technology, IMT Atlantique, Technion - Israel Institute of Technology, Technion - Israel Institute of Technology NVIDIA Research
Abstract:
In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. In this work, we study coupled reward RMDPs where the transition kernel is fixed, but the reward function lies within an alpha-radius from a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDPs and applying policy visitation frequency regularization. We introduce a policy-gradient method, and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty.



Paperid:2341
Authors:Shangding Gu, Bilgehan Sel, Yuhao Ding, Lu Wang, Qingwei Lin, Ming Jin, Alois Knoll
Technical University of Munich, Virginia Tech, University of California - Berkeley, Microsoft, Microsoft Research, Virginia Tech, Technical University of Munich
Abstract:
Ensuring the safety of Reinforcement Learning (RL) is crucial for its deployment in real-world applications. Nevertheless, managing the trade-off between reward and safety during exploration presents a significant challenge. Improving reward performance through policy adjustments may adversely affect safety performance. In this study, we aim to address this conflicting relation by leveraging the theory of gradient manipulation. Initially, we analyze the conflict between reward and safety gradients. Subsequently, we tackle the balance between reward and safety optimization by proposing a soft switching policy optimization method, for which we provide convergence analysis. Based on our theoretical examination, we provide a safe RL framework to overcome the aforementioned challenge, and we develop a Safety-MuJoCo Benchmark to assess the performance of safe RL algorithms. Finally, we evaluate the effectiveness of our method on the Safety-MuJoCo Benchmark and a popular safe benchmark, Omnisafe. Experimental results demonstrate that our algorithms outperform several state-of-the-art baselines in terms of balancing reward and safety optimization.



Paperid:2342
Authors:Yin Gu, Kai Zhang, Qi Liu, Weibo Gao, Longfei Li, Jun Zhou
Anhui Province Key Laboratory of Big Data Analysis and Application, School of Data Science & School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, Anhui Province Key Laboratory of Big Data Analysis and Application, School of Data Science & School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, Anhui Province Key Laboratory of Big Data Analysis and Application, School of Data Science & School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, Anhui Province Key Laboratory of Big Data Analysis and Application, School of Data Science & School of Computer Science and Technology, University of Science and Technology of China State Key Laboratory of Cognitive Intelligence, Ant Financial Services Group, Ant Financial Services Group
Abstract:
The recent advancements in Deep Reinforcement Learning (DRL) have significantly enhanced the performance of adaptive Traffic Signal Control (TSC). However, DRL policies are typically represented by neural networks, which are overparameterized black-box models. As a result, the learned policies often lack interpretability, and cannot be deployed directly on real-world edge hardware due to resource constraints. In addition, the DRL methods often exhibit limited generalization performance, struggling to generalize the learned policy to other geographical regions. These factors limit the practical application of learning-based approaches. To address these issues, we suggest the use of an inherently interpretable program for representing the control policy. We present a new approach, Programmatic Interpretable reinforcement learning for traffic signal control (π-light), designed to autonomously discover non-differentiable programs. Specifically, we define a Domain Specific Language (DSL) and transformation rules for constructing programs, and utilize Monte Carlo Tree Search (MCTS) to find the optimal program in a discrete space. Extensive experiments demonstrate that our method consistently outperforms baseline approaches. Moreover, π-Light exhibits superior generalization capabilities compared to DRL, enabling training and evaluation across intersections from different cities. Finally, we analyze how the learned program policies can be directly deployed on edge devices with extremely limited resources.



Paperid:2343
Authors:Riccardo Guidotti, Anna Monreale, Mattia Setzu, Giulia Volpi
University of Pisa, Pisa, Italy ISTI-CNR, Pisa, Italy, University of Pisa, Pisa, Italy, University of Pisa, Pisa, Italy, University of Pisa, Pisa, Italy
Abstract:
Decision trees are among the most popular supervised models due to their interpretability and knowledge representation resembling human reasoning. Commonly used decision tree induction algorithms are based on greedy top-down strategies. Although these approaches are known to be efficient heuristics, the resulting trees are only locally optimal and tend to have overly complex structures. On the other hand, optimal decision tree algorithms attempt to create an entire decision tree at once to achieve global optimality. We place our proposal between these approaches by designing a generative model for decision trees. Our method first learns a latent decision tree space through a variational architecture using pre-trained decision tree models. Then, it adopts a genetic procedure to explore such latent space to find a compact decision tree with good predictive performance. We compare our proposal against classical tree induction methods, optimal approaches, and ensemble models. The results show that our proposal can generate accurate and shallow, i.e., interpretable, decision trees.



Paperid:2344
Authors:Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak
University of Twente, University of Colorado Boulder, University of Liverpool, University of Colorado Boulder, University of Colorado Boulder, University of Liverpool
Abstract:
Regular decision processes (RDPs) are a subclass of non-Markovian decision processes where the transition and reward functions are guarded by some regular property of the past (a lookback). While RDPs enable intuitive and succinct representation of non-Markovian decision processes, their expressive power coincides with finite-state Markov decision processes (MDPs). We introduce omega-regular decision processes (ODPs) where the non-Markovian aspect of the transition and reward functions is extended to an omega-regular lookahead over the system evolution. Semantically, these lookaheads can be considered as promises made by the decision maker or the learning agent about her future behavior. In particular, we assume that, if the promised lookaheads are not met, then the payoff to the decision maker is falsum (least desirable payoff), overriding any rewards collected by the decision maker. We enable optimization and learning for ODPs under the discounted-reward objective by reducing them to lexicographic optimization and learning over finite MDPs. We present experimental results demonstrating the effectiveness of the proposed reduction.



Paperid:2345
Authors:Zayd Hammoudeh, Daniel Lowd
University of Oregon Qualtrics, Inc., University of Oregon
Abstract:
Sparse or L0 adversarial attacks arbitrarily perturb an unknown subset of the features. L0 robustness analysis is particularly well-suited for heterogeneous (tabular) data where features have different types or scales. State-of-the-art L0 certified defenses are based on randomized smoothing and apply to evasion attacks only. This paper proposes feature partition aggregation (FPA) -- a certified defense against the union of L0 evasion, backdoor, and poisoning attacks. FPA generates its stronger robustness guarantees via an ensemble whose submodels are trained on disjoint feature sets. Compared to state-of-the-art L0 defenses, FPA is up to 3,000x faster and provides larger median robustness guarantees (e.g., median certificates of 13 pixels over 10 for CIFAR10, 12 pixels over 10 for MNIST, 4 features over 1 for Weather, and 3 features over 1 for Ames), meaning FPA provides the additional dimensions of robustness essentially for free.
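To illustrate the ensemble-over-disjoint-feature-sets idea, the sketch below aggregates votes from submodels that each see their own feature block and derives a simple certified bound from the plurality gap. The certificate used here (including the handling of ties) is a simplifying assumption rather than FPA's exact guarantee.

```python
import numpy as np

def fpa_predict_and_certify(models, feature_blocks, x):
    """Feature-partition-aggregation sketch: each submodel votes using its own
    disjoint feature block; the plurality gap yields a certified bound on how
    many perturbed features the prediction can tolerate. The exact certificate
    (and tie-breaking) is a simplifying assumption here."""
    votes = np.array([m(x[idx]) for m, idx in zip(models, feature_blocks)])
    counts = np.bincount(votes)
    pred = int(counts.argmax())
    runner_up = np.sort(counts)[-2] if len(counts) > 1 else 0
    gap = counts[pred] - runner_up
    # Each perturbed feature lies in at most one block, so it can flip at most
    # one vote; conservatively, fewer than gap/2 flipped votes cannot change
    # the winner (assuming ties break against the predicted class).
    certified_features = (gap - 1) // 2
    return pred, certified_features

# Toy usage: 5 submodels over disjoint blocks of a 10-feature input.
rng = np.random.default_rng(0)
x = rng.normal(size=10)
blocks = [np.arange(i, i + 2) for i in range(0, 10, 2)]
models = [lambda xi: int(xi.sum() > 0) for _ in range(5)]
print(fpa_predict_and_certify(models, blocks, x))
```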



Paperid:2346
Authors:SeungHoo Hong, Juhun Lee, Simon S. Woo
Sungkyunkwan University (SKKU), Sungkyunkwan University (SKKU), Sungkyunkwan University (SKKU)
Abstract:
Text-to-Image models such as Stable Diffusion have shown impressive image synthesis, thanks to the utilization of large-scale datasets. However, these datasets may contain sexually explicit, copyrighted, or undesirable content, which allows the model to directly generate them. Given that retraining these large models on individual concept deletion requests is infeasible, fine-tuning algorithms have been developed to tackle concept erasing in diffusion models. While these algorithms yield good concept erasure, they all present one of the following issues: 1) the corrupted feature space yields synthesis of disintegrated objects, 2) the initially synthesized content undergoes a divergence in both spatial structure and semantics in the generated images, and 3) sub-optimal training updates heighten the model's susceptibility to utility harm. These issues severely degrade the original utility of generative models. In this work, we present a new approach that solves all of these challenges. We take inspiration from the concept of classifier guidance and propose a surgical update on the classifier guidance term while constraining the drift of the unconditional score term. Furthermore, our algorithm empowers the user to select an alternative to the erasing concept, allowing for more controllability. Our experimental results show that our algorithm not only erases the target concept effectively but also preserves the model’s generation capability.



Paperid:2347
Authors:Pei Huang, Haoze Wu, Yuting Yang, Ieva Daukantas, Min Wu, Yedi Zhang, Clark Barrett
Stanford University, Stanford, CA, USA, Stanford University, Stanford, CA, USA, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, IT University of Copenhagen, Copenhagen, Denmark, Stanford University, Stanford, CA, USA, National University of Singapore, Singapore, Stanford University, Stanford, CA, USA
Abstract:
Quantization replaces floating point arithmetic with integer arithmetic in deep neural network models, providing more efficient on-device inference with less power and memory. In this work, we propose a framework for formally verifying the properties of quantized neural networks. Our baseline technique is based on integer linear programming which guarantees both soundness and completeness. We then show how efficiency can be improved by utilizing gradient-based heuristic search methods and also bound-propagation techniques. We evaluate our approach on perception networks quantized with PyTorch. Our results show that we can verify quantized networks with better scalability and efficiency than the previous state of the art.



Paperid:2348
Authors:Qihan Huang, Jie Song, Jingwen Hu, Haofei Zhang, Yong Wang, Mingli Song
Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, State Grid Shandong Electric Power Company, Zhejiang University
Abstract:
Concept Bottleneck Models (CBMs), which break down the reasoning process into the input-to-concept mapping and the concept-to-label prediction, have garnered significant attention due to their remarkable interpretability achieved by the interpretable concept bottleneck. However, despite the transparency of the concept-to-label prediction, the mapping from the input to the intermediate concept remains a black box, giving rise to concerns about the trustworthiness of the learned concepts (i.e., these concepts may be predicted based on spurious cues). The issue of concept untrustworthiness greatly hampers the interpretability of CBMs, thereby hindering their further advancement. To conduct a comprehensive analysis on this issue, in this study we establish a benchmark to assess the trustworthiness of concepts in CBMs. A pioneering metric, referred to as concept trustworthiness score, is proposed to gauge whether the concepts are derived from relevant regions. Additionally, an enhanced CBM is introduced, enabling concept predictions to be made specifically from distinct parts of the feature map, thereby facilitating the exploration of their related regions. Besides, we introduce three modules, namely the cross-layer alignment (CLA) module, the cross-image alignment (CIA) module, and the prediction alignment (PA) module, to further enhance the concept trustworthiness within the elaborated CBM. The experiments on five datasets across ten architectures demonstrate that without using any concept localization annotations during training, our model improves the concept trustworthiness by a large margin, meanwhile achieving accuracy superior to the state of the art. Our code is available at https://github.com/hqhQAQ/ProtoCBM.



Paperid:2349
Authors:Yihao Huang, Felix Juefei-Xu, Qing Guo, Jie Zhang, Yutong Wu, Ming Hu, Tianlin Li, Geguang Pu, Yang Liu
Nanyang Technological University, Singapore, New York University, USA, CFAR and IHPC, Agency for Science, Technology and Research (A*STAR), Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore, Nanyang Technological University, Singapore, East China Normal University, China, Nanyang Technological University, Singapore
Abstract:
Although recent personalization methods have democratized high-resolution image synthesis by enabling swift concept acquisition with minimal examples and lightweight computation, they also present an exploitable avenue for highly accessible backdoor attacks. This paper investigates a critical and unexplored aspect of text-to-image (T2I) diffusion models - their potential vulnerability to backdoor attacks via personalization. By studying the prompt processing of popular personalization methods (epitomized by Textual Inversion and DreamBooth), we have devised dedicated personalization-based backdoor attacks according to the different ways of dealing with unseen tokens and divide them into two families: nouveau-token and legacy-token backdoor attacks. In comparison to conventional backdoor attacks involving the fine-tuning of the entire text-to-image diffusion model, our proposed personalization-based backdoor attack method can facilitate more tailored, efficient, and few-shot attacks. Through comprehensive empirical study, we endorse the utilization of the nouveau-token backdoor attack due to its impressive effectiveness, stealthiness, and integrity, markedly outperforming the legacy-token backdoor attack.



Paperid:2350
Authors:Samyak Jain, Tanima Dutta
Indian Institute of Technology (BHU) Varanasi, Indian Institute of Technology (BHU) Varanasi
Abstract:
Despite the increasing popularity of graph neural networks (GNNs), the security risks associated with their deployment have not been well explored. Existing works follow the standard adversarial attacks to maximize cross-entropy loss within an L-infinity norm bound. We analyze the robustness of GNNs against node injection attacks (NIAs) in black-box settings by allowing new nodes to be injected and attacked. In this work, we propose to design stronger and transferable NIAs. First, we propose margin aware attack (MAA) that uses a maximum margin loss to generate NIAs. We then propose a novel margin and direction aware attack (MDA) that diversifies the initial directions of MAA attack by minimizing the cosine similarity of the injected nodes with respect to their respective random initialization in addition to the maximization of max-margin loss. This makes the NIAs stronger. We further observe that using the L2 norm of gradients in the attack step leads to an enhanced diversity amongst the node features, thereby further enhancing the strength of the attack. We incorporate transferability in NIAs by perturbing the surrogate model before generating the attack. An analysis of the eigen spectrum density of the hessian of the loss emphasizes that perturbing the weights of the surrogate model improves the transferability. Our experimental results demonstrate that the proposed resilient node injection attack (R-NIA) consistently outperforms PGD by margins of about 7-15% on both large and small graph datasets. R-NIA is significantly stronger and more transferable than existing NIAs on graph robustness benchmarks.



Paperid:2351
Authors:Zhuangzhuang Jia, Grani A. Hanasusanto, Phebe Vayanos, Weijun Xie
University of Illinois Urbana-Champaign, University of Illinois Urbana-Champaign, University of Southern California, Georgia Institute of Technology
Abstract:
We consider the problem of learning fair policies for multi-stage selection problems from observational data. This problem arises in several high-stakes domains such as company hiring, loan approval, or bail decisions where outcomes (e.g., career success, loan repayment, recidivism) are only observed for those selected. We propose a multi-stage framework that can be augmented with various fairness constraints, such as demographic parity or equal opportunity. This problem is a highly intractable infinite chance-constrained program involving the unknown joint distribution of covariates and outcomes. Motivated by the potential impact of selection decisions on people’s lives and livelihoods, we propose to focus on interpretable linear selection rules. Leveraging tools from causal inference and sample average approximation, we obtain an asymptotically consistent solution to this selection problem by solving a mixed binary conic optimization problem, which can be solved using standard off-the-shelf solvers. We conduct extensive computational experiments on a variety of datasets adapted from the UCI repository on which we show that our proposed approaches can achieve an 11.6% improvement in precision and a 38% reduction in the measure of unfairness compared to the existing selection policy.



Paperid:2352
Authors:Wenxiang Jiang, Hanwei Zhang, Xi Wang, Zhongwen Guo, Hao Wang
Ocean University of China, Institute of Intelligent Software, Guangzhou Saarland University, LIX, Ecole Polytechnique, CNRS, Institut Polytechnique de Paris, Ocean University of China, Norwegian University of Science and Technology, School of Cyber Engineering, Xidian University, China
Abstract:
Adversarial attacks, i.e., generating adversarial perturbations with a small magnitude to deceive deep neural networks, are important for investigating and improving model trustworthiness. Traditionally, the topic was scoped within 2D images without considering 3D multiview information. Benefiting from Neural Radiance Fields (NeRF), one can easily reconstruct a 3D scene with a Multi-Layer Perceptron (MLP) from given 2D views and synthesize photo-realistic renderings of novel vantages. This opens the door to discussing the possibility of attacking a multiview NeRF network with downstream tasks from different rendering angles, which we denote Neural Radiance Fields-based multiview adversarial Attack (NeRFail). The goal is, given one scene and a subset of views, to deceive the recognition results of agnostic view angles as well as given views. To do so, we propose a transformation mapping from pixels to 3D points such that our attack generates multiview adversarial perturbations by attacking a subset of images with different views, intending to prevent the downstream classifier from correctly predicting images rendered by NeRF from other views. Experiments show that our multiview adversarial perturbations successfully obfuscate the downstream classifier at both known and unknown views. Notably, when retraining another NeRF on the perturbed training data, we show that the perturbation can be inherited and reproduced. The code can be found at https://github.com/jiang-wenxiang/NeRFail.



Paperid:2353
Authors:Yangdi Jiang, Yi Liu, Xiaodong Yan, Anne-Sophie Charest, Linglong Kong, Bei Jiang
Department of Mathematical and Statistical Sciences, University of Alberta, Department of Mathematical and Statistical Sciences, University of Alberta, Zhongtai Securities Institute for Financial Studies, Shandong University, Department of Mathematics and Statistics, Laval University, Department of Mathematical and Statistical Sciences, University of Alberta, Department of Mathematical and Statistical Sciences, University of Alberta
Abstract:
Differentially private (DP) synthetic datasets have been receiving significant attention from academia, industry, and government. However, little is known about how to perform statistical inference using DP synthetic datasets. Naive approaches that do not take into account the induced uncertainty due to the DP mechanism will result in biased estimators and invalid inferences. In this paper, we present a class of maximum likelihood estimator (MLE)-based, easy-to-implement, bias-corrected DP estimators with valid asymptotic confidence intervals (CI) for parameters in regression settings, by establishing the connection between additive DP mechanisms and measurement error models. Our simulation shows that our estimator has comparable performance to the widely used sufficient statistic perturbation (SSP) algorithm in some scenarios but with the advantage of releasing a synthetic dataset and obtaining statistically valid asymptotic CIs, which can achieve better coverage when compared to the naive CIs obtained by ignoring the DP mechanism.



Paperid:2354
Authors:Zhimeng Jiang, Xiaotian Han, Chao Fan, Zirui Liu, Na Zou, Ali Mostafavi, Xia Hu
Texas A&M University, Texas A&M University, Clemson University, Rice University, University of Houston, Texas A&M University, Rice University
Abstract:
There has been significant progress in improving the performance of graph neural networks (GNNs) through enhancements in graph data, model architecture design, and training strategies. For fairness in graphs, recent studies achieve fair representations and predictions through either graph data preprocessing (e.g., node feature masking, and topology rewiring) or fair training strategies (e.g., regularization, adversarial debiasing, and fair contrastive learning). How to achieve fairness in graphs from the model architecture perspective is less explored. More importantly, GNNs exhibit worse fairness performance compared to a multi-layer perceptron since their model architecture (i.e., neighbor aggregation) amplifies biases. To this end, we aim to achieve fairness via a new GNN architecture. We propose Fair Message Passing (FMP) designed within a unified optimization framework for GNNs. Notably, FMP explicitly renders sensitive attribute usage in forward propagation for the node classification task using cross-entropy loss without data pre-processing. In FMP, the aggregation is first adopted to utilize neighbors' information and then the bias mitigation step explicitly pushes demographic group node representation centers together. In this way, the FMP scheme can aggregate useful information from neighbors and mitigate bias to achieve better fairness and prediction tradeoff performance. Experiments on node classification tasks demonstrate that the proposed FMP outperforms several baselines in terms of fairness and accuracy on three real-world datasets. The code is available at https://github.com/zhimengj0326/FMP.
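A hedged sketch of the aggregate-then-mitigate structure described above: a standard aggregation step followed by a step that pushes the two demographic groups' representation centers toward each other. The step size and the exact form of the mitigation update are illustrative assumptions, not FMP's derived update.

```python
import torch

def fair_message_passing_step(adj_norm, h, sensitive, gamma=0.5):
    """One aggregation step followed by a bias-mitigation step that moves the
    representation centers of the two demographic groups toward each other.
    The step size gamma and the update form are illustrative assumptions."""
    h = adj_norm @ h                                   # neighbor aggregation
    center_0 = h[sensitive == 0].mean(dim=0)
    center_1 = h[sensitive == 1].mean(dim=0)
    shift = center_1 - center_0
    h = h.clone()
    h[sensitive == 0] += gamma * 0.5 * shift           # move group 0 toward group 1
    h[sensitive == 1] -= gamma * 0.5 * shift           # move group 1 toward group 0
    return h

# Toy usage: 6 nodes on a path graph, 4-dimensional representations.
adj = torch.eye(6) + torch.diag(torch.ones(5), 1) + torch.diag(torch.ones(5), -1)
adj_norm = adj / adj.sum(dim=1, keepdim=True)          # row-normalized adjacency
h = torch.randn(6, 4)
s = torch.tensor([0, 0, 0, 1, 1, 1])                   # binary sensitive attribute
print(fair_message_passing_step(adj_norm, h, s).shape)
```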



Paperid:2355
Authors:Milad Kazemi, Mateo Perez, Fabio Somenzi, Sadegh Soudjani, Ashutosh Trivedi, Alvaro Velasquez
King's College London, UK, University of Colorado Boulder, USA, University of Colorado Boulder, USA, Max Planck Institute for Software Systems, Germany, University of Colorado Boulder, USA, University of Colorado Boulder, USA
Abstract:
We present a modular approach to reinforcement learning (RL) in environments consisting of simpler components evolving in parallel. A monolithic view of such modular environments may be prohibitively large to learn, or may require unrealizable communication between the components in the form of a centralized controller. Our proposed approach is based on the assume-guarantee paradigm where the optimal control for the individual components is synthesized in isolation by making assumptions about the behaviors of neighboring components, and providing guarantees about their own behavior. We express these assume-guarantee contracts as regular languages and provide automatic translations to scalar rewards to be used in RL. By combining local probabilities of satisfaction for each component, we provide a lower bound on the probability of satisfaction of the complete system. By solving a Markov game for each component, RL can produce a controller for each component that maximizes this lower bound. The controller utilizes the information it receives through communication, observations, and any knowledge of a coarse model of other agents. We experimentally demonstrate the efficiency of the proposed approach on a variety of case studies.



Paperid:2356
Authors:Haitham Khedr, Yasser Shoukry
University of California, Irvine, University of California, Irvine
Abstract:
Formal certification of Neural Networks (NNs) is crucial for ensuring their safety, fairness, and robustness. Unfortunately, on the one hand, sound and complete certification algorithms of ReLU-based NNs do not scale to large-scale NNs. On the other hand, incomplete certification algorithms are easier to compute, but they result in loose bounds that deteriorate with the depth of the NN, which diminishes their effectiveness. In this paper, we ask the following question: can we replace the ReLU activation function with one that opens the door to incomplete certification algorithms that are easy to compute but can produce tight bounds on the NN's outputs? We introduce DeepBern-Nets, a class of NNs with activation functions based on Bernstein polynomials instead of the commonly used ReLU activation. Bernstein polynomials are smooth and differentiable functions with desirable properties such as the so-called range enclosure and subdivision properties. We design a novel Interval Bound Propagation (IBP) algorithm, called Bern-IBP, to efficiently compute tight bounds on DeepBern-Nets outputs. Our approach leverages the properties of Bernstein polynomials to improve the tractability of neural network certification tasks while maintaining the accuracy of the trained networks. We conduct experiments in adversarial robustness and reachability analysis settings to assess the effectiveness of the approach. Our proposed framework achieves high certified accuracy for adversarially-trained NNs, which is often a challenging task for certifiers of ReLU-based NNs. This work establishes Bernstein polynomial activation as a promising alternative for improving NN certification tasks across various NNs applications.
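The sketch below illustrates the Bernstein-polynomial property that such an approach can exploit: evaluating a polynomial in Bernstein form and bounding its range over an interval directly from its coefficients (the range enclosure property). It is a generic illustration, not the Bern-IBP algorithm itself.

```python
import numpy as np
from math import comb

def bernstein_eval(coeffs, x, lo, hi):
    """Evaluate a Bernstein-form polynomial with coefficients `coeffs`
    (degree = len(coeffs) - 1) at a point x in the interval [lo, hi]."""
    n = len(coeffs) - 1
    t = (x - lo) / (hi - lo)
    basis = np.array([comb(n, k) * t**k * (1 - t)**(n - k) for k in range(n + 1)])
    return float(np.dot(coeffs, basis))

def bernstein_range_enclosure(coeffs):
    """Range enclosure: over its interval, a Bernstein-form polynomial lies
    between the minimum and maximum of its coefficients, giving a cheap
    interval bound usable inside an IBP-style certifier."""
    return min(coeffs), max(coeffs)

# Toy usage: an activation-like cubic on [-2, 2].
coeffs = [0.0, -0.3, 0.8, 1.0]
print(bernstein_eval(coeffs, 0.5, -2.0, 2.0))
print(bernstein_range_enclosure(coeffs))
```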



Paperid:2357
Authors:Hyunjune Kim, Sangyong Lee, Simon S. Woo
Sungkyunkwan University, Sungkyunkwan University, Sungkyunkwan University
Abstract:
Recently, serious concerns have been raised about the privacy issues related to training datasets in machine learning algorithms when including personal data. Various regulations in different countries, including the GDPR, grant individuals the right to have their personal data erased, known as ‘the right to be forgotten’ or ‘the right to erasure’. However, there has been less research on effectively and practically deleting the requested personal data from the training set while not jeopardizing the overall machine learning performance. In this work, we propose a fast and novel machine unlearning paradigm at the layer level called layer attack unlearning, which is highly accurate and fast compared to existing machine unlearning algorithms. We introduce the PartialPGD algorithm to locate the samples to forget efficiently. In addition, we use only the last layer of the model, inspired by the Forward-Forward algorithm, for the unlearning process. Lastly, we use Knowledge Distillation (KD) to reliably learn the decision boundaries from the teacher using soft label information to improve accuracy performance. We conducted extensive experiments with SOTA machine unlearning models and demonstrated the effectiveness of our approach for accuracy and end-to-end unlearning performance.



Paperid:2358
Authors:Minsu Kim, Seong-Hyeon Hwang, Steven Euijong Whang
KAIST, KAIST, KAIST
Abstract:
Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams. Unfortunately, concept drifts may occur in data streams where the joint distribution of the data X and label y, P(X, y), changes over time and possibly degrade model accuracy. Existing concept drift adaptation approaches mostly focus on updating the model to the new data possibly using ensemble techniques of previous models and tend to discard the drifted historical data. However, we contend that explicitly utilizing the drifted data together leads to much better model accuracy and propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy. To address the potential downside of efficiency, Quilt extends existing data subset selection techniques, which can be used to reduce the training data without compromising model accuracy. These techniques cannot be used as is because they only assume virtual drifts where the posterior probabilities P(y|X) are assumed not to change. In contrast, a key challenge in our setup is to also discard undesirable data segments with concept drifts. Quilt thus discards drifted data segments and selects data segment subsets holistically for accurate and efficient model training. The two operations use gradient-based scores, which have little computation overhead. In our experiments, we show that Quilt outperforms state-of-the-art drift adaptation and data selection baselines on synthetic and real datasets.



Paperid:2359
Authors:Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki
Tokyo Institute of Technology, MBZUAI Tokyo Institute of Technology, Tokyo Institute of Technology
Abstract:
Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of misuse of LLMs and demands the development of detectors to identify LLM-generated texts. However, existing detectors lack robustness against attacks: they degrade detection accuracy by simply paraphrasing LLM-generated texts. Furthermore, a malicious user might attempt to deliberately evade the detectors based on detection results, but this has not been assumed in previous studies. In this paper, we propose OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output. In this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Experiments in the domain of student essays show that the proposed detector improves the detection performance on the attacker-generated texts by up to +41.3 points F1-score. Furthermore, the proposed detector shows a state-of-the-art detection performance: up to 96.9 points F1-score, beating existing detectors on non-attacked texts. Finally, the proposed attacker drastically degrades the performance of detectors by up to -57.0 points F1-score, massively outperforming the baseline paraphrasing method for evading detection.



Paperid:2360
Authors:Matthias König, Holger H. Hoos, Jan N. van Rijn
Leiden Institute of Advanced Computer Science, Leiden University, Leiden Institute of Advanced Computer Science, Leiden University Chair for AI Methodology, RWTH Aachen University, Leiden Institute of Advanced Computer Science, Leiden University
Abstract:
Recent research has introduced several approaches to formally verify the robustness of neural network models against perturbations in their inputs, such as the ones that occur in adversarial attacks. At the same time, this particular verification task is known to be computationally challenging. More specifically, assessing the robustness of a neural network against input perturbations can easily take several hours of compute time per input vector, even when using state-of-the-art verification approaches. In light of this, it becomes challenging to select from a given set of neural network models the one that is best in terms of robust accuracy, i.e., the fraction of instances for which the model is known to be robust against adversarial perturbations, especially when given limited computing resources. To tackle this problem, we propose a racing method specifically adapted to the domain of robustness verification. This racing method utilises Delta-values, which can be seen as an efficiently computable proxy for the distance of a given input to the decision boundary of a neural network model. We present statistical evidence indicating significant differences in the empirical cumulative distribution between robust and non-robust inputs as a function of Delta-values. Using this information, we show that it is possible to reliably expose vulnerabilities in the model with relatively few input iterations. Overall, when applied to selecting the most robust network from sets of 31 MNIST and 27 CIFAR-10 networks, our proposed method achieves speedups of a factor of 108 and 42, respectively, in terms of cumulative running time compared to standard local robustness verification on the complete testing sets.
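As an illustration of the racing idea, the sketch below assumes the Delta-value is the margin between the top two logits (the paper's exact definition may differ) and orders inputs so that the likely-vulnerable ones are verified first.

```python
import numpy as np

def delta_values(logits):
    """Proxy for distance to the decision boundary, assumed here to be the
    margin between the top two logits per input."""
    sorted_logits = np.sort(logits, axis=1)
    return sorted_logits[:, -1] - sorted_logits[:, -2]

def racing_order(logits):
    """Verify inputs with the smallest margin first: if a network is going to
    be non-robust anywhere, these inputs are the most likely witnesses, so weak
    networks can be eliminated from the race after few verification calls."""
    return np.argsort(delta_values(logits))

# Toy usage: 5 inputs, 3 classes.
logits = np.array([[2.0, 1.9, 0.1],
                   [5.0, 0.2, 0.1],
                   [1.0, 0.9, 0.8],
                   [3.0, 0.5, 2.8],
                   [0.3, 0.2, 0.1]])
print(racing_order(logits))   # indices of inputs, most vulnerable first
```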



Paperid:2361
Authors:Merlijn Krale, Thiago D. Simão, Jana Tumova, Nils Jansen
Radboud University, Nijmegen, Eindhoven University of Technology, KTH Royal Institute of Technology, Stockholm, Radboud University, Nijmegen Ruhr-University Bochum
Abstract:
Partial observability and uncertainty are common problems in sequential decision-making that particularly impede the use of formal models such as Markov decision processes (MDPs). However, in practice, agents may be able to employ costly sensors to measure their environment and resolve partial observability by gathering information. Moreover, imprecise transition functions can capture model uncertainty. We combine these concepts and extend MDPs to robust active-measuring MDPs (RAM-MDPs). We present an active-measure heuristic to solve RAM-MDPs efficiently and show that model uncertainty can, counterintuitively, let agents take fewer measurements. We propose a method to counteract this behavior while only incurring a bounded additional cost. We empirically compare our methods to several baselines and show their superior scalability and performance.



Paperid:2362
Authors:Bo-Han Kung, Shang-Tse Chen
National Taiwan University, National Taiwan University
Abstract:
Randomized smoothing is currently the state-of-the-art method that provides certified robustness for deep neural networks. However, due to its excessively conservative nature, this method of incomplete verification often cannot achieve an adequate certified radius on real-world datasets. One way to obtain a larger certified radius is to use an input-specific algorithm instead of using a fixed Gaussian filter for all data points. Several methods based on this idea have been proposed, but they either suffer from high computational costs or gain marginal improvement in certified radius. In this work, we show that by exploiting the quasiconvex problem structure, we can find the optimal certified radii for most data points with slight computational overhead. This observation leads to an efficient and effective input-specific randomized smoothing algorithm. We conduct extensive experiments and empirical analysis on CIFAR-10 and ImageNet. The results show that the proposed method significantly enhances the certified radii with low computational overhead.
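For context, a standard randomized-smoothing certificate has the form R = sigma * Phi^{-1}(p_A), where p_A lower-bounds the top-class probability under Gaussian noise; an input-specific method then chooses sigma per input. The sketch below uses a plain grid over candidate sigmas in place of the paper's quasiconvexity-based search, and the probability-estimation routine is a stand-in assumption.

```python
from statistics import NormalDist
import numpy as np

def certified_radius(p_lower, sigma):
    """Standard randomized-smoothing radius: R = sigma * Phi^{-1}(p_lower),
    where p_lower lower-bounds the top-class probability under noise."""
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)

def best_sigma_for_input(estimate_p_lower, sigmas):
    """Input-specific sketch: try candidate noise levels and keep the one with
    the largest certified radius. The paper exploits quasiconvexity to avoid
    an exhaustive search; a plain grid is used here as a simplification."""
    radii = [certified_radius(estimate_p_lower(s), s) for s in sigmas]
    best = int(np.argmax(radii))
    return sigmas[best], radii[best]

# Toy usage: a made-up accuracy-vs-noise curve standing in for Monte Carlo estimation.
fake_p = lambda s: max(0.0, 0.99 - 0.3 * s)
print(best_sigma_for_input(fake_p, sigmas=[0.12, 0.25, 0.5, 1.0]))
```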



Paperid:2363
Authors:Brody Kutt, Pralay Ramteke, Xavier Mignot, Pamela Toman, Nandini Ramanan, Sujit Rokka Chhetri, Shan Huang, Min Du, William Hewlett
Palo Alto Networks, AI Research, Palo Alto Networks, AI Research, Palo Alto Networks, AI Research, Palo Alto Networks, AI Research, Palo Alto Networks, AI Research, Palo Alto Networks, AI Research, Palo Alto Networks, AI Research, Palo Alto Networks, AI Research, Palo Alto Networks, AI Research
Abstract:
Producing labels for unlabeled data is error-prone, making semi-supervised learning (SSL) troublesome. Often, little is known about when and why an algorithm fails to outperform a supervised baseline. Using benchmark datasets, we craft five common real-world SSL data scenarios: few-label, open-set, noisy-label, and class distribution imbalance/misalignment in the labeled and unlabeled sets. We propose a novel algorithm called Contrastive Credibility Propagation (CCP) for deep SSL via iterative transductive pseudo-label refinement. CCP unifies semi-supervised learning and noisy label learning for the goal of reliably outperforming a supervised baseline in any data scenario. Compared to prior methods which focus on a subset of scenarios, CCP uniquely outperforms the supervised baseline in all scenarios, supporting practitioners when the qualities of labeled or unlabeled data are unknown.



Paperid:2364
Authors:Tobias Ladner, Matthias Althoff
Technical University of Munich, Germany, Technical University of Munich, Germany
Abstract:
Formal verification of neural networks is a challenging problem due to the complexity and nonlinearity of neural networks. It has been shown that polynomial zonotopes can tightly enclose the output set of a neural network. Unfortunately, the tight enclosure comes with additional complexity in the set representation, thus rendering subsequent operations expensive to compute, such as computing interval bounds and intersection checking. To address this issue, we present a novel approach to restructure a polynomial zonotope to tightly enclose the original polynomial zonotope while drastically reducing its complexity. The restructuring is achieved by relaxing the exponents of the dependent factors of polynomial zonotopes and finding an appropriate approximation error. We demonstrate the applicability of our approach on output sets of neural networks, where we obtain tighter results in various subsequent operations, such as order reduction, zonotope enclosure, and range bounding.



Paperid:2365
Authors:Tobias Leemann, Martin Pawelczyk, Christian Thomas Eberle, Gjergji Kasneci
University of Tübingen, Technical University of Munich, Harvard University, University of Tübingen, Technical University of Munich
Abstract:
We examine machine learning models in a setup where individuals have the choice to share optional personal information with a decision-making system, as seen in modern insurance pricing models. Some users consent to their data being used whereas others object and keep their data undisclosed. In this work, we show that the decision not to share data can be considered as information in itself that should be protected to respect users' privacy. This observation raises the overlooked problem of how to ensure that users who protect their personal data do not suffer any disadvantages as a result. To address this problem, we formalize protection requirements for models which only use the information for which active user consent was obtained. This excludes implicit information contained in the decision to share data or not. We offer the first solution to this problem by proposing the notion of Protected User Consent (PUC), which we prove to be loss-optimal under our protection requirement. We observe that privacy and performance are not fundamentally at odds with each other and that it is possible for a decision maker to benefit from additional data while respecting users' consent. To learn PUC-compliant models, we devise a model-agnostic data augmentation strategy with finite sample convergence guarantees. Finally, we analyze the implications of PUC on challenging real datasets, tasks, and models.



Paperid:2366
Authors:Francesco Leofante, Nico Potyka
Department of Computing, Imperial College London, UK, School of Computer Science and Informatics, Cardiff University, UK
Abstract:
Counterfactual explanations shed light on the decisions of black-box models by explaining how an input can be altered to obtain a favourable decision from the model (e.g., when a loan application has been rejected). However, as noted recently, counterfactual explainers may lack robustness in the sense that a minor change in the input can cause a major change in the explanation. This can cause confusion on the user side and open the door for adversarial attacks. In this paper, we study some sources of non-robustness. While there are fundamental reasons for why an explainer that returns a single counterfactual cannot be robust in all instances, we show that some interesting robustness guarantees can be given by reporting multiple rather than a single counterfactual. Unfortunately, the number of counterfactuals that need to be reported for the theoretical guarantees to hold can be prohibitively large. We therefore propose an approximation algorithm that uses a diversity criterion to select a feasible number of most relevant explanations and study its robustness empirically. Our experiments indicate that our method improves the state-of-the-art in generating robust explanations, while maintaining other desirable properties and providing competitive computational performance.



Paperid:2367
Authors:Fangqi Li, Haodong Zhao, Wei Du, Shilin Wang
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
Abstract:
To trace the copyright of deep neural networks, an owner can embed its identity information into its model as a watermark. The capacity of the watermark quantifies the maximal volume of information that can be verified from the watermarked model. Current studies on capacity focus on the ownership verification accuracy under ordinary removal attacks and fail to capture the relationship between robustness and fidelity. This paper studies the capacity of deep neural network watermarks from an information theoretical perspective. We propose a new definition of deep neural network watermark capacity analogous to channel capacity, analyze its properties, and design an algorithm that yields a tight estimation of its upper bound under adversarial overwriting. We also propose a universal non-invasive method to secure the transmission of the identity message beyond capacity by multiple rounds of ownership verification. Our observations provide evidence for neural network owners and defenders that are curious about the tradeoff between the integrity of their ownership and the performance degradation of their products.



Paperid:2368
Authors:Xinke Li, Junchi Lu, Henghui Ding, Changsheng Sun, Joey Tianyi Zhou, Yeow Meng Chee
National University of Singapore, Nanyang Technological University, Nanyang Technological University, National University of Singapore, Institute of High Performance Computing (IHPC), A*STAR Centre for Frontier AI Research (CFAR), A*STAR, National University of Singapore
Abstract:
With the growth of 3D sensing technology, the deep learning system for 3D point clouds has become increasingly important, especially in applications such as autonomous vehicles where safety is a primary concern. However, there are growing concerns about the reliability of these systems when they encounter noisy point clouds, either occurring naturally or introduced with malicious intent. This paper highlights the challenges of point cloud classification posed by various forms of noise, from simple background noise to malicious adversarial/backdoor attacks that can intentionally skew model predictions. While there's an urgent need for optimized point cloud denoising, current point outlier removal approaches, an essential step for denoising, rely heavily on handcrafted strategies and are not adapted for higher-level tasks, such as classification. To address this issue, we introduce an innovative point outlier cleansing method that harnesses the power of downstream classification models. Using gradient-based attribution analysis, we define a novel concept: point risk. Drawing inspiration from tail risk minimization in finance, we recast the outlier removal process as an optimization problem, named PointCVaR. Extensive experiments show that our proposed technique not only robustly filters diverse point cloud outliers but also consistently and significantly enhances existing robust methods for point cloud classification. A notable feature of our approach is its effectiveness in defending against the latest threat of backdoor attacks in point clouds.
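A hedged sketch of gradient-attribution-based point cleansing in the spirit of the description above: score each point by the norm of the loss gradient with respect to its coordinates and drop the riskiest fraction. The CVaR-based optimization of the actual method is replaced here by a simple top-fraction filter, and the tiny classifier is purely illustrative.

```python
import torch

def point_risk_filter(model, points, label, drop_frac=0.1):
    """Score each point by the norm of the loss gradient w.r.t. its coordinates
    (a gradient-attribution 'point risk') and drop the highest-risk fraction.
    The full PointCVaR method casts removal as a CVaR optimization; a simple
    top-fraction filter is used here as a stand-in."""
    pts = points.clone().requires_grad_(True)              # (N, 3) point cloud
    logits = model(pts.unsqueeze(0))                       # (1, num_classes)
    loss = torch.nn.functional.cross_entropy(logits, label.unsqueeze(0))
    grad, = torch.autograd.grad(loss, pts)
    risk = grad.norm(dim=1)                                # per-point risk score
    keep = risk.argsort()[: int(len(pts) * (1 - drop_frac))]
    return points[keep]

# Toy usage: a max-pooling "classifier" over a random point cloud.
class TinyPointNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(3, 4)
    def forward(self, x):                                  # x: (B, N, 3)
        return self.fc(x).max(dim=1).values                # pool over points

cloud = torch.randn(1024, 3)
cleaned = point_risk_filter(TinyPointNet(), cloud, torch.tensor(2))
print(cleaned.shape)
```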



Paperid:2369
Authors:Shuang Liu, Yihan Wang, Xiao-Shan Gao
Academy of Mathematics and Systems Science, Chinese Academy of Sciences University of Chinese Academy of Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences University of Chinese Academy of Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences University of Chinese Academy of Sciences
Abstract:
Unlearnable example attacks are data poisoning attacks aiming to degrade the clean test accuracy of deep learning by adding imperceptible perturbations to the training samples, which can be formulated as a bilevel optimization problem. However, directly solving this optimization problem is intractable for deep neural networks. In this paper, we investigate unlearnable example attacks from a game-theoretic perspective, by formulating the attack as a non-zero-sum Stackelberg game. First, the existence of game equilibria is proved under the normal setting and the adversarial training setting. It is shown that the game equilibrium gives the most powerful poison attack in that the victim has the lowest test accuracy among all networks within the same hypothesis space when certain loss functions are used. Second, we propose a novel attack method, called the Game Unlearnable Example (GUE), which has three main ingredients. (1) The poisons are obtained by directly solving the equilibrium of the Stackelberg game with a first-order algorithm. (2) We employ an autoencoder-like generative network model as the poison attacker. (3) A novel payoff function is introduced to evaluate the performance of the poison. Comprehensive experiments demonstrate that GUE can effectively poison the model in various scenarios. Furthermore, the GUE still works by using a relatively small percentage of the training data to train the generator, and the poison generator can generalize to unseen data well. Our implementation code can be found at https://github.com/hong-xian/gue.



Paperid:2370
Authors:Tao Liu, Yuhang Zhang, Zhu Feng, Zhiqin Yang, Chen Xu, Dapeng Man, Wu Yang
College of Computer Science and Technology, Harbin Engineering University, China, College of Computer Science and Technology, Harbin Engineering University, China, College of Computer Science and Technology, Harbin Engineering University, China, Southampton Ocean Engineering Joint Institute, Harbin Engineering University, China, College of Computer Science and Technology, Harbin Engineering University, China, College of Computer Science and Technology, Harbin Engineering University, China, College of Computer Science and Technology, Harbin Engineering University, China
Abstract:
Backdoors in federated learning are diluted by subsequent benign updates. This is reflected in the significant reduction of the attack success rate as iterations increase, ultimately causing the attack to fail. We use a new metric to quantify the degree of this weakened backdoor effect, called attack persistence. Given that research to improve this performance has not been widely noted, we propose a Full Combination Backdoor Attack (FCBA) method. It aggregates more combined trigger information for a more complete backdoor pattern in the global model. The trained backdoored global model is more resilient to benign updates, leading to a higher attack success rate on the test set. We test on three datasets and evaluate with two models across various settings. FCBA's persistence outperforms SOTA federated learning backdoor attacks. On GTSRB, 120 rounds post-attack, our attack success rate rose by over 50% from the baseline. The core code of our method is available at https://github.com/PhD-TaoLiu/FCBA.
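A minimal sketch of the "full combination" idea: enumerate every non-empty subset of local trigger patches and stamp each subset onto poisoned copies of the clean data. Patch locations, sizes, and the target label below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from itertools import combinations

def full_combination_poison(images, target_label, patches, value=1.0):
    """For every non-empty subset of local trigger patches, stamp that subset
    onto copies of the clean images and relabel them with the attacker's
    target. Patch locations/sizes and the target label are illustrative."""
    poisoned_x, poisoned_y = [], []
    for r in range(1, len(patches) + 1):
        for subset in combinations(patches, r):
            stamped = images.copy()
            for (row, col, size) in subset:
                stamped[:, row:row + size, col:col + size] = value
            poisoned_x.append(stamped)
            poisoned_y.append(np.full(len(images), target_label))
    return np.concatenate(poisoned_x), np.concatenate(poisoned_y)

# Toy usage: 8 grayscale 32x32 images, three 4x4 trigger patches in the corners.
imgs = np.random.default_rng(0).random((8, 32, 32))
patches = [(0, 0, 4), (0, 28, 4), (28, 0, 4)]
px, py = full_combination_poison(imgs, target_label=7, patches=patches)
print(px.shape, py.shape)   # 7 non-empty subsets x 8 images = 56 poisoned samples
```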



Paperid:2371
Authors:Yuxiao Lu, Arunesh Sinha, Pradeep Varakantham
Singapore Management University, Rutgers University, Singapore Management University
Abstract:
Safety in goal-directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories, and such approaches have demonstrated good performance primarily in short-horizon tasks. In this paper, we are specifically interested in the problem of solving temporally extended decision-making problems, such as robots cleaning different areas in a house while avoiding slippery and unsafe areas (e.g., stairs) and retaining enough charge to move to a charging dock, in the presence of complex safety constraints. Our key contribution is a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper-level constrained search agent (which computes a reward-maximizing policy from a given start to a far-away goal state while satisfying cost constraints) with a low-level goal-conditioned RL agent (which estimates cost and reward values to move between nearby states). A major advantage of CoSHRL is that it can handle constraints on the cost value distribution (e.g., on Conditional Value at Risk, CVaR) and can adjust to flexible constraint thresholds without retraining. We perform extensive experiments with different types of safety constraints to demonstrate the utility of our approach over leading approaches in constrained and hierarchical RL.



Paperid:2372
Authors:Dwarikanath Mahapatra, Behzad Bozorgtabar, Zongyuan Ge, Mauricio Reyes, Jean-Philippe Thiran
Inception Institute of Artificial Intelligence, (EPFL) École Polytechnique Fédérale de Lausanne, Monash University, University of Bern, (EPFL) École Polytechnique Fédérale de Lausanne
Abstract:
Informative sample selection in active learning (AL) helps a machine learning system attain optimum performance with minimum labeled samples, thus improving human-in-the-loop computer-aided diagnosis systems with limited labeled data. Data augmentation is highly effective for enlarging datasets with less labeled data. Combining informative sample selection and data augmentation should leverage their respective advantages and improve performance of AL systems. We propose a novel approach to combine informative sample selection and data augmentation for multi-label active learning. Conventional informative sample selection approaches have mostly focused on the single-label case which do not perform optimally in the multi-label setting. We improve upon state-of-the-art multi-label active learning techniques by representing disease labels as graph nodes, use graph attention transformers (GAT) to learn more effective inter-label relationships and identify most informative samples. We generate transformations of these informative samples which are also informative. Experiments on public chest X-ray datasets show improved results over state-of-the-art multi-label AL techniques in terms of classification performance, learning rates, and robustness. We also perform qualitative analysis to determine the realism of generated images.



Paperid:2373
Authors:Luca Marzari, Davide Corsi, Enrico Marchesini, Alessandro Farinelli, Ferdinando Cicalese
University of Verona, University of Verona, Massachusetts Institute of Technology, University of Verona, University of Verona
Abstract:
Identifying safe areas is a key point to guarantee trust for systems that are based on Deep Neural Networks (DNNs). To this end, we introduce the AllDNN-Verification problem: given a safety property and a DNN, enumerate the set of all the regions of the property input domain which are safe, i.e., where the property does hold. Due to the #P-hardness of the problem, we propose an efficient approximation method called ε-ProVe. Our approach exploits a controllable underestimation of the output reachable sets obtained via statistical prediction of tolerance limits, and can provide a tight —with provable probabilistic guarantees— lower estimate of the safe areas. Our empirical evaluation on different standard benchmarks shows the scalability and effectiveness of our method, offering valuable insights for this new type of verification of DNNs.



Paperid:2374
Authors:Shuyu Miao, Jian Liu, Lin Zheng, Hong Jin
Ant Group, Ant Group, Ant Group, Ant Group
Abstract:
Artificial Intelligence (AI) models have become an integral part of modern society, significantly improving human lives. However, ensuring the reliability and safety of these models is of paramount importance. One critical aspect is the continuous monitoring and verification of model performance to prevent any potential risks. Real-time online evaluation of AI models is necessary to maintain their effectiveness and mitigate any harm caused by performance degradation. The traditional approach to model evaluation involves supervised methods that rely on manual labeling to compare results with model predictions. Unfortunately, this method is not suitable for online model monitoring due to its inherent lag and high cost. While there have been attempts to explore free-label model evaluation, these approaches often consider only the global features of the entire dataset. Additionally, they can only perform model evaluation based on a single dimension of model confidence or features. In this paper, we propose a novel approach called Divide-and-Aggregate Learning (DAL) for unsupervised model evaluation. Our method addresses the limitations of previous approaches by dividing the output of the model into buckets, capturing local information of the distribution. We then aggregate this local information to obtain global information and further represent the relationship between the distribution and model performance. Importantly, our method can simultaneously handle the confidence distribution and feature distribution of the model output. Extensive experiments have been conducted to demonstrate the effectiveness of our DAL model. The results show that our approach outperforms previous methods on four widely used datasets. We will make our source code publicly available.
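A hedged sketch of the divide-and-aggregate idea using only the confidence distribution: divide confidences into buckets (local information), aggregate the normalized bucket masses into a feature vector (global information), and fit a simple regressor from those features to held-out accuracy. The bucketing scheme and regressor are assumptions; the paper additionally uses feature distributions.

```python
import numpy as np

def bucket_features(confidences, n_buckets=10):
    """Divide confidence scores into buckets and aggregate the normalized
    bucket masses into one feature vector describing the distribution."""
    hist, _ = np.histogram(confidences, bins=n_buckets, range=(0.0, 1.0))
    return hist / len(confidences)

# Fit a linear predictor of accuracy from bucket features of labeled sets,
# then apply it to an unlabeled deployment set (all numbers are synthetic).
rng = np.random.default_rng(0)
train_feats = np.stack([bucket_features(rng.beta(8, 2, 1000)) for _ in range(20)])
train_acc = rng.uniform(0.7, 0.95, 20)                        # stand-in accuracies
coef, *_ = np.linalg.lstsq(np.c_[train_feats, np.ones(20)], train_acc, rcond=None)
deploy_feats = bucket_features(rng.beta(5, 3, 1000))
print(float(np.r_[deploy_feats, 1.0] @ coef))                 # estimated accuracy
```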



Paperid:2375
Authors:Abhijit Mishra, Mingda Li, Soham Deo
University of Texas at Austin, University of Texas at Austin, University of Texas at Austin
Abstract:
This paper addresses the privacy and security concerns associated with deep neural language models, which serve as crucial components in various modern AI-based applications. These models are often used after being pre-trained and fine-tuned for specific tasks, with deployment on servers accessed through the internet. However, this introduces two fundamental risks: (a) the transmission of user inputs to the server via the network gives rise to interception vulnerabilities, and (b) privacy concerns emerge as organizations that deploy such models store user data with restricted context. To address this, we propose a novel method to adapt and fine-tune transformer-based language models on passkey-encrypted user-specific text. The original pre-trained language model first undergoes a quick adaptation (without any further pre-training) with a series of irreversible transformations applied to the tokenizer and token embeddings. This enables the model to perform inference on encrypted inputs while preventing reverse engineering of text from model parameters and intermediate outputs. After adaptation, models are fine-tuned on encrypted versions of existing training datasets. Experimental evaluation employing adapted versions of renowned models (e.g., BERT, RoBERTa) across established benchmark English and multilingual datasets for text classification and sequence labeling shows that encrypted models achieve performance parity with their original counterparts. This serves to safeguard performance, privacy, and security cohesively.
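To make the adaptation idea concrete, the sketch below shows one possible passkey-derived transformation: a deterministic permutation of the vocabulary applied consistently to the token embeddings and to the token IDs of encrypted inputs. This is only an illustrative stand-in for the series of irreversible transformations the abstract refers to, not the paper's actual procedure.

```python
import hashlib
import numpy as np

def passkey_permutation(passkey, vocab_size):
    """Derive a deterministic vocabulary permutation from a passkey."""
    seed = int.from_bytes(hashlib.sha256(passkey.encode()).digest()[:8], "big")
    return np.random.default_rng(seed).permutation(vocab_size)

def adapt_model(token_embeddings, passkey):
    """Permute the embedding rows with a passkey-derived permutation so the
    model only makes sense for inputs whose token IDs were permuted the same
    way. One illustrative transformation among the 'series' in the abstract."""
    perm = passkey_permutation(passkey, len(token_embeddings))
    return token_embeddings[perm], perm

def encrypt_token_ids(token_ids, perm):
    inverse = np.argsort(perm)                  # maps original id -> new id
    return inverse[np.asarray(token_ids)]

# Toy usage: a 10-token vocabulary with 4-dimensional embeddings.
emb = np.arange(40, dtype=float).reshape(10, 4)
adapted_emb, perm = adapt_model(emb, passkey="correct horse battery staple")
ids = [3, 1, 4, 1, 5]
enc_ids = encrypt_token_ids(ids, perm)
# The adapted model looks up encrypted IDs yet recovers the original vectors.
assert np.allclose(adapted_emb[enc_ids], emb[ids])
print(enc_ids)
```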



Paperid:2376
Authors:Rohan Mitta, Hosein Hasanbeig, Jun Wang, Daniel Kroening, Yiannis Kantaros, Alessandro Abate
University of Oxford, Oxford, UK, Microsoft Research, Washington University in St. Louis, Amazon, Washington University in St. Louis, University of Oxford, Oxford, UK
Abstract:
This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that the safety constraint violations are bounded at any point during learning. As enforcing safety during training might severely limit the agent’s exploration, we propose here a new architecture that handles the tradeoff between efficient progress and safety during exploration. As the exploration progresses, we update, via Bayesian inference, Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the environment dynamics. We then propose a way to approximate moments of belief about the risk associated with the action selection policy. We demonstrate that this approach can be easily interleaved with RL, and we present experimental results to showcase the performance of the overall architecture.
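For reference, the Dirichlet-Categorical belief update mentioned above follows the standard conjugate form; the sketch below shows only that update and the posterior moments it exposes (the risk approximation itself is omitted, and the class layout is an assumption of this sketch):

```python
import numpy as np

class DirichletTransitionModel:
    """Dirichlet-Categorical posterior over P(s' | s, a), updated from observed transitions."""

    def __init__(self, n_states, n_actions, prior=1.0):
        # alpha[s, a, s'] are the Dirichlet concentration parameters.
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        # Conjugate update: observing (s, a, s') increments the matching count.
        self.alpha[s, a, s_next] += 1.0

    def mean(self, s, a):
        a_vec = self.alpha[s, a]
        return a_vec / a_vec.sum()          # posterior mean of the transition distribution

    def variance(self, s, a):
        a_vec = self.alpha[s, a]
        a0 = a_vec.sum()
        return a_vec * (a0 - a_vec) / (a0 ** 2 * (a0 + 1.0))
```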



Paperid:2377
Authors:Saemi Moon, Seunghyuk Cho, Dongwoo Kim
POSTECH, CSE, POSTECH, GSAI, POSTECH, CSE POSTECH, GSAI
Abstract:
We tackle the problem of feature unlearning from a pre-trained image generative model: GANs and VAEs. Unlike a common unlearning task where an unlearning target is a subset of the training set, we aim to unlearn a specific feature, such as hairstyle from facial images, from the pre-trained generative models. As the target feature is only present in a local region of an image, unlearning the entire image from the pre-trained model may result in losing other details in the remaining region of the image. To specify which features to unlearn, we collect randomly generated images that contain the target features. We then identify a latent representation corresponding to the target feature and use the representation to fine-tune the pre-trained model. Through experiments on MNIST, CelebA, and FFHQ datasets, we show that target features are successfully removed while keeping the fidelity of the original models. Further experiments with an adversarial attack show that the unlearned model is more robust under the presence of malicious parties.



Paperid:2378
Authors:Ronghui Mu, Leandro Soriano Marcolino, Yanghao Zhang, Tianle Zhang, Xiaowei Huang, Wenjie Ruan
University of Liverpool, Lancaster University, University of Liverpool, University of Liverpool, University of Liverpool, University of Liverpool
Abstract:
Reinforcement Learning (RL) has achieved remarkable success in safety-critical areas, but it can be weakened by adversarial attacks. Recent studies have introduced ``smoothed policies'' to enhance its robustness. Yet, it is still challenging to establish a provable guarantee to certify the bound of its total reward. Prior methods relied primarily on computing bounds using Lipschitz continuity or calculating the probability of cumulative reward being above specific thresholds. However, these techniques are only suited for continuous perturbations on the RL agent's observations and are restricted to perturbations bounded by the l2-norm. To address these limitations, this paper proposes a general black-box certification method, called ReCePS, which is capable of directly certifying the cumulative reward of the smoothed policy under various lp-norm bounded perturbations. Furthermore, we extend our methodology to certify perturbations on action spaces. Our approach leverages f-divergence to measure the distinction between the original distribution and the perturbed distribution, subsequently determining the certification bound by solving a convex optimisation problem. We provide a comprehensive theoretical analysis and run experiments in multiple environments. Our results show that our method not only improves the tightness of the certified lower bound on the mean cumulative reward but also demonstrates better efficiency than state-of-the-art methods.



Paperid:2379
Authors:Xin Mu, Yu Wang, Zhengan Huang, Junzuo Lai, Yehong Zhang, Hui Wang, Yue Yu
Peng Cheng Laboratory, Peng Cheng Laboratory, Peng Cheng Laboratory, Jinan University, Peng Cheng Laboratory, Peng Cheng Laboratory, Peng Cheng Laboratory
Abstract:
In the rapidly growing digital economy, protecting intellectual property (IP) associated with digital products has become increasingly important. Within this context, machine learning (ML) models, being highly valuable digital assets, have gained significant attention for IP protection. This paper introduces a practical encryption-based framework called EncryIP, which seamlessly integrates a public-key encryption scheme into the model learning process. This approach enables the protected model to generate randomized and confused labels, ensuring that only individuals with accurate secret keys, signifying authorized users, can decrypt and reveal authentic labels. Importantly, the proposed framework not only facilitates distributing the protected model to multiple authorized users without requiring repetitive training of the original ML model with IP protection methods, but also maintains the model's performance without compromising its accuracy. Compared to existing methods like watermark-based, trigger-based, and passport-based approaches, EncryIP demonstrates superior effectiveness in both training protected models and efficiently detecting the unauthorized spread of ML models.



Paperid:2380
Authors:Alireza Nadali, Vishnu Murali, Ashutosh Trivedi, Majid Zamani
University of Colorado Boulder, University of Colorado Boulder, University of Colorado Boulder, University of Colorado Boulder
Abstract:
Notions of transition invariants and closure certificates have seen recent use in the formal verification of controlled dynamical systems against \omega-regular properties. Unfortunately, existing approaches face limitations in two directions. First, they require a closed-form mathematical expression representing the model of the system. Such an expression may be difficult to find, too complex to be of any use, or unavailable due to security or privacy constraints. Second, finding such invariants typically relies on optimization techniques such as sum-of-squares (SOS) or satisfiability modulo theory (SMT) solvers. This restricts the classes of systems that can be formally verified. To address these drawbacks, we introduce a notion of neural closure certificates. We present a data-driven algorithm that trains a neural network to represent a closure certificate. Our approach is formally correct under some mild assumptions, i.e., one is able to formally show that the unknown system satisfies the \omega-regular property of interest if a neural closure certificate can be computed. Finally, we demonstrate the efficacy of our approach with relevant case studies.



Paperid:2381
Authors:Manish Nagireddy, Lamogha Chiazor, Moninder Singh, Ioana Baldini
IBM Research, IBM Research, IBM Research, IBM Research
Abstract:
Current datasets for unwanted social bias auditing are limited to studying protected demographic features such as race and gender. In this work, we introduce a comprehensive benchmark that is meant to capture the amplification of social bias, via stigmas, in generative language models. Taking inspiration from social science research, we start with a documented list of 93 US-centric stigmas and curate a question-answering (QA) dataset which involves simple social situations. Our benchmark, SocialStigmaQA, contains roughly 10K prompts, with a variety of prompt styles, carefully constructed to systematically test for both social bias and model robustness. We present results for SocialStigmaQA with two open source generative language models and we find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles. We demonstrate that the deliberate design of the templates in our benchmark (e.g., adding biasing text to the prompt or using different verbs that change the answer that indicates bias) impacts the model tendencies to generate socially biased output. Additionally, through manual evaluation, we discover problematic patterns in the generated chain-of-thought output that range from subtle bias to lack of reasoning. Warning: This paper contains examples of text which are toxic, biased, and potentially harmful.



Paperid:2382
Authors:Dexter Neo, Stefan Winkler, Tsuhan Chen
National University of Singapore, National University of Singapore, National University of Singapore
Abstract:
We present a new loss function that addresses the out-of-distribution (OOD) network calibration problem. While many objective functions have been proposed to effectively calibrate models in-distribution, our findings show that they do not always fare well OOD. Based on the Principle of Maximum Entropy, we incorporate helpful statistical constraints observed during training, delivering better model calibration without sacrificing accuracy. We provide theoretical analysis and show empirically that our method works well in practice, achieving state-of-the-art calibration on both synthetic and real-world benchmarks. Our code is available at https://github.com/dexterdley/MaxEnt-Loss.
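Purely as an illustration of an entropy-based calibration penalty (this is a generic sketch, not the released MaxEnt-Loss; the weighting and form of the regularizer are assumptions), a maximum-entropy-flavoured objective can be written as cross-entropy minus a weighted entropy bonus on the predictive distribution:

```python
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits, targets, lam=0.1):
    """Cross-entropy fit to the labels minus a weighted entropy bonus on the predictions."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1).mean()
    return ce - lam * entropy  # minimizing this rewards higher predictive entropy
```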



Paperid:2383
Authors:Minheng Ni, Chenfei Wu, Xiaodong Wang, Shengming Yin, Lijuan Wang, Zicheng Liu, Nan Duan
Microsoft Research Asia, Microsoft Research Asia, Microsoft Research Asia, Microsoft Research Asia, Microsoft Azure AI, Microsoft Azure AI, Microsoft Research Asia
Abstract:
Avoiding synthesizing specific visual concepts is an essential challenge in responsible visual synthesis. However, the visual concept that needs to be avoided for responsible visual synthesis tends to be diverse, depending on the region, context, and usage scenarios. In this work, we formalize a new task, Open-vocabulary Responsible Visual Synthesis (ORES), where the synthesis model is able to avoid forbidden visual concepts while allowing users to input any desired content. To address this problem, we present a Two-stage Intervention (TIN) framework. By introducing 1) rewriting with learnable instruction through a large-scale language model (LLM) and 2) synthesizing with prompt intervention on a diffusion synthesis model, it can effectively synthesize images that avoid forbidden concepts while following the user's query as much as possible. To evaluate on ORES, we provide a publicly available dataset, baseline models, and benchmark. Experimental results demonstrate the effectiveness of our method in reducing risks of image generation. Our work highlights the potential of LLMs in responsible visual synthesis. Our code and dataset are publicly available at https://github.com/kodenii/ORES.



Paperid:2384
Authors:Thomas Norrenbrock, Marco Rudolph, Bodo Rosenhahn
Leibniz University Hannover Institute for Information Processing (tnt) L3S, Leibniz University Hannover Institute for Information Processing (tnt) L3S, Leibniz University Hannover Institute for Information Processing (tnt) L3S
Abstract:
Explanations in Computer Vision are often desired, but most Deep Neural Networks can only provide saliency maps with questionable faithfulness. Self-Explaining Neural Networks (SENN) extract interpretable concepts with fidelity, diversity, and grounding to combine them linearly for decision-making. While they can explain what was recognized, initial realizations lack accuracy and general applicability. We propose the Quantized-Self-Explaining Neural Network “Q-SENN”. Q-SENN satisfies or exceeds the desiderata of SENN while being applicable to more complex datasets and maintaining most or all of the accuracy of an uninterpretable baseline model, outperforming previous work in all considered metrics. Q-SENN describes the relationship between every class and feature as either positive, negative or neutral instead of an arbitrary number of possible relations, enforcing more binary human-friendly features. Since every class is assigned just 5 interpretable features on average, Q-SENN shows convincing local and global interpretability. Additionally, we propose a feature alignment method, capable of aligning learned features with human language-based concepts without additional supervision. Thus, what is learned can be more easily verbalized. The code is published: https://github.com/ThomasNorr/Q-SENN



Paperid:2385
Authors:Genki Osada, Tsubasa Takahashi, Takashi Nishide
LINE Corporation, LINE Corporation, University of Tsukuba
Abstract:
Out-of-distribution (OOD) detection is crucial to safety-critical machine learning applications and has been extensively studied. While recent studies have predominantly focused on classifier-based methods, research on deep generative model (DGM)-based methods has lagged behind. This disparity may be attributed to a perplexing phenomenon: DGMs often assign higher likelihoods to unknown OOD inputs than to their known training data. This paper focuses on explaining the underlying mechanism of this phenomenon. We propose a hypothesis that less complex images concentrate in high-density regions in the latent space, resulting in a higher likelihood assignment in the Normalizing Flow (NF). We experimentally demonstrate its validity for five NF architectures, concluding that their likelihood is untrustworthy. Additionally, we show that this problem can be alleviated by treating image complexity as an independent variable. Finally, we provide evidence of the potential applicability of our hypothesis in another DGM, PixelCNN++.
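One simple way to treat image complexity as an explicit variable, offered here only as an illustration rather than the paper's procedure, is to estimate complexity from the compressed size of an image and combine it with the flow log-likelihood into an adjusted OOD score; the additive adjustment and the PNG proxy are assumptions of this sketch:

```python
import io
from PIL import Image

def complexity_bits(image_array):
    """PNG-compressed size in bits as a complexity proxy (expects a uint8 HxWx3 array)."""
    buf = io.BytesIO()
    Image.fromarray(image_array).save(buf, format="PNG")
    return 8 * buf.getbuffer().nbytes

def adjusted_ood_score(log_likelihood, image_array):
    # Higher score means more OOD: raw -log p(x) over-penalizes complex inliers,
    # so subtract the complexity estimate to compensate.
    return -log_likelihood - complexity_bits(image_array)
```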



Paperid:2386
Authors:Chao Pan, Qing Li, Xin Yao
Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology, Shenzhen 518055, China Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China The Hong Kong Polytechnic University, Hong Kong, China, The Hong Kong Polytechnic University, Hong Kong, China, Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology, Shenzhen 518055, China Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
Abstract:
Traditional adversarial training, while effective at improving machine learning model robustness, is computationally intensive. Fast Adversarial Training (FAT) addresses this by using a single-step attack to generate adversarial examples more efficiently. Nonetheless, FAT is susceptible to a phenomenon known as catastrophic overfitting, wherein the model's adversarial robustness abruptly collapses to zero during the training phase. To address this challenge, recent studies have suggested adopting adversarial initialization with Fast Gradient Sign Method Adversarial Training (FGSM-AT), which recycles adversarial perturbations from prior epochs by computing gradient momentum. However, our research has uncovered a flaw in this approach. Given that data augmentation is employed during the training phase, the samples in each epoch are not identical. Consequently, the method essentially yields not the adversarial perturbation of a singular sample, but rather the Universal Adversarial Perturbation (UAP) of a sample and its data augmentation. This insight has led us to explore the potential of using UAPs for adversarial initialization within the context of FGSM-AT. We have devised various strategies for adversarial initialization utilizing UAPs, including single, class-based, and feature-based UAPs. Experiments conducted on three distinct datasets demonstrate that our method achieves an improved trade-off among robustness, computational cost, and memory footprint. Code is available at https://github.com/fzjcdt/fgsm-uap.
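A minimal PyTorch-style sketch of using a single shared universal perturbation as the adversarial initialization for one epoch of FGSM training is given below; the CIFAR-like tensor shapes, step sizes, and the exact UAP update rule are assumptions of this sketch, and the class-based and feature-based variants mentioned in the abstract are not shown:

```python
import torch
import torch.nn.functional as F

def fgsm_at_with_uap(model, loader, optimizer, eps=8 / 255, alpha=10 / 255,
                     uap_lr=0.01, device="cuda"):
    """One training epoch of FGSM adversarial training initialized from a shared UAP."""
    uap = torch.zeros(1, 3, 32, 32, device=device)  # shared adversarial initialization
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # 1) initialize the perturbation from the UAP and take one FGSM step
        delta = uap.expand_as(x).clone().requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x + delta), y), delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        # 2) update the model on the resulting adversarial examples (inputs assumed in [0, 1])
        optimizer.zero_grad()
        F.cross_entropy(model((x + delta).clamp(0.0, 1.0)), y).backward()
        optimizer.step()
        # 3) fold the batch's averaged signed gradient back into the UAP
        uap = (uap + uap_lr * grad.mean(dim=0, keepdim=True).sign()).clamp(-eps, eps)
    return uap
```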



Paperid:2387
Authors:Mateo Perez, Fabio Somenzi, Ashutosh Trivedi
University of Colorado Boulder, University of Colorado Boulder, University of Colorado Boulder
Abstract:
Linear temporal logic (LTL) and omega-regular objectives---a superset of LTL---have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory.



Paperid:2388
Authors:Mario Alfonso Prado-Romero, Bardh Prenkaj, Giovanni Stilo
Gran Sasso Science Institute, Sapienza University of Rome, University of L' Aquila
Abstract:
Counterfactual Explanation (CE) techniques have garnered attention as a means to provide insights to the users engaging with AI systems. While extensively researched in domains such as medical imaging and autonomous vehicles, Graph Counterfactual Explanation (GCE) methods have been comparatively underexplored. GCEs generate a new graph similar to the original one, with a different outcome grounded on the underlying predictive model. Among these GCE techniques, those rooted in generative mechanisms have received relatively limited investigation despite demonstrating impressive accomplishments in other domains, such as artistic styles and natural language modelling. The preference for generative explainers stems from their capacity to generate counterfactual instances during inference, leveraging autonomously acquired perturbations of the input graph. Motivated by the rationales above, our study introduces RSGG-CE, a novel Robust Stochastic Graph Generator for Counterfactual Explanations able to produce counterfactual examples from the learned latent space considering a partially ordered generation sequence. Furthermore, we undertake quantitative and qualitative analyses to compare RSGG-CE's performance against SoA generative explainers, highlighting its increased ability to engender plausible counterfactual candidates.



Paperid:2389
Authors:Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal
Princeton University, Princeton University, Princeton University, Princeton University, Princeton University, Princeton University
Abstract:
Warning: this paper contains data, prompts, and model outputs that are offensive in nature. Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a `few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.



Paperid:2390
Authors:Omer Reingold, Judy Hanwen Shen, Aditi Talati
Stanford University, Stanford University, Stanford University
Abstract:
While modern explanation methods have been shown to be inconsistent and contradictory, the explainability of black-box models nevertheless remains desirable. When the role of explanations extends from understanding models to aiding decision making, the semantics of explanations is not always fully understood – to what extent do explanations ``explain'' a decision and to what extent do they merely advocate for a decision? Can we help humans gain insights from explanations accompanying correct predictions and not over-rely on incorrect predictions advocated for by explanations? With this perspective in mind, we introduce the notion of dissenting explanations: conflicting predictions with accompanying explanations. We first explore the advantage of dissenting explanations in the setting of model multiplicity, where multiple models with similar performance may have different predictions. Through a human study on the task of identifying deceptive reviews, we demonstrate that dissenting explanations reduce overreliance on model predictions, without reducing overall accuracy. Motivated by the utility of dissenting explanations, we present both global and local methods for their generation.



Paperid:2391
Authors:Yao Rong, Peizhu Qian, Vaibhav Unhelkar, Enkelejda Kasneci
Technical University of Munich, Rice University, Rice University, Technical University of Munich
Abstract:
Effectively explaining decisions of black-box machine learning models is critical to responsible deployment of AI systems that rely on them. Recognizing their importance, the field of explainable AI (XAI) provides several techniques to generate these explanations. Yet, there is relatively little emphasis on the user (the explainee) in this growing body of work and most XAI techniques generate ``one-size-fits-all'' explanations. To bridge this gap and achieve a step closer towards human-centered XAI, we present I-CEE, a framework that provides Image Classification Explanations tailored to User Expertise. Informed by existing work, I-CEE explains the decisions of image classification models by providing the user with an informative subset of training data (i.e., example images), corresponding local explanations, and model decisions. However, unlike prior work, I-CEE models the informativeness of the example images to depend on user expertise, resulting in different examples for different users. We posit that by tailoring the example set to user expertise, I-CEE can better facilitate users' understanding and simulatability of the model. To evaluate our approach, we conduct detailed experiments in both simulation and with human participants (N = 100) on multiple datasets. Experiments with simulated users show that I-CEE improves users' ability to accurately predict the model's decisions (simulatability) compared to baselines, providing promising preliminary results. Experiments with human participants demonstrate that our method significantly improves user simulatability accuracy, highlighting the importance of human-centered XAI.



Paperid:2392
Authors:Lucas Rosenblatt, Julia Stoyanovich, Christopher Musco
New York University, New York University, New York University
Abstract:
Differentially private (DP) mechanisms have been deployed in a variety of high-impact social settings (perhaps most notably by the U.S. Census). Since all DP mechanisms involve adding noise to results of statistical queries, they are expected to impact our ability to accurately analyze and learn from data, in effect trading off privacy with utility. Alarmingly, the impact of DP on utility can vary significantly among different sub-populations. A simple way to reduce this disparity is with stratification: first compute an independent private estimate for each group in the data set (which may be the intersection of several protected classes), and then, to compute estimates of global statistics, appropriately recombine these group estimates. Our main observation is that naive stratification often yields high-accuracy estimates of population-level statistics, without the need for additional privacy budget. We support this observation theoretically and empirically. Our theoretical results center on the private mean estimation problem, while our empirical results center on extensive experiments on private data synthesis to demonstrate the effectiveness of stratification on a variety of private mechanisms. Overall, we argue that this straightforward approach provides a strong baseline against which future work on reducing utility disparities of DP mechanisms should be compared.
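A minimal sketch of the stratified estimation idea, assuming data bounded in [0, 1] and the Laplace mechanism (and, for simplicity, spending extra budget on noisy group counts, which the paper's analysis does not require), might look as follows:

```python
import numpy as np

def dp_mean(x, epsilon, lo=0.0, hi=1.0, rng=None):
    """Laplace-mechanism mean of values assumed to lie in [lo, hi]."""
    rng = rng or np.random.default_rng()
    sensitivity = (hi - lo) / len(x)
    return float(np.mean(x) + rng.laplace(0.0, sensitivity / epsilon))

def stratified_dp_mean(groups, epsilon, rng=None):
    """groups: list of 1-D arrays, one per (possibly intersectional) group."""
    rng = rng or np.random.default_rng()
    means = [dp_mean(g, epsilon, rng=rng) for g in groups]          # per-group DP means
    sizes = [max(len(g) + rng.laplace(0.0, 1.0 / epsilon), 1.0) for g in groups]
    total = sum(sizes)
    return sum(m * s for m, s in zip(means, sizes)) / total         # size-weighted recombination
```

Because the groups partition the data, the per-group estimates operate on disjoint records, which is what makes the naive recombination attractive from a privacy-budget standpoint.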



Paperid:2393
Authors:Mikołaj Sacha, Bartosz Jura, Dawid Rymarczyk, Łukasz Struski, Jacek Tabor, Bartosz Zieliński
Faculty of Mathematics and Computer Science, Jagiellonian University Doctoral School of Exact and Natural Sciences, Jagiellonian University, Łukasiewicz Research Network – Poznań Institute of Technology Faculty of Management and Social Communication, Jagiellonian University, Faculty of Mathematics and Computer Science, Jagiellonian University Doctoral School of Exact and Natural Sciences, Jagiellonian University Ardigen SA, Faculty of Mathematics and Computer Science, Jagiellonian University, Faculty of Mathematics and Computer Science, Jagiellonian University, Faculty of Mathematics and Computer Science, Jagiellonian University IDEAS NCBR
Abstract:
Prototypical parts-based networks are becoming increasingly popular due to their faithful self-explanations. However, their similarity maps are calculated in the penultimate network layer. Therefore, the receptive field of the prototype activation region often depends on parts of the image outside this region, which can lead to misleading interpretations. We name this undesired behavior a spatial explanation misalignment and introduce an interpretability benchmark with a set of dedicated metrics for quantifying this phenomenon. In addition, we propose a method for misalignment compensation and apply it to existing state-of-the-art models. We show the expressiveness of our benchmark and the effectiveness of the proposed compensation methodology through extensive empirical studies.



Paperid:2394
Authors:Zijing Shi, Meng Fang, Ling Chen, Yali Du, Jun Wang
University of Technology Sydney, University of Liverpool, University of Technology Sydney, King's College London, University College London
Abstract:
Training reinforcement learning (RL) agents to achieve desired goals while also acting morally is a challenging problem. Transformer-based language models (LMs) have shown some promise in moral awareness, but their use in different contexts is problematic because of the complexity and implicitness of human morality. In this paper, we build on text-based games, which are challenging environments for current RL agents, and propose the HuMAL (Human-guided Morality Awareness Learning) algorithm, which adaptively learns personal values through human-agent collaboration with minimal manual feedback. We evaluate HuMAL on the Jiminy Cricket benchmark, a set of text-based games with various scenes and dense morality annotations, using both simulated and actual human feedback. The experimental results demonstrate that with a small amount of human feedback, HuMAL can improve task performance and reduce immoral behavior in a variety of games and is adaptable to different personal values.



Paperid:2395
Authors:Stanley Simoes, Deepak P, Muiris MacCarthaigh
Queen's University Belfast, Queen's University Belfast, Queen's University Belfast
Abstract:
There has been much recent interest in developing fair clustering algorithms that seek to do justice to the representation of groups defined along sensitive attributes such as race and sex. Within the centroid clustering paradigm, these algorithms are seen to generate clusterings where different groups are disadvantaged within different clusters with respect to their representativity, i.e., distance to centroid. In view of this deficiency, we propose a novel notion of cluster-level centroid fairness that targets the representativity unfairness borne by groups within each cluster, along with a metric to quantify the same. Towards operationalising this notion, we draw on ideas from political philosophy aligned with consideration for the worst-off group to develop Fair-Centroid, a new clustering method that focusses on enhancing the representativity of the worst-off group within each cluster. Our method uses an iterative optimisation paradigm wherein an initial cluster assignment is refined by reassigning objects to clusters such that the worst-off group in each cluster is benefitted. We compare our notion with a related fairness notion and show through extensive empirical evaluations on real-world datasets that our method significantly enhances cluster-level centroid fairness at low impact on cluster coherence.



Paperid:2396
Authors:Hwanjun Song, Minseok Kim, Jae-Gil Lee
KAIST, Amazon, KAIST
Abstract:
Multi-label classification poses challenges due to imbalanced and noisy labels in training data. In this paper, we propose a unified data augmentation method, named BalanceMix, to address these challenges. Our approach includes two samplers for imbalanced labels, generating minority-augmented instances with high diversity. It also refines multi-labels at the label-wise granularity, categorizing noisy labels as clean, re-labeled, or ambiguous for robust optimization. Extensive experiments on three benchmark datasets demonstrate that BalanceMix outperforms existing state-of-the-art methods. We release the code at https://github.com/DISL-Lab/BalanceMix.



Paperid:2397
Authors:Yuwei Sun, Hideya Ochiai
The University of Tokyo RIKEN AIP, The University of Tokyo
Abstract:
Visual Question Answering (VQA) based on multimodal data facilitates real-life applications such as home robots and medical diagnoses. One significant challenge is to devise a robust decentralized learning framework for various client models where centralized data collection is avoided due to confidentiality concerns. This work aims to tackle privacy-preserving VQA by decoupling a multi-modal model into representation modules and a contrastive module, leveraging inter-module gradient sharing and inter-client weight sharing. To this end, we propose Bidirectional Contrastive Split Learning (BiCSL) to train a global multi-modal model on the entire data distribution of decentralized clients. We employ the contrastive loss that enables a more efficient self-supervised learning of decentralized modules. Comprehensive experiments are conducted on the VQA-v2 dataset based on five SOTA VQA models, demonstrating the effectiveness of the proposed method. Furthermore, we inspect BiCSL's robustness against a dual-key backdoor attack on VQA. Consequently, BiCSL shows significantly enhanced resilience when exposed to the multi-modal adversarial attack compared to the centralized learning method, which provides a promising approach to decentralized multi-modal learning.



Paperid:2398
Authors:Masoud Taghikhah, Nishant Kumar, Siniša Šegvić, Abouzar Eslami, Stefan Gumhold
Faculty of Computer Science, Technische Universität Dresden, Dresden, Germany, Faculty of Computer Science, Technische Universität Dresden, Dresden, Germany, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia, Translational Research Lab, Carl Zeiss Meditec AG, Munich, Germany, Faculty of Computer Science, Technische Universität Dresden, Dresden, Germany
Abstract:
Discriminative learning effectively predicts the true object class for image classification. However, it often results in false positives for outliers, posing critical concerns in applications like autonomous driving and video surveillance systems. Previous attempts to address this challenge involved training image classifiers through contrastive learning using actual outlier data or synthesizing outliers for self-supervised learning. Furthermore, unsupervised generative modeling of inliers in pixel space has shown limited success for outlier detection. In this work, we introduce a quantile-based maximum likelihood objective for learning the inlier distribution to improve the outlier separation during inference. Our approach fits a normalizing flow to pre-trained discriminative features and detects the outliers according to the evaluated log-likelihood. The experimental evaluation demonstrates the effectiveness of our method as it surpasses the performance of the state-of-the-art unsupervised methods for outlier detection. The results are also competitive compared with a recent self-supervised approach for outlier detection. Our work reduces the dependency on well-sampled negative training data, which is especially important for domains like medical diagnostics or remote sensing.
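To illustrate the scoring pipeline at a high level, the sketch below fits a simple Gaussian mixture to pre-trained features as a stand-in for the normalizing flow (an assumption made only to keep the example short) and flags test samples whose log-likelihood falls below a low quantile of the inlier scores; the component count and quantile are illustrative parameters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_inlier_density(train_features, n_components=5):
    """train_features: (n_samples, n_dims) array of pre-trained discriminative features."""
    return GaussianMixture(n_components=n_components).fit(train_features)

def make_outlier_detector(density, train_features, quantile=0.05):
    threshold = np.quantile(density.score_samples(train_features), quantile)
    def is_outlier(test_features):
        # Flag samples whose log-likelihood falls below the inlier quantile threshold.
        return density.score_samples(test_features) < threshold
    return is_outlier
```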



Paperid:2399
Authors:Zhen Tan, Tianlong Chen, Zhenyu Zhang, Huan Liu
Arizona State University, University of North Carolina at Chapel Hill, University of Texas at Austin, Arizona State University
Abstract:
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. However, the enigmatic ``black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. While past approaches, such as attention visualization, pivotal subnetwork extraction, and concept-based analyses, offer some insight, they often focus on either local or global explanations within a single dimension, occasionally falling short in providing comprehensive clarity. In response, we propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs. Our framework, termed SparseCBM, innovatively integrates sparsity to elucidate three intertwined layers of interpretation: input, subnetwork, and concept levels. In addition, the newly introduced dimension of interpretable inference-time intervention facilitates dynamic adjustments to the model during deployment. Through rigorous empirical evaluations on real-world datasets, we demonstrate that SparseCBM delivers a profound understanding of LLM behaviors, setting it apart in both interpreting and ameliorating model inaccuracies. Codes are provided in supplements.



Paperid:2400
Authors:Yun-Da Tsai, Cayon Liow, Yin Sheng Siang, Shou-De Lin
National Taiwan University, National Taiwan University, National Taiwan University, National Taiwan University
Abstract:
This paper reveals a data bias issue that can profoundly hinder the performance of machine learning models in malicious URL detection. We describe how such bias can be diagnosed using interpretable machine learning techniques and further argue that such biases naturally exist in real-world security data for training a classification model. To counteract these challenges, we propose a debiased training strategy that can be applied to most deep-learning-based models to alleviate the negative effects of the biased features. The solution is based on the technique of adversarial training to train deep neural networks that learn invariant embeddings from biased data. Through extensive experimentation, we substantiate that our innovative strategy fosters superior generalization capabilities across both CNN-based and RNN-based detection models. The findings presented in this work not only expose a latent issue in the field but also provide an actionable remedy, marking a significant step forward in the pursuit of more reliable and robust malicious URL detection.



Paperid:2401
Authors:Yuanpeng Tu, Yuxi Li, Boshen Zhang, Liang Liu, Jiangning Zhang, Yabiao Wang, Cairong Zhao
Tongji University, Tencent, Tencent, Tencent, Zhejiang University, Tencent, Tongji University
Abstract:
Robust autonomous driving requires agents to accurately identify unexpected areas (anomalies) in urban scenes. To this end, some critical issues remain open: how to design an advisable metric to measure anomalies, and how to properly generate training samples of anomaly data? Classical efforts in anomaly detection usually resort to pixel-wise uncertainty or sample synthesis, which ignores the contextual information and sometimes requires auxiliary data with fine-grained annotations. On the contrary, in this paper, we exploit the strong context-dependent nature of the segmentation task and design an energy-guided self-supervised framework for anomaly segmentation, which optimizes an anomaly head by maximizing the likelihood of self-generated anomaly pixels. For this purpose, we design two estimators to model anomaly likelihood: one is a task-agnostic binary estimator and the other depicts the likelihood as the residual of a task-oriented joint energy. Based on the proposed estimators, we devise an adaptive self-supervised training framework, which exploits the contextual reliance and estimated likelihood to refine mask annotations in anomaly areas. We conduct extensive experiments on the challenging Fishyscapes and Road Anomaly benchmarks, demonstrating that without any auxiliary data or synthetic models, our method can still achieve comparable performance to supervised competitors. Code is available at https://github.com/yuanpengtu/SLEEG.



Paperid:2402
Authors:Giovanni Varricchione, Natasha Alechina, Mehdi Dastani, Giuseppe De Giacomo, Brian Logan, Giuseppe Perelli
Utrecht University, Open University Utrecht University, Utrecht University, University of Oxford, University of Aberdeen Utrecht University, Sapienza University of Rome
Abstract:
We present Pure-Past Action Masking (PPAM), a lightweight approach to action masking for safe reinforcement learning. In PPAM, actions are disallowed (“masked”) according to specifications expressed in Pure-Past Linear Temporal Logic (PPLTL). PPAM can enforce non-Markovian constraints, i.e., constraints based on the history of the system, rather than just the current state of the (possibly hidden) MDP. The features used in the safety constraint need not be the same as those used by the learning agent, allowing a clear separation of concerns between the safety constraints and reward specifications of the (learning) agent. We prove formally that an agent trained with PPAM can learn any optimal policy that satisfies the safety constraints, and that PPAM is as expressive as shields, another approach to enforce non-Markovian constraints in RL. Finally, we provide empirical results showing how PPAM can guarantee constraint satisfaction in practice.
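A toy sketch of history-based action masking in this spirit is shown below; it assumes a classic Gym-style discrete-action environment, and the `allowed` predicate over (history, action) pairs is a hypothetical stand-in for a PPLTL monitor rather than the paper's construction:

```python
import numpy as np

class MaskedEnv:
    """Gym-style wrapper that masks actions based on the observation history."""

    def __init__(self, env, allowed):
        self.env, self.allowed = env, allowed   # allowed(history, action) -> bool
        self.history = []

    def reset(self):
        obs = self.env.reset()
        self.history = [obs]
        return obs

    def action_mask(self):
        n = self.env.action_space.n
        return np.array([self.allowed(self.history, a) for a in range(n)], dtype=bool)

    def step(self, action):
        mask = self.action_mask()
        if not mask[action] and mask.any():
            action = int(np.flatnonzero(mask)[0])   # fall back to some allowed action
        obs, reward, done, info = self.env.step(action)
        self.history.append(obs)
        return obs, reward, done, info
```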



Paperid:2403
Authors:Akifumi Wachi, Wataru Hashimoto, Kazumune Hashimoto
LINE Corporation, Osaka University, Osaka University
Abstract:
Safety is an indispensable requirement for applying reinforcement learning (RL) to real problems. Although there has been a surge of safe RL algorithms proposed in recent years, most existing work typically 1) relies on receiving numeric safety feedback; 2) does not guarantee safety during the learning process; 3) limits the problem to a priori known, deterministic transition dynamics; and/or 4) assumes the existence of a known safe policy for any states. Addressing the issues mentioned above, we thus propose Long-term Binary-feedback Safe RL (LoBiSaRL), a safe RL algorithm for constrained Markov decision processes (CMDPs) with binary safety feedback and an unknown, stochastic state transition function. LoBiSaRL optimizes a policy to maximize rewards while guaranteeing long-term safety, i.e., that the agent executes only safe state-action pairs throughout each episode with high probability. Specifically, LoBiSaRL models the binary safety function via a generalized linear model (GLM) and conservatively takes only a safe action at every time step while inferring its effect on future safety under proper assumptions. Our theoretical results show that LoBiSaRL guarantees the long-term safety constraint, with high probability. Finally, our empirical results demonstrate that our algorithm is safer than existing methods without significantly compromising performance in terms of reward.



Paperid:2404
Authors:Madeleine Waller, Odinaldo Rodrigues, Oana Cocarascu
King's College London, King's College London, King's College London
Abstract:
As algorithmic decision-making systems become more prevalent in society, ensuring the fairness of these systems is becoming increasingly important. Whilst there has been substantial research in building fair algorithmic decision-making systems, the majority of these methods require access to the training data, including personal characteristics, and are not transparent regarding which individuals are classified unfairly. In this paper, we propose a novel model-agnostic argumentation-based method to determine why an individual is classified differently in comparison to similar individuals. Our method uses a quantitative argumentation framework to represent attribute-value pairs of an individual and of those similar to them, and uses a well-known semantics to identify the attribute-value pairs in the individual contributing most to their different classification. We evaluate our method on two datasets commonly used in the fairness literature and illustrate its effectiveness in the identification of bias.



Paperid:2405
Authors:Lei Wang, Xu Chen, Zhenhua Dong, Quanyu Dai
Renmin University of China, Renmin University of China Beijing Key Laboratory of Big Data Management and Analysis Methods, Huawei Noah's Ark Lab, Huawei Noah's Ark Lab
Abstract:
Recommender systems have a significant impact on various real-world applications, shaping people's daily lives and enhancing productivity. Traditional recommender models aim to collect extensive user information to accurately estimate user preferences. However, in practical scenarios, users may not want all their behaviors to be included in the model training process. This paper introduces a novel recommendation paradigm that allows users to indicate their ``willingness'' regarding which data should contribute to model training. The models are then optimized to maximize utility, which considers the trade-off between recommendation performance and respecting user preferences. The recommendation problem is formulated as a multiplayer game, with each user acting as a player and using a selection vector to indicate their willingness to include specific interacted items in training. To efficiently solve this game, an influence function-based model is proposed to approximate recommendation performances for different actions without re-optimizing the model. Furthermore, an enhanced model leveraging multiple anchor actions for the influence function is introduced to improve performance approximation accuracy. The convergence rate of the algorithm is theoretically analyzed, and the advantages of incorporating multiple anchor actions are demonstrated. Extensive experiments on both simulated and real-world datasets validate the effectiveness of the proposed models in balancing recommendation quality and user willingness. To promote this research direction, we have released our project at https://paitesanshi.github.io/IFRQE/.



Paperid:2406
Authors:Min Wang, Hao Yang, Jincai Huang, Qing Cheng
College of Systems Engineering, National University of Defense Technology, College of Systems Engineering, National University of Defense Technology, College of Systems Engineering, National University of Defense Technology, College of Systems Engineering, National University of Defense Technology
Abstract:
Confidence calibration in Graph Neural Networks (GNNs) aims to align a model's predicted confidence with its actual accuracy. Recent studies have indicated that GNNs exhibit an under-confidence bias, which contrasts with the over-confidence bias commonly observed in deep neural networks. However, our deeper investigation into this topic reveals that not all GNNs exhibit this behavior. Upon closer examination of message passing in GNNs, we found a clear link between message aggregation and confidence levels. Specifically, GNNs with extensive message aggregation, often seen in deep architectures or when leveraging large amounts of labeled data, tend to exhibit over-confidence. This over-confidence can be attributed to factors like over-learning and over-smoothing. Conversely, GNNs with fewer layers, known for their balanced message passing and superior node representation, may exhibit under-confidence. To counter these confidence biases, we introduce the Adaptive Unified Label Smoothing (AU-LS) technique. Our experiments show that AU-LS outperforms existing methods, addressing both over- and under-confidence in various GNN scenarios.
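As a hedged illustration of adaptive label smoothing (the adaptation rule below is an assumption for exposition, not the paper's AU-LS rule), one can increase the smoothing strength when validation confidence exceeds accuracy and decrease it otherwise:

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, smoothing):
    """Cross-entropy against labels smoothed uniformly over all classes."""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    soft = F.one_hot(targets, n_classes).float() * (1.0 - smoothing) + smoothing / n_classes
    return -(soft * log_probs).sum(dim=-1).mean()

@torch.no_grad()
def adapt_smoothing(val_logits, val_targets, smoothing, lr=0.05):
    """Raise smoothing when the model is over-confident on validation data, lower it otherwise."""
    probs = F.softmax(val_logits, dim=-1)
    confidence = probs.max(dim=-1).values.mean().item()
    accuracy = (probs.argmax(dim=-1) == val_targets).float().mean().item()
    return float(min(max(smoothing + lr * (confidence - accuracy), 0.0), 0.5))
```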



Paperid:2407
Authors:Zhenzhong Wang, Qingyuan Zeng, Wanyu Lin, Min Jiang, Kay Chen Tan
The Hong Kong Polytechnic University, Xiamen University, The Hong Kong Polytechnic University, Xiamen University, The Hong Kong Polytechnic University
Abstract:
A plethora of fair graph neural networks (GNNs) have been proposed to promote algorithmic fairness for high-stakes real-life contexts. Meanwhile, explainability is generally proposed to help machine learning practitioners debug models by providing human-understandable explanations. However, little work on explainability has been devoted to generating explanations for fairness diagnosis in GNNs. From the explainability perspective, this paper explores the problem of what subgraph patterns cause the biased behavior of GNNs, and what actions practitioners could take to rectify the bias. By answering the two questions, this paper aims to produce compact, diagnostic, and actionable explanations that account for the discriminatory behavior. Specifically, we formulate the problem of generating diagnostic and actionable explanations as a multi-objective combinatorial optimization problem. To solve the problem, a dedicated multi-objective evolutionary algorithm is presented to ensure GNNs' explainability and fairness in one go. In particular, an influenced nodes-based gradient approximation is developed to boost the computation efficiency of the evolutionary algorithm. We provide a theoretical analysis to illustrate the effectiveness of the proposed framework. Extensive experiments have been conducted to demonstrate the superiority of the proposed method in terms of classification performance, fairness, and interpretability.



Paperid:2408
Authors:Zhuoyuan Wang, Reece Keller, Xiyu Deng, Kenta Hoshino, Takashi Tanaka, Yorie Nakahira
Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University, Kyoto University, University of Texas at Austin, Carnegie Mellon University
Abstract:
Optimal and safety-critical control are fundamental problems for stochastic systems, and are widely considered in real-world scenarios such as robotic manipulation and autonomous driving. In this paper, we consider the problem of efficiently finding optimal and safe control for high-dimensional systems. Specifically, we propose to use dimensionality reduction techniques from a comparison theorem for stochastic differential equations together with a generalizable physics-informed neural network to estimate the optimal value function and the safety probability of the system. The proposed framework results in substantial sample efficiency improvement compared to existing methods. We further develop an autoencoder-like neural network to automatically identify the low-dimensional features in the system to enhance the ease of design for system integration. We also provide experiments and quantitative analysis to validate the efficacy of the proposed method. Source code is available at https://github.com/jacobwang925/path-integral-PINN.



Paperid:2409
Authors:Honghao Wei, Xin Liu, Lei Ying
Washington State University, ShanghaiTech University, University of Michigan, Ann Arbor
Abstract:
This paper studies safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints where unsafe actions must be avoided at each step. Existing studies have considered safe RL with hard instantaneous constraints, but their approaches rely on several key assumptions: (i) the RL agent knows a safe action set for every state or knows a safe graph in which all the state-action-state triples are safe, and (ii) the constraint/cost functions are linear. In this paper, we consider safe RL with instantaneous hard constraints without assumption (i) and generalize (ii) to Reproducing Kernel Hilbert Space (RKHS). Our proposed algorithm, LSVI-AE, achieves O(√{d³H⁴K}) regret and O(H√{dK}) hard constraint violation when the cost function is linear and O(Hγₖ√{K}) hard constraint violation when the cost function belongs to RKHS. Here K is the learning horizon, H is the length of each episode, and γₖ is the information gain w.r.t. the kernel used to approximate cost functions. Our results achieve the optimal dependency on the learning horizon K, matching the lower bound we provide in this paper and demonstrating the efficiency of LSVI-AE. Notably, the design of our approach encourages aggressive policy exploration, providing a unique perspective on safe RL with general cost functions and no prior knowledge of safe actions, which may be of independent interest.



Paperid:2410
Authors:Jing Wu, Munawar Hayat, Mingyi Zhou, Mehrtash Harandi
Monash University, Monash University, Monash University, Monash University
Abstract:
Federated Learning (FL) is a distributed learning paradigm that enhances users' privacy by eliminating the need for clients to share raw, private data with the server. Despite this success, recent studies expose the vulnerability of FL to model inversion attacks, where adversaries reconstruct users’ private data via eavesdropping on the shared gradient information. We hypothesize that a key factor in the success of such attacks is the low entanglement among per-sample gradients within the batch during stochastic optimization. This creates a vulnerability that an adversary can exploit to reconstruct the sensitive data. Building upon this insight, we present a simple, yet effective defense strategy that obfuscates the gradients of the sensitive data with concealed samples. To achieve this, we propose synthesizing concealed samples to mimic the sensitive data at the gradient level while ensuring their visual dissimilarity from the actual sensitive data. Compared to the previous art, our empirical evaluations suggest that the proposed technique provides the strongest protection while simultaneously maintaining the FL performance. Code is located at https://github.com/JingWu321/DCS2.



Paperid:2411
Authors:Yuefei Wu, Bin Shi, Bo Dong, Qinghua Zheng, Hua Wei
School of Computer Science and Technology, Xi’an Jiaotong University Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, School of Computer Science and Technology, Xi’an Jiaotong University Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University School of Distance Education, Xi’an Jiaotong University, School of Computer Science and Technology, Xi’an Jiaotong University Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Arizona State University
Abstract:
Deep Evidential Regression (DER) places a prior on the original Gaussian likelihood and treats learning as an evidence acquisition process to quantify uncertainty. For the validity of the evidence theory, DER requires specialized activation functions to ensure that the prior parameters remain non-negative. However, such constraints will trigger evidence contraction, causing sub-optimal performance. In this paper, we analyse DER theoretically, revealing the intrinsic limitations behind its sub-optimal performance: the non-negativity constraints on the Normal Inverse-Gamma (NIG) prior parameter trigger the evidence contraction under the specialized activation function, which hinders the optimization of DER performance. On this basis, we design a Non-saturating Uncertainty Regularization term, which effectively ensures that the performance is further optimized in the right direction. Experiments on real-world datasets show that our proposed approach improves the performance of DER while maintaining the ability to quantify uncertainty.



Paperid:2412
Authors:Caiyi Yang, Javad Ghaderi
Columbia University, Columbia University
Abstract:
We consider decentralized learning over a network of workers with heterogeneous datasets, in the presence of Byzantine workers. Byzantine workers may transmit arbitrary or malicious values to neighboring workers, leading to degradation in overall performance. The heterogeneous nature of the training data across various workers complicates the identification and mitigation of Byzantine workers. To address this complex problem, we introduce a resilient decentralized learning approach that combines the gradient descent algorithm with a novel robust aggregator. Specifically, we propose a remove-then-clip aggregator, whereby each benign worker meticulously filters the neighbors' values and subsequently projects the remaining values to a sphere centered at its local value, with an appropriately selected radius. We prove that our proposed method converges to a neighborhood of a stationary point for non-convex objectives under standard assumptions. Furthermore, empirical evaluations are provided to demonstrate the superior performance of our method in comparison to existing algorithms, under various Byzantine attack models.
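A small sketch of a remove-then-clip style aggregation step is given below; the choice to remove the neighbors farthest from the worker's own value, and the parameters `n_remove` and `tau`, are assumptions of this illustration rather than the paper's exact rule:

```python
import numpy as np

def remove_then_clip(local_value, neighbor_values, n_remove, tau):
    """Filter the farthest neighbor values, clip the rest to a ball around the local value, average."""
    local_value = np.asarray(local_value, dtype=float)
    neighbor_values = np.asarray(neighbor_values, dtype=float)      # shape (n_neighbors, d)
    dists = np.linalg.norm(neighbor_values - local_value, axis=1)
    keep = np.argsort(dists)[: max(len(dists) - n_remove, 0)]       # drop the n_remove farthest
    clipped = [local_value]                                         # always include the own value
    for v in neighbor_values[keep]:
        diff = v - local_value
        norm = np.linalg.norm(diff)
        scale = min(1.0, tau / norm) if norm > 0 else 1.0           # project onto the radius-tau ball
        clipped.append(local_value + scale * diff)
    return np.mean(clipped, axis=0)
```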



Paperid:2413
Authors:Weisong Yang, Rafael Poyiadzi, Niall Twomey, Raul Santos-Rodriguez
University of Bristol, GSK, University of Bristol, University of Bristol
Abstract:
In supervised learning, automatically assessing the quality of the labels before any learning takes place remains an open research question. In certain cases, hypothesis testing procedures have been proposed to assess whether a given instance-label dataset is contaminated with class-conditional label noise, as opposed to uniform label noise. The existing theory builds on the asymptotic properties of the Maximum Likelihood Estimate for parametric logistic regression. However, the parametric assumptions on top of which these approaches are constructed are often too strong and unrealistic in practice. To alleviate this problem, in this paper we propose an alternative path by showing how similar procedures can be followed when the underlying model is a product of Local Maximum Likelihood Estimation that leads to more flexible nonparametric logistic regression models, which in turn are less susceptible to model misspecification. This different view allows for wider applicability of the tests by offering users access to a richer model class. Similarly to existing works, we assume we have access to anchor points which are provided by the users. We introduce the necessary ingredients for the adaptation of the hypothesis tests to the case of nonparametric logistic regression and empirically compare against the parametric approach, presenting both synthetic and real-world case studies and discussing the advantages and limitations of the proposed approach.



Paperid:2414
Authors:Jayanth Yetukuri, Ian Hardy, Yevgeniy Vorobeychik, Berk Ustun, Yang Liu
University of California, Santa Cruz, University of California, Santa Cruz, Washington University in St. Louis, University of California, San Diego, University of California, Santa Cruz
Abstract:
Machine learning models now automate decisions in applications where we may wish to provide recourse to adversely affected individuals. In practice, existing methods to provide recourse return actions that fail to account for latent characteristics that are not captured in the model (e.g., age, sex, marital status). In this paper, we study how the cost and feasibility of recourse can change across these latent groups. We introduce a notion of group-level plausibility to identify groups of individuals with a shared set of latent characteristics. We develop a general-purpose clustering procedure to identify groups from samples. Further, we propose a constrained optimization approach to learn models that equalize the cost of recourse over latent groups. We evaluate our approach through an empirical study on simulated and real-world datasets, showing that it can produce models that have better performance in terms of overall costs and feasibility at a group level.



Paperid:2415
Authors:Xiangyu Yin, Sihao Wu, Jiaxu Liu, Meng Fang, Xingyu Zhao, Xiaowei Huang, Wenjie Ruan
University of Liverpool, University of Liverpool, University of Liverpool, University of Liverpool, University of Warwick, University of Liverpool, University of Liverpool
Abstract:
While Goal-Conditioned Reinforcement Learning (GCRL) has gained attention, its algorithmic robustness against adversarial perturbations remains unexplored. The attacks and robust representation training methods that are designed for traditional RL become less effective when applied to GCRL. To address this challenge, we first propose the Semi-Contrastive Representation attack, a novel approach inspired by the adversarial contrastive attack. Unlike existing attacks in RL, it only necessitates information from the policy function and can be seamlessly implemented during deployment. Then, to mitigate the vulnerability of existing GCRL algorithms, we introduce Adversarial Representation Tactics, which combines Semi-Contrastive Adversarial Augmentation with Sensitivity-Aware Regularizer to improve the adversarial robustness of the underlying RL agent against various types of perturbations. Extensive experiments validate the superior performance of our attack and defence methods across multiple state-of-the-art GCRL algorithms. Our code is available at https://github.com/TrustAI/ReRoGCRL.



Paperid:2416
Authors:Hengrui Zhang, Youfang Lin, Shuo Shen, Sheng Han, Kai Lv
School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China, Cooperation Product Department, Interactive Entertainment Group, Tencent, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China
Abstract:
In the domain of real-world agents, the application of Reinforcement Learning (RL) remains challenging due to the necessity for safety constraints. Previously, Constrained Reinforcement Learning (CRL) has predominantly focused on on-policy algorithms. Although these algorithms exhibit a degree of efficacy, their interaction efficiency in real-world settings is sub-optimal, highlighting the demand for more efficient off-policy methods. However, off-policy CRL algorithms grapple with challenges in precise estimation of the C-function, particularly due to the fluctuations in the constrained Lagrange multiplier. Addressing this gap, our study focuses on the nuances of C-value estimation in off-policy CRL and introduces the Adaptive Ensemble C-learning (AEC) approach to reduce these inaccuracies. Building on state-of-the-art off-policy algorithms, we propose AEC-based CRL algorithms designed for enhanced task optimization. Extensive experiments on nine constrained robotics tasks reveal the superior interaction efficiency and performance of our algorithms in comparison to preceding methods.



Paperid:2417
Authors:Jiang Zhang, Qiong Wu, Yiming Xu, Cheng Cao, Zheng Du, Konstantinos Psounis
University of Southern California, Amazon.com, Inc., Amazon.com, Inc., Amazon.com, Inc., Amazon.com, Inc., University of Southern California
Abstract:
Toxic content detection is crucial for online services to remove inappropriate content that violates community standards. To automate the detection process, prior works have proposed a variety of machine learning (ML) approaches to train Language Models (LMs) for toxic content detection. However, both their accuracy and transferability across datasets are limited. Recently, Large Language Models (LLMs) have shown promise in toxic content detection due to their superior zero-shot and few-shot in-context learning ability as well as broad transferability on ML tasks. However, efficiently designing prompts for LLMs remains challenging. Moreover, the high run-time cost of LLMs may hinder their deployment in production. To address these challenges, in this work, we propose BD-LLM, a novel and efficient approach to bootstrapping and distilling LLMs for toxic content detection. Specifically, we design a novel prompting method named Decision-Tree-of-Thought (DToT) to bootstrap LLMs' detection performance and extract high-quality rationales. DToT can automatically select more fine-grained context to re-prompt LLMs when their responses lack confidence. Additionally, we use the rationales extracted via DToT to fine-tune student LMs. Our experimental results on various datasets demonstrate that DToT can improve the accuracy of LLMs by up to 4.6%. Furthermore, student LMs fine-tuned with rationales extracted via DToT outperform baselines on all datasets with up to 16.9% accuracy improvement, while being more than 60x smaller than conventional LLMs. Finally, we observe that student LMs fine-tuned with rationales exhibit better cross-dataset transferability.
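
As a rough illustration of the DToT control flow described above (re-prompting with finer-grained context when the response lacks confidence), here is a small Python sketch. The prompt wording, the confidence threshold, and the `llm` interface returning a (label, confidence, rationale) triple are assumptions made for illustration; they are not the paper's prompts or API.

```python
def dtot_classify(llm, text, contexts, max_depth=3, conf_threshold=0.8):
    """Hypothetical Decision-Tree-of-Thought-style control flow.

    `llm(prompt)` is assumed to return (label, confidence, rationale);
    `contexts` is an ordered list of increasingly fine-grained context
    snippets used to re-prompt when the model is not confident.
    """
    prompt = f"Is the following content toxic? Answer yes or no.\n\n{text}"
    label, confidence, rationale = llm(prompt)
    depth = 0
    while confidence < conf_threshold and depth < min(max_depth, len(contexts)):
        # Re-prompt with more fine-grained context when confidence is low.
        prompt = (f"Context: {contexts[depth]}\n"
                  f"Is the following content toxic? Answer yes or no and explain.\n\n{text}")
        label, confidence, rationale = llm(prompt)
        depth += 1
    # The rationale could later be used to fine-tune a smaller student LM.
    return label, rationale

def stub_llm(prompt):
    # Stand-in for a real LLM call; low confidence until context is added.
    return ("no", 0.9 if "Context:" in prompt else 0.5, "looks benign")

print(dtot_classify(stub_llm, "example post", ["user history", "thread context"]))
```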



Paperid:2418
Authors:Yanci Zhang, Han Yu
Nanyang Technological University, Nanyang Technological University
Abstract:
Federated learning (FL) is an emerging approach for training machine learning models collaboratively while preserving data privacy. The need for privacy protection makes it difficult for FL models to achieve global transparency and explainability. To address this limitation, we incorporate logic-based explanations into FL by proposing the Logical Reasoning-based eXplainable Federated Learning (LR-XFL) approach. Under LR-XFL, FL clients create local logic rules based on their local data and send them, along with model updates, to the FL server. The FL server connects the local logic rules through a proper logical connector that is derived based on properties of client data, without requiring access to the raw data. In addition, the server also aggregates the local model updates with weight values determined by the quality of the clients’ local data as reflected by their uploaded logic rules. The results show that LR-XFL outperforms the most relevant baseline by 1.19%, 5.81% and 5.41% in terms of classification accuracy, rule accuracy and rule fidelity, respectively. The explicit rule evaluation and expression under LR-XFL enable human experts to validate and correct the rules on the server side, hence improving the global FL model’s robustness to errors. It has the potential to enhance the transparency of FL models for areas like healthcare and finance where both data privacy and explainability are important.
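
The server-side behavior described above (rule-quality-weighted aggregation of model updates plus logical combination of client rules) can be sketched compactly. The OR/AND heuristic, the quality threshold, and the toy rule strings below are illustrative assumptions; the paper derives its connector from properties of client data rather than from this simple vote.

```python
import numpy as np

def lr_xfl_style_aggregate(client_updates, rule_qualities, client_rules,
                           quality_threshold=0.5):
    """Sketch of rule-quality-weighted aggregation on the FL server."""
    q = np.asarray(rule_qualities, dtype=float)
    w = q / q.sum()
    # Model updates are averaged with weights given by each client's rule quality.
    global_update = sum(w_i * u for w_i, u in zip(w, client_updates))

    # Illustrative connector choice: OR when most rules are high quality, else AND.
    connector = " OR " if (q >= quality_threshold).mean() >= 0.5 else " AND "
    global_rule = connector.join(f"({r})" for r in client_rules)
    return global_update, global_rule

updates = [np.array([0.1, -0.2]), np.array([0.3, 0.0]), np.array([-0.1, 0.4])]
qualities = [0.9, 0.6, 0.3]
rules = ["fever AND cough -> flu", "cough -> flu", "headache -> flu"]
print(lr_xfl_style_aggregate(updates, qualities, rules))
```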



Paperid:2419
Authors:Yunruo Zhang, Lujia Shen, Shanqing Guo, Shouling Ji
Shandong University, Zhejiang University, Shandong University, Zhejiang University
Abstract:
Transformers based on attention mechanisms exhibit vulnerability to adversarial examples, posing a substantial threat to the security of their applications. Aiming to solve this problem, the concept of robustness certification is introduced to formally ascertain the presence of any adversarial example within a specified region surrounding a given sample. However, prior works have neglected the dependencies among inputs of softmax (the most complex function in attention mechanisms) during linear relaxations. This oversight has consequently led to imprecise certification results. In this work, we introduce GaLileo, a general linear relaxation framework designed to certify the robustness of Transformers. GaLileo effectively surmounts the tradeoff between precision and efficiency in robustness certification through our innovative n-dimensional relaxation approach. Notably, our relaxation technique represents a pioneering effort as the first linear relaxation for n-dimensional functions such as softmax. Our novel approach successfully transcends the challenges posed by the curse of dimensionality inherent in linear relaxations, thereby enhancing linear bounds by incorporating input dependencies. Our evaluations encompassed a thorough analysis utilizing the SST and Yelp datasets along with diverse Transformers of different depths and widths. The experimental results demonstrate that, as compared to the baseline method CROWN-BaF, GaLileo achieves up to 3.24 times larger certified radii while requiring similar running times. Additionally, GaLileo successfully attains certification for Transformers' robustness against multi-word lp perturbations, marking a notable accomplishment in this field.



Paperid:2420
Authors:Puning Zhao, Fei Yu, Zhiguo Wan
Zhejiang Lab, Zhejiang Lab, Zhejiang Lab
Abstract:
Federated learning systems are susceptible to adversarial attacks. To combat this, we introduce a novel aggregator based on Huber loss minimization, and provide a comprehensive theoretical analysis. Under the independent and identically distributed (i.i.d.) assumption, our approach has several advantages compared to existing methods. Firstly, it has optimal dependence on epsilon, which stands for the ratio of attacked clients. Secondly, our approach does not need precise knowledge of epsilon. Thirdly, it allows different clients to have unequal data sizes. We then broaden our analysis to include non-i.i.d. data, such that clients have slightly different distributions.
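
The aggregator is defined via Huber loss minimization; a small sketch of one standard way to compute such a minimizer, iteratively reweighted averaging, is given below. The data-size weights and the choice of solver are assumptions made for illustration rather than the paper's exact algorithm.

```python
import numpy as np

def huber_aggregate(updates, sizes, delta=1.0, iters=50, tol=1e-8):
    """Find z minimizing sum_i w_i * H_delta(||x_i - z||) by reweighted averaging.

    H_delta is the Huber function (quadratic inside delta, linear outside),
    which yields the per-client weight min(1, delta / r_i) at the fixed point.
    """
    x = np.asarray(updates, dtype=float)           # (n_clients, dim)
    w = np.asarray(sizes, dtype=float)
    w = w / w.sum()
    z = np.average(x, axis=0, weights=w)           # start from the weighted mean
    for _ in range(iters):
        r = np.linalg.norm(x - z, axis=1)
        g = w * np.minimum(1.0, delta / np.maximum(r, 1e-12))
        z_new = (g[:, None] * x).sum(axis=0) / g.sum()
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

# Toy example: 8 honest clients near 0 and 2 Byzantine clients far away.
rng = np.random.default_rng(1)
updates = np.vstack([rng.normal(0.0, 0.1, size=(8, 3)), np.full((2, 3), 50.0)])
print(huber_aggregate(updates, sizes=np.ones(10), delta=1.0))
```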



Paperid:2421
Authors:Shanshan Zhao, Wenhai Cui, Bei Jiang, Linglong Kong, Xiaodong Yan
Mathematics Discipline, Shandong University, Mathematics Discipline, Shandong University, Mathematics Discipline, University of Alberta, Mathematics Discipline, University of Alberta, Mathematics Discipline, Shandong University Shandong National Center for Applied Mathematics
Abstract:
To ensure user safety by protecting privacy, traditional privacy-preserving bandit algorithms aiming to maximize the mean reward have been widely studied in scenarios such as online ride-hailing, advertising recommendations, and personalized healthcare. However, classical bandit learning is irresponsible in such practical applications because it fails to account for risks in online decision-making and ignores external system information. This paper first proposes privacy-protected mean-volatility utility as the objective of bandit learning and proves its responsibility, in that it aims at achieving the maximum probability of utility by taking risk into account. Theoretically, our proposed responsible bandit learning is expected to achieve the fastest convergence rate among current bandit algorithms and generates more statistical power than the classical normality-based test. Finally, simulation studies provide supporting evidence for the theoretical results and demonstrate stronger performance when using stricter privacy budgets.



Paperid:2422
Authors:Yue Zhao, Congyi Li, Kai Chen
Institute of Information Engineering, Chinese Academy of Sciences, China;, Institute of Information Engineering, Chinese Academy of Sciences, China; School of Cyber Security, University of Chinese Academy of Science, China, Institute of Information Engineering, Chinese Academy of Sciences, China; School of Cyber Security, University of Chinese Academy of Science, China
Abstract:
Recent advances in backdoor attacks, like leveraging complex triggers or stealthy implanting techniques, have introduced new challenges in backdoor scanning, limiting the usability of Deep Neural Networks (DNNs) in various scenarios. In this paper, we propose Unlearning-based Model Ablation (UMA), a novel approach to facilitate backdoor scanning and defend against advanced backdoor attacks. UMA filters out backdoor-irrelevant features by ablating the inherent features of the target class within the model and subsequently reveals the backdoor through dynamic trigger optimization. We evaluate our method on 1700 models (700 benign and 1000 trojaned) with 6 model structures, 7 different backdoor attacks and 4 datasets. Our results demonstrate that the proposed methodology effectively detects these advanced backdoors. Specifically, our method can achieve 91% AUC-ROC and 86.6% detection accuracy on average, which outperforms the baselines, including Neural Cleanse, ABS, K-Arm and MNTD.



Paperid:2423
Authors:Guangtao Zheng, Mengdi Huai, Aidong Zhang
University of Virginia, Iowa State University, University of Virginia
Abstract:
Single domain generalization (SDG) aims to train a robust model against unknown target domain shifts using data from a single source domain. Data augmentation has been proven an effective approach to SDG. However, the utility of standard augmentations, such as translate or invert, has not been fully exploited in SDG; practically, these augmentations are used as a part of a data preprocessing procedure. Although it is intuitive to use many such augmentations to boost the robustness of a model to out-of-distribution domain shifts, we lack a principled approach for harvesting the benefit brought by multiple such augmentations. Here, we conceptualize standard data augmentations with learnable parameters as semantics transformations that can manipulate certain semantics of a sample, such as the geometry or color of an image. Then, we propose Adversarial learning with Semantics Transformations (AdvST) that augments the source domain data with semantics transformations and learns a robust model with the augmented data. We theoretically show that AdvST essentially optimizes a distributionally robust optimization objective defined on a set of semantics distributions induced by the parameters of semantics transformations. We demonstrate that AdvST can produce samples that expand the coverage on target domain data. Compared with the state-of-the-art methods, AdvST, despite being a simple method, is surprisingly competitive and achieves the best average SDG performance on the Digits, PACS, and DomainNet datasets. Our code is available at https://github.com/gtzheng/AdvST.
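
The alternating structure of AdvST (maximize the loss over transformation parameters, then train on the augmented data) can be sketched in PyTorch. A single learnable brightness/contrast transformation, sign-gradient inner updates, and the step sizes below are illustrative assumptions; the paper uses a richer set of parameterized semantics transformations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def advst_step(model, x, y, opt, adv_steps=5, adv_lr=0.1):
    """One AdvST-style training step (illustrative sketch)."""
    # Learnable transformation parameters: per-sample contrast a and brightness b.
    a = torch.ones(x.size(0), 1, 1, 1, requires_grad=True)
    b = torch.zeros(x.size(0), 1, 1, 1, requires_grad=True)

    # Inner maximization: make the transformed samples hard for the model.
    for _ in range(adv_steps):
        x_aug = torch.clamp(a * x + b, 0.0, 1.0)
        loss_adv = F.cross_entropy(model(x_aug), y)
        g_a, g_b = torch.autograd.grad(loss_adv, [a, b])
        a = (a + adv_lr * g_a.sign()).detach().requires_grad_(True)
        b = (b + adv_lr * g_b.sign()).detach().requires_grad_(True)

    # Outer minimization: train on the original plus augmented data.
    x_all = torch.cat([x, torch.clamp(a.detach() * x + b.detach(), 0.0, 1.0)])
    y_all = torch.cat([y, y])
    opt.zero_grad()
    loss = F.cross_entropy(model(x_all), y_all)
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with a tiny CNN on random "images".
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
print(advst_step(model, x, y, opt))
```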



Paperid:2424
Authors:Li Zhong, Zilong Wang
University of California, San Diego, University of California, San Diego
Abstract:
Recently, large language models (LLMs) have shown an extraordinary ability to understand natural language and generate programming code. It has been a common practice for software engineers to consult LLMs when encountering coding questions. Although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of code generation from LLMs have not yet been thoroughly studied. Executable code is not equivalent to reliable and robust code, especially in the context of real-world software development. For example, the misuse of APIs in the generated code could lead to severe problems, such as resource leaks, program crashes, etc. Existing code evaluation benchmarks and datasets focus on crafting small tasks such as programming questions in coding interviews, which deviates from the problems developers would ask LLMs about for real-world coding help. To fill the missing piece, in this work, we propose a dataset RobustAPI for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from Stack Overflow on 18 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate them on current popular LLMs. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code is introduced into real-world software.



Paperid:2425
Authors:Jiachen Zhou, Peizhuo Lv, Yibing Lan, Guozhu Meng, Kai Chen, Hualong Ma
Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China, Institute of Information Engineering, Chinese Academy of Sciences, China School of Cyber Security, University of Chinese Academy of Sciences, China
Abstract:
Dataset sanitization is a widely adopted proactive defense against poisoning-based backdoor attacks, aimed at filtering out and removing poisoned samples from training datasets. However, existing methods have shown limited efficacy in countering the ever-evolving trigger functions, and often lead to considerable degradation of benign accuracy. In this paper, we propose DataElixir, a novel sanitization approach tailored to purify poisoned datasets. We leverage diffusion models to eliminate trigger features and restore benign features, thereby turning the poisoned samples into benign ones. Specifically, with multiple iterations of the forward and reverse process, we extract intermediary images and their predicted labels for each sample in the original dataset. Then, we identify anomalous samples in terms of the presence of label transition of the intermediary images, detect the target label by quantifying distribution discrepancy, select their purified images considering pixel and feature distance, and determine their ground-truth labels by training a benign model. Experiments conducted on 9 popular attacks demonstrate that DataElixir effectively mitigates various complex attacks while exerting minimal impact on benign accuracy, surpassing the performance of baseline defense methods.



Paperid:2426
Authors:Pascal Zimmer, Sébastien Andreina, Giorgia Azzurra Marson, Ghassan Karame
Ruhr-Universität Bochum, Germany, NEC Labs Europe, Germany, NEC Labs Europe, Germany, Ruhr-Universität Bochum, Germany
Abstract:
Although promising, existing defenses against query-based attacks share a common limitation: they offer increased robustness against attacks at the price of a considerable accuracy drop on clean samples. In this work, we show how to efficiently establish, at test-time, a solid tradeoff between robustness and accuracy when mitigating query-based attacks. Given that these attacks necessarily explore low-confidence regions, our insight is that activating dedicated defenses, such as random noise defense and random image transformations, only for low-confidence inputs is sufficient to prevent them. Our approach is independent of training and supported by theory. We verify the effectiveness of our approach for various existing defenses by conducting extensive experiments on CIFAR-10, CIFAR-100, and ImageNet. Our results confirm that our proposal can indeed enhance these defenses by providing better tradeoffs between robustness and accuracy when compared to state-of-the-art approaches while being completely training-free.
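
The core test-time idea, activating a randomized defense only for low-confidence inputs, fits in a small wrapper, sketched below. Gaussian noise as the defense and a 0.7 confidence threshold are assumptions chosen for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

class ConfidenceGatedDefense(torch.nn.Module):
    """Wrap a classifier and defend only inputs whose confidence is low."""

    def __init__(self, model, threshold=0.7, noise_std=0.05):
        super().__init__()
        self.model, self.threshold, self.noise_std = model, threshold, noise_std

    @torch.no_grad()
    def forward(self, x):
        probs = F.softmax(self.model(x), dim=1)
        conf, _ = probs.max(dim=1)
        low = conf < self.threshold          # region query-based attacks must explore
        if low.any():
            # Apply the randomized defense (here: additive noise) only to those inputs.
            noisy = x[low] + self.noise_std * torch.randn_like(x[low])
            probs[low] = F.softmax(self.model(noisy), dim=1)
        return probs

# Usage: wrap any pretrained classifier; high-confidence clean inputs are untouched.
base = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
defended = ConfidenceGatedDefense(base)
print(defended(torch.rand(2, 3, 32, 32)).shape)
```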



Paperid:2427
Authors:Adam Żychowski, Andrew Perrault, Jacek Mańdziuk
Warsaw University of Technology, The Ohio State University, Warsaw University of Technology AGH University of Krakow
Abstract:
In recent years, there has been growing interest in developing robust machine learning (ML) models that can withstand adversarial attacks, including decision trees (DTs), one of the most widely adopted, efficient, and interpretable ML algorithms. This paper proposes a novel coevolutionary algorithm (CoEvoRDT) designed to create robust DTs capable of handling noisy high-dimensional data in adversarial contexts. Motivated by the limitations of traditional DT algorithms, we leverage adaptive coevolution to allow DTs to evolve and learn from interactions with perturbed input data. CoEvoRDT alternately evolves competing populations of DTs and perturbed features, enabling construction of DTs with desired properties. CoEvoRDT is easily adaptable to various target metrics, allowing the use of tailored robustness criteria such as minimax regret. Furthermore, CoEvoRDT has the potential to improve the results of other state-of-the-art methods by incorporating their outcomes (the DTs they produce) into the initial population and optimizing them in the process of coevolution. Inspired by game theory, CoEvoRDT utilizes mixed Nash equilibrium to enhance convergence. The method is tested on 20 popular datasets and shows superior performance compared to 4 state-of-the-art algorithms. It outperformed all competing methods on 13 datasets with adversarial accuracy metrics, and on all 20 considered datasets with minimax regret. Strong experimental results and flexibility in choosing the error measure make CoEvoRDT a promising approach for constructing robust DTs in real-world applications.



Paperid:2428
Authors:Kshitiz ., Sonu Shreshtha, Bikash Dutta, Muskan Dosi, Mayank Vatsa, Richa Singh, Saket Anand, Sudeep Sarkar, Sevaram Mali Parihar
Indian Institute of Technology Jodhpur, India, Indian Institute of Technology Jodhpur, India, Indian Institute of Technology, Jodhpur, India, Indian Institute of Technology, Jodhpur, India, Indian Institute of Technology, Jodhpur, India, Indian Institute of Technology, Jodhpur, India, Indraprastha Institute of Information Technology Delhi, India, University of South Florida, Tampa, Florida, USA, Crane Conservationist, Khichan, India
Abstract:
Automatic recognition of bird behavior from long-term, uncontrolled outdoor imagery can contribute to conservation efforts by enabling large-scale monitoring of bird populations. Current techniques in AI-based wildlife monitoring have focused on short-term tracking and monitoring birds individually rather than in species-rich flocks. We present Bird-Collect, a comprehensive benchmark dataset for monitoring dense bird flock attributes. It includes a unique collection of more than 6,000 high-resolution images of Demoiselle Cranes (Anthropoides virgo) feeding and nesting in the vicinity of the Khichan region of Rajasthan. Particularly, each image contains an average of 190 individual birds, illustrating the complex dynamics of densely populated bird flocks on a scale that has not previously been studied. In addition, a total of 433 distinct pictures captured at Keoladeo National Park, Bharatpur provide a comprehensive representation of 34 distinct bird species belonging to various taxonomic groups. These images offer insights into the diversity and behaviour of birds in a vital natural ecosystem along the migratory flyways. Additionally, we provide a set of 2,500 point-annotated samples which serve as ground truth for benchmarking various computer vision tasks like crowd counting, density estimation, segmentation, and species classification. The benchmark performance for these tasks highlights the need for tailored approaches for specific wildlife applications, which involve varied conditions including views, illumination, and resolutions. At around 46.2 GB in size, encompassing data collected from two distinct nesting grounds, it is the largest bird dataset containing detailed annotations, showcasing a substantial leap in bird research possibilities. We intend to publicly release the dataset to the research community. The database is available at: https://iab-rubric.org/resources/wildlife-dataset/birdcollect



Paperid:2429
Authors:Gabriel Agostini, Emma Pierson, Nikhil Garg
Cornell Tech, Cornell Tech, Cornell Tech
Abstract:
Decision-makers often observe the occurrence of events through a reporting process. City governments, for example, rely on resident reports to find and then resolve urban infrastructural problems such as fallen street trees, flooded basements, or rat infestations. Without additional assumptions, there is no way to distinguish events that occur but are not reported from events that truly did not occur, a fundamental problem in settings with positive-unlabeled data. Because disparities in reporting rates correlate with resident demographics, addressing incidents only on the basis of reports leads to systematic neglect in neighborhoods that are less likely to report events. We show how to overcome this challenge by leveraging the fact that events are spatially correlated. Our framework uses a Bayesian spatial latent variable model to infer event occurrence probabilities and applies it to storm-induced flooding reports in New York City, further pooling results across multiple storms. We show that a model accounting for under-reporting and spatial correlation predicts future reports more accurately than other models, and further induces a more equitable set of inspections: its allocations better reflect the population and provide equitable service to non-white, less traditionally educated, and lower-income residents. This finding reflects heterogeneous reporting behavior learned by the model: reporting rates are higher in Census tracts with higher populations, proportions of white residents, and proportions of owner-occupied households. Our work lays the groundwork for more equitable proactive government services, even with disparate reporting behavior.



Paperid:2430
Authors:Jatin Agrawal, Mukul Kumar, Avtansh Tiwari, Sachin Danisetty, Soma Dhavala, Nakul Jain, Prasaanth Balraj, Niket Singh, Siddhant Shingi, Jayakrishna Kurada, Raghuram Rao, S Anand, Nishant Kumar
Wadhwani Institute for Artificial Intelligence, Wadhwani Institute for Artificial Intelligence, Wadhwani Institute for Artificial Intelligence, Wadhwani Institute for Artificial Intelligence, Wadhwani Institute for Artificial Intelligence, Wadhwani Institute for Artificial Intelligence, Wadhwani Institute for Artificial Intelligence, Wadhwani Institute for Artificial Intelligence, Wadhwani Institute for Artificial Intelligence, Wadhwani Institute for Artificial Intelligence, Central TB Division, India, Central TB Division, India, Central TB Division, India
Abstract:
Line Probe Assay (LPA) is a widely used method for diagnosing drug-resistant tuberculosis (DRTB), but it is a time-consuming and labor-intensive process that requires expert interpretation. DRTB is a significant threat to global TB control efforts and its prompt diagnosis is critical for initiating appropriate treatment. In this paper, we present an automated LPA test interpretation solution that uses computer vision techniques to extract and analyze strips from LPA sheets and uses machine learning algorithms to produce drug sensitivity and resistivity outcomes with extremely high precision and recall. We also develop OCR models to eliminate manual data entry to further reduce the overall time. Our solution comprises a rejection module that flags ambiguous and novel samples that are then referred to experienced lab technicians. This results in increased trust in the solution. To evaluate our solution, we curate an extensive and diverse dataset of LPA strips annotated by multiple microbiologists across India. Our solution achieves more than 95% accuracy for all drugs on this dataset. The proposed solution has the potential to increase the efficiency and standardization of LPA test interpretation, and to fast-track the dissemination of results to end-users via a designated Management Information System (MIS).



Paperid:2431
Authors:Inaam Ashraf, Janine Strotherm, Luca Hermes, Barbara Hammer
CITEC, Bielefeld University, CITEC, Bielefeld University, CITEC, Bielefeld University, CITEC, Bielefeld University
Abstract:
Water distribution systems (WDS) are an integral part of critical infrastructure which is pivotal to urban development. As 70% of the world's population will likely live in urban environments by 2050, efficient simulation and planning tools for WDS play a crucial role in reaching the UN's Sustainable Development Goal (SDG) 6, "Clean water and sanitation for all". In this realm, we propose a novel and efficient machine learning emulator, more precisely, a physics-informed deep learning (DL) model, for hydraulic state estimation in WDS. Using a recursive approach, our model only needs a few graph convolutional neural network (GCN) layers and employs an innovative algorithm based on message passing. Unlike conventional machine learning tasks, the model uses hydraulic principles to infer two additional hydraulic state features in the process of reconstructing the available ground truth feature in an unsupervised manner. To the best of our knowledge, this is the first DL approach to emulate the popular hydraulic simulator EPANET, utilizing no additional information. Like most DL models and unlike the hydraulic simulator, our model demonstrates vastly faster emulation times that do not increase drastically with the size of the WDS. Moreover, we achieve high accuracy on the ground truth and very similar results compared to the hydraulic simulator as demonstrated through experiments on five real-world WDS datasets.



Paperid:2432
Authors:Thomas Bailie, Yun Sing Koh, Neelesh Rampal, Peter B. Gibson
The University of Auckland, The University of Auckland, National Institute of Water and Atmospheric Research, National Institute of Water and Atmospheric Research
Abstract:
Global Climate Models (GCMs) simulate low resolution climate projections on a global scale. The native resolution of GCMs is generally too low for societal-level decision-making. To enhance the spatial resolution, downscaling is often applied to GCM output. Statistical downscaling techniques, in particular, are well-established as a cost-effective approach. They require significantly less computational time than physics-based dynamical downscaling. In recent years, deep learning has gained prominence in statistical downscaling, demonstrating significantly lower error rates compared to traditional statistical methods. However, a drawback of regression-based deep learning techniques is their tendency to overfit to the mean sample intensity. Extreme values as a result are often underestimated. Problematically, extreme events have the largest societal impact. We propose Quantile-Regression-Ensemble (QRE), an innovative deep learning algorithm inspired by boosting methods. Its primary objective is to avoid trade-offs between fitting to sample means and extreme values by training independent models on a partitioned dataset. Our QRE is robust to redundant models and not susceptible to explosive ensemble weights, ensuring a reliable training process. QRE achieves lower Mean Squared Error (MSE) compared to various baseline models. In particular, our algorithm has a lower error for high-intensity precipitation events over New Zealand, highlighting the ability to represent extreme events accurately.
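
One simplified reading of the partitioned-ensemble idea above is sketched here: the target is split by intensity quantiles, an independent regressor is fit per partition, and predictions are combined with fixed, bounded weights. Linear base models, three partitions, and equal ensemble weights are assumptions for illustration; they are not the paper's deep learning architecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def train_partitioned_ensemble(X, y, n_partitions=3):
    """Fit one regressor per intensity-quantile partition of the target."""
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_partitions + 1))
    models = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y >= lo) & (y <= hi)       # extremes get their own dedicated model
        models.append(LinearRegression().fit(X[mask], y[mask]))
    return models

def predict_ensemble(models, X):
    preds = np.stack([m.predict(X) for m in models])
    return preds.mean(axis=0)              # fixed weights, so they cannot "explode"

# Toy precipitation-like data with a heavy right tail.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = np.exp(X[:, 0]) + 0.1 * rng.normal(size=500)
models = train_partitioned_ensemble(X, y)
print(predict_ensemble(models, X[:5]))
```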



Paperid:2433
Authors:Marcel Barros, Andressa Pinto, Andres Monroy, Felipe Moreno, Jefferson Coelho, Aldomar Pietro Silva, Caio Fabricio Deberaldini Netto, José Roberto Leite, Marlon Mathias, Eduardo Tannuri, Artur Jordao, Edson Gomi, Fabio Cozman, Marcelo Dottori, Anna Helena Reali Costa
Universidade de São Paulo, Universidade de São Paulo, Massachusetts Institute of Technology, Universidade de São Paulo, Universidade de São Paulo, Universidade de São Paulo, Universidade de São Paulo, Universidade de São Paulo, Universidade de São Paulo, Universidade de São Paulo, Universidade de São Paulo, Universidade de São Paulo, Universidade de São Paulo, Universidade de São Paulo, Universidade de São Paulo
Abstract:
Sea-level rise is a well-known consequence of climate change. Several studies have estimated the social and economic impact of the increase in extreme flooding. An efficient way to mitigate its consequences is the development of a flood alert and prediction system, based on high-resolution numerical models and robust sensing networks. However, current models use various simplifying assumptions that compromise accuracy to ensure solvability within a reasonable timeframe, hindering more regular and cost-effective forecasts for various locations along the shoreline. To address these issues, this work proposes a hybrid model for multimodal data processing that combines physics-based numerical simulations, data obtained from a network of sensors, and satellite images to provide refined wave and sea-surface height forecasts, with real results obtained in a critical location within the Port of Santos (the largest port in Latin America). Our approach exhibits faster convergence than data-driven models while achieving more accurate predictions. Moreover, the model handles irregularly sampled time series and missing data without the need for complex preprocessing mechanisms or data imputation while keeping low computational costs through a combination of time encoding, recurrent and graph neural networks. Enabling raw sensor data to be easily combined with existing physics-based models opens up new possibilities for accurate forecast systems for extreme storm tide events that enhance community safety and aid policymakers in their decision-making processes.



Paperid:2434
Authors:Cassidy K. Buhler, Hande Y. Benson
Drexel University, Drexel University
Abstract:
Protected areas (PAs) are designated spaces where human activities are restricted to preserve critical habitats. Decision-makers are challenged with balancing a trade-off between financial feasibility and ecological benefit when establishing PAs. Given the long-term ramifications of these decisions and the constantly shifting environment, it is crucial that PAs are carefully selected with long-term viability in mind. Using AI tools like simulation and optimization is common for designating PAs, but current decision models are primarily linear. In this paper, we propose a derivative-free optimization framework paired with a nonlinear component, population viability analysis (PVA). Formulated as a mixed integer nonlinear programming (MINLP) problem, our model allows for linear and nonlinear inputs. Connectivity, competition, crowding, and other similar concerns are handled by the PVA software, rather than expressed as constraints of the optimization model. In addition, we present numerical results that serve as a proof of concept, showing our models yield PAs with similar expected risk to that of preserving every parcel in a habitat, but at a significantly lower cost. The overall goal is to promote interdisciplinary work by providing a new mathematical programming tool for conservationists that allows for nonlinear inputs and can be paired with existing ecological software. The code and data are available at https://github.com/cassiebuhler/conservation-dfo.
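
The coupling of a linear cost objective with a black-box PVA component can be illustrated with a tiny derivative-free optimization sketch. The penalty formulation, the stand-in pva_extinction_risk function, and the use of SciPy's differential evolution are assumptions made for illustration; the paper's actual MINLP formulation and PVA software are not reproduced here.

```python
import numpy as np
from scipy.optimize import differential_evolution

# Hypothetical inputs: per-parcel acquisition costs and habitat quality.
rng = np.random.default_rng(3)
costs = rng.uniform(1.0, 10.0, size=12)
quality = rng.uniform(0.0, 1.0, size=12)

def pva_extinction_risk(selection):
    """Stand-in for an external PVA simulator (a black box to the optimizer)."""
    protected_quality = float(quality @ selection)
    return float(np.exp(-protected_quality))   # more protected habitat, lower risk

def objective(x, risk_cap=0.25, penalty=1e3):
    selection = np.round(x)                    # relax binary parcel choices
    cost = float(costs @ selection)
    risk = pva_extinction_risk(selection)
    return cost + penalty * max(0.0, risk - risk_cap)

result = differential_evolution(objective, bounds=[(0, 1)] * len(costs),
                                seed=0, maxiter=200, polish=False)
print("selected parcels:", np.round(result.x).astype(int), "cost:", round(result.fun, 2))
```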



Paperid:2435
Authors:Bingzhi Chen, Sisi Fu, Yishu Liu, Jiahui Pan, Guangming Lu, Zheng Zhang
South China Normal University Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, South China Normal University, Harbin Institute of Technology, Shenzhen, South China Normal University, Harbin Institute of Technology, Shenzhen Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology, Shenzhen
Abstract:
Dental caries has been widely recognized as one of the most prevalent chronic diseases in the field of public health. Despite advancements in automated diagnosis across various medical domains, dental caries detection remains a substantial challenge due to its inherent variability and intricacies. To bridge this gap, we release a hospital-scale panoramic dental X-ray benchmark, namely “CariesXrays”, to facilitate the advancements in high-precision computer-aided diagnosis for dental caries. It comprises 6,000 panoramic dental X-ray images, with a total of 13,783 instances of dental caries, all meticulously annotated by dental professionals. In this paper, we propose a novel Feature Pyramid Contrastive Learning (FPCL) framework that jointly incorporates feature pyramid learning and contrastive learning within a unified diagnostic paradigm for automated dental caries detection. Specifically, a robust dual-directional feature pyramid network (D2D-FPN) is designed to adaptively capture rich and informative contextual information from multi-level feature maps, thus enhancing the generalization ability of caries detection across different scales. Furthermore, our model is augmented with an effective proposals-prototype contrastive regularization learning (P2P-CRL) mechanism, which can flexibly bridge the semantic gaps among diverse dental caries with varying appearances, resulting in high-quality dental caries proposals. Extensive experiments on our newly-established CariesXrays benchmark demonstrate the potential of FPCL to make a significant social impact on caries diagnosis.



Paperid:2436
Authors:Weiye Chen, Yiqun Xie, Xiaowei Jia, Erhu He, Han Bao, Bang An, Xun Zhou
University of Maryland, University of Maryland, University of Pittsburgh, University of Pittsburgh, University of Iowa, University of Iowa, University of Iowa
Abstract:
When dealing with data from distinct locations, machine learning algorithms tend to demonstrate an implicit preference for some locations over others, which constitutes biases that sabotage the spatial fairness of the algorithm. This unfairness can easily introduce biases in subsequent decision-making given the broad adoption of learning-based solutions in practice. However, locational biases in AI are largely understudied. To mitigate biases over locations, we propose a locational meta-referee (Meta-Ref) to oversee the few-shot meta-training and meta-testing of a deep neural network. Meta-Ref dynamically adjusts the learning rates for training samples of given locations to advocate a fair performance across locations, through an explicit consideration of locational biases and the characteristics of input data. We present a three-phase training framework to learn both a meta-learning-based predictor and an integrated Meta-Ref that governs the fairness of the model. Once trained with a distribution of spatial tasks, Meta-Ref is applied to samples from new spatial tasks (i.e., regions outside the training area) to promote fairness during the fine-tuning step. We carried out experiments with two case studies on crop monitoring and transportation safety, which show Meta-Ref can improve locational fairness while keeping the overall prediction quality at a similar level.



Paperid:2437
Authors:Yuhan Chen, Nuwa Xi, Yanrui Du, Haochun Wang, Jianyu Chen, Sendong Zhao, Bing Qin
Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology, Harbin Institute of Technology
Abstract:
Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments in in-silico molecule discovery have highlighted the promising results of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently encounter the issue of data scarcity, hampering their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to effectively leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods, while also requiring a smaller model scale, reduced data size and lower training cost, highlighting its efficiency. Furthermore, our method shows a sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data in advancing low-resource cross-modal molecule discovery.



Paperid:2438
Authors:Zirong Chen, Xutong Sun, Yuanhe Li, Meiyi Ma
Vanderbilt University, Nashville, TN, Vanderbilt University, Nashville, TN, Vanderbilt University, Nashville, TN, Vanderbilt University, Nashville, TN
Abstract:
Emergency and non-emergency response systems are essential services provided by local governments and critical to protecting lives, the environment, and property. The effective handling of (non-)emergency calls is critical for public safety and well-being. By reducing the burden imposed by non-emergency callers, residents in critical need of assistance through 911 can receive a fast and effective response. Collaborating with the Department of Emergency Communications (DEC) in Nashville, we analyzed 11,796 non-emergency call recordings and developed Auto311, the first automated system to handle 311 non-emergency calls, which (1) effectively and dynamically predicts ongoing non-emergency incident types to generate tailored case reports during the call; (2) itemizes essential information from dialogue contexts to complete the generated reports; and (3) strategically structures system-caller dialogues with optimized confidence. We used real-world data to evaluate the system's effectiveness and deployability. The experimental results indicate that the system effectively predicts incident type with an average F-1 score of 92.54%. Moreover, the system successfully itemizes critical information from relevant contexts to complete reports, evincing a 0.93 average consistency score compared to the ground truth. Additionally, emulations demonstrate that the system effectively decreases conversation turns as utterances become longer and categorizes the ongoing call with 94.49% mean accuracy.



Paperid:2439
Authors:Hyunmin Choi, Simon S. Woo, Hyoungshick Kim
NAVER Cloud, South Korea Department of Computer Science and Engineering, Sungkyunkwan University, South Korea, Department of Artificial Intelligence, Sungkyunkwan University, South Korea Department of Computer Science and Engineering, Sungkyunkwan University, South Korea, Department of Computer Science and Engineering, Sungkyunkwan University, South Korea
Abstract:
Fingerprint authentication is a popular security mechanism for smartphones and laptops. However, its adoption in web and cloud environments has been limited due to privacy concerns over storing and processing biometric data on servers. This paper introduces Blind-Touch, a novel machine learning-based fingerprint authentication system leveraging homomorphic encryption to address these privacy concerns. Homomorphic encryption allows computations on encrypted data without decrypting it. Thus, Blind-Touch can keep fingerprint data encrypted on the server while performing machine learning operations. Blind-Touch combines three strategies to efficiently utilize homomorphic encryption in machine learning: (1) It optimizes the feature vector for a distributed architecture, processing the first fully connected layer (FC-16) in plaintext on the client side and the subsequent layer (FC-1) post-encryption on the server, thereby minimizing encrypted computations; (2) It employs a homomorphic encryption-compatible data compression technique capable of handling 8,192 authentication results concurrently; and (3) It utilizes a clustered server architecture to simultaneously process authentication results, thereby enhancing scalability with increasing user numbers. Blind-Touch achieves high accuracy on two benchmark fingerprint datasets, with a 93.6% F1-score for the PolyU dataset and a 98.2% F1-score for the SOKOTO dataset. Moreover, Blind-Touch can match a fingerprint among 5,000 in about 0.65 seconds. With its privacy-focused design, high accuracy, and efficiency, Blind-Touch is a promising alternative to conventional fingerprint authentication for web and cloud applications.
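
The plaintext/encrypted split in strategy (1) can be sketched structurally: the client evaluates FC-16 in plaintext, encrypts the 16-dimensional activation, and the server evaluates FC-1 on the ciphertext. The encrypt/decrypt functions below are placeholders standing in for a real homomorphic encryption library, and the feature and layer sizes other than 16 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# --- Client side -------------------------------------------------------------
W_fc16 = rng.normal(size=(128, 16))    # hypothetical 128-dim feature -> FC-16
feature = rng.normal(size=128)         # fingerprint feature vector from the extractor

def client_forward(feature):
    # FC-16 is computed in plaintext on the client, per the abstract.
    return np.maximum(feature @ W_fc16, 0.0)

def encrypt(vec):
    # Placeholder: a real deployment would encrypt with an HE scheme such as CKKS;
    # here the "ciphertext" is just a copy so the sketch stays runnable.
    return vec.copy()

def decrypt(ct):
    return ct

# --- Server side -------------------------------------------------------------
W_fc1 = rng.normal(size=(16, 1))       # final matching layer FC-1

def server_match(encrypted_activation):
    # Under HE, a dot product with plaintext weights is computable, so the server
    # scores the match without ever decrypting the client's activation.
    return encrypted_activation @ W_fc1

score_ct = server_match(encrypt(client_forward(feature)))
print("match score:", float(decrypt(score_ct)))
```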



Paperid:2440
Authors:Jong in Choi, Won Kyung Lee, Jae Hwan Lee, So Young Sohn
Yonsei University Finance AI Data Center, Korea Credit Information Services, Yonsei University The Hartree Centre, STFC Laboratory, Yonsei University, Yonsei University
Abstract:
Most countries provide veterans with various benefits to reward their sacrifice. Unfortunately, many veterans have failed to prove their status due to loss of military records. Thus, some governments allow the verification of those veterans through "buddy statements" obtained from people who can vouch for the veteran's participation in the war. However, it is still challenging for veterans to find guarantors directly. With this background, we suggest utilizing historical war records of combined operations to increase the pool of potential guarantors for the buddy statements. However, a combined operation network among troops can have missing edges and perturbations on the attributes of troops due to inaccurate information. In this study, we learn from some recorded interactions which might be incomplete and noisy, and predict missing linkages among the troops that might have interacted together in the war, by proposing Robust-SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction). It combines two Graph Neural Network (GNN) architectures: a robust Graph Convolutional Network which considers the uncertainty of node attributes with a probabilistic approach, and SEAL which improves the expressive power of the GNN with a labeling trick. Our proposed approach was applied to Korean War data with perturbations. For experimentation, we hid some actual interactions and found that Robust-SEAL restores missing interactions better than other GNN-based baselines.



Paperid:2441
Authors:Yujin Choi, Jinseong Park, Hoki Kim, Jaewook Lee, Saerom Park
Seoul National University, Seoul National University, Seoul National University, Seoul National University, Ulsan National Institute of Science and Technology
Abstract:
Diffusion models have shown their effectiveness in generation tasks by well-approximating the underlying probability distribution. However, diffusion models are known to suffer from an amplified inherent bias from the training data in terms of fairness. While the sampling process of diffusion models can be controlled by conditional guidance, previous works have attempted to find empirical guidance to achieve quantitative fairness. To address this limitation, we propose a fairness-aware sampling method called the attribute switching mechanism for diffusion models. Without additional training, the proposed sampling can obfuscate sensitive attributes in generated data without relying on classifiers. We mathematically prove and experimentally demonstrate the effectiveness of the proposed method on two key aspects: (i) the generation of fair data and (ii) the preservation of the utility of the generated data.
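
A schematic of the attribute switching idea (condition the reverse diffusion on one attribute value for the high-noise steps and switch to the other attribute below a chosen timestep) is given below as a DDPM-style sampling loop. The linear noise schedule, the switching point, and the `denoiser(x, t, attr)` interface are assumptions made for illustration, not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def attribute_switching_sample(denoiser, shape, attr_a, attr_b, t_switch, T=1000):
    """Schematic fairness-aware sampling with attribute switching.

    denoiser(x_t, t, attr) is assumed to predict the noise for a conditional
    diffusion model; the reverse process is conditioned on attr_a for early
    (high-noise) steps and switched to attr_b once t <= t_switch.
    """
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)
    for t in reversed(range(T)):
        attr = attr_a if t > t_switch else attr_b      # the switching step
        eps = denoiser(x, t, attr)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

# Toy usage with a dummy denoiser standing in for a trained conditional model.
dummy = lambda x, t, attr: 0.1 * x + 0.0 * attr
sample = attribute_switching_sample(dummy, (2, 4), attr_a=torch.tensor(0.0),
                                    attr_b=torch.tensor(1.0), t_switch=500)
print(sample.shape)
```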



Paperid:2442
Authors:A. Feder Cooper, Katherine Lee, Madiha Zahrah Choksi, Solon Barocas, Christopher De Sa, James Grimmelmann, Jon Kleinberg, Siddhartha Sen, Baobao Zhang
The Center for Generative AI, Law, and Policy Research Cornell University, The Center for Generative AI, Law, and Policy Research Cornell University, Cornell University, Cornell University Microsoft Research, Cornell University, The Center for Generative AI, Law, and Policy Research Cornell University, Cornell University, Microsoft Research, Syracuse University
Abstract:
Variance in predictions across different trained models is a significant, underexplored source of error in fair binary classification. In practice, the variance on some data examples is so large that decisions can be effectively arbitrary. To investigate this problem, we take an experimental approach and make four overarching contributions. We: 1) Define a metric called self-consistency, derived from variance, which we use as a proxy for measuring and reducing arbitrariness; 2) Develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary; 3) Conduct the largest to-date empirical study of the role of variance (vis-a-vis self-consistency and arbitrariness) in fair binary classification; and, 4) Release a toolkit that makes the US Home Mortgage Disclosure Act (HMDA) datasets easily usable for future research. Altogether, our experiments reveal shocking insights about the reliability of conclusions on benchmark datasets. Most fair binary classification benchmarks are close-to-fair when taking into account the amount of arbitrariness present in predictions, before we even try to apply any fairness interventions. This finding calls into question the practical utility of common algorithmic fairness methods, and in turn suggests that we should reconsider how we choose to measure fairness in binary classification.
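
The self-consistency metric and the abstaining ensemble lend themselves to a compact sketch: train several models on resampled data, measure per-example agreement with the majority vote, and abstain when that agreement is low. Bootstrap-trained logistic regressions and a 0.9 abstention threshold are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def self_consistency(predictions):
    """Fraction of models agreeing with the per-example majority vote."""
    predictions = np.asarray(predictions)          # (n_models, n_examples)
    majority = (predictions.mean(axis=0) >= 0.5).astype(int)
    return np.mean(predictions == majority, axis=0)

def abstaining_predict(models, X, threshold=0.9):
    preds = np.stack([m.predict(X) for m in models])
    sc = self_consistency(preds)
    majority = (preds.mean(axis=0) >= 0.5).astype(int)
    decisions = np.where(sc >= threshold, majority, -1)   # -1 = abstain (arbitrary)
    return decisions, sc

# Toy study: bootstrap-trained models as the source of prediction variance.
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
models = []
for seed in range(11):
    Xb, yb = resample(X, y, random_state=seed)
    models.append(LogisticRegression().fit(Xb, yb))
decisions, sc = abstaining_predict(models, X[:10])
print(decisions, np.round(sc, 2))
```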



Paperid:2443
Authors:Saswat Das, Keyu Zhu, Christine Task, Pascal Van Hentenryck, Ferdinando Fioretto
University of Virginia, Georgia Tech, Knexus Research, Georgia Institute of Technology, University of Virginia
Abstract:
This paper analyzes the privacy of traditional Statistical Disclosure Control (SDC) systems under a differential privacy interpretation. SDCs, such as cell suppression and swapping, promise to safeguard the confidentiality of data and are routinely adopted in data analyses with profound societal and economic impacts. Through a formal analysis and empirical evaluation of demographic data from real households in the U.S., the paper shows that widely adopted SDC systems not only induce vastly larger privacy losses than classical differential privacy mechanisms, but may also come at the cost of reduced accuracy and fairness.



Paperid:2444
Authors:Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A. Jindal, Eduardo Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins, Ethan Steinberg, Ashwin Nayak, Birju Patel, Chia-Chun Chiang, Alison Callahan, Zepeng Huo, Sergios Gatidis, Scott Adams, Oluseyi Fayanju, Shreya J. Shah, Thomas Savage, Ethan Goh, Akshay S. Chaudhari, Nima Aghaeepour, Christopher Sharp, Michael A. Pfeffer, Percy Liang, Jonathan H. Chen, Keith E. Morse, Emma P. Brunskill, Jason A. Fries, Nigam H. Shah
Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA, Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA, Department of Anesthesiology, Peri-operative, and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA, Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA, Department of Radiology, Stanford School of Medicine, Stanford, CA, USA Center for Artificial Intelligence in Medicine and Imaging (AIMI), Stanford University, Stanford, CA, USA Hospital Israelita Albert Einstein, Sao Paulo, SP, Brazil, Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, USA, Department of Electrical Engineering, Stanford School of Engineering, Stanford, CA, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA, Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA, Department of Medicine, Stanford School of Medicine, Stanford, CA, USA, Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA, Department of Neurology, Mayo Clinic, Rochester, MN, USA Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA, Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA Department of Medicine, Stanford School of Medicine, Stanford, CA, USA, Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA, Department of Radiology, Stanford School of Medicine, Stanford, CA, USA, Department of Radiology, Stanford School of Medicine, Stanford, CA, USA, Department of Medicine, Stanford School of Medicine, Stanford, CA, USA, Department of Medicine, Stanford School of Medicine, Stanford, CA, USA, Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA Division of Hospital Medicine, Stanford University, Stanford, CA, USA, Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA, USA, Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA Department of Radiology, Stanford School of Medicine, Stanford, CA, USA Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA, Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA, USA Department of Anesthesiology, Peri-operative, and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA, Department of Medicine, Stanford School of Medicine, Stanford, CA, USA Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA, Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, USA Department of Medicine, Stanford School of Medicine, Stanford, CA, USA, Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA, Stanford Center for Biomedical Informatics Research, Stanford University, 
Stanford, CA, USA Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA Division of Hospital Medicine, Stanford University, Stanford, CA, USA Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA, USA, Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA, Department of Computer Science, Stanford School of Engineering, Stanford, CA, USA Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA, Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA, Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA, USA Department of Medicine, Stanford School of Medicine, Stanford, CA, USA Human-Centered Artificial Intelligence Institute, Stanford University, Stanford, CA, USA Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA, USA
Abstract:
The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and an 8.3% drop in accuracy for GPT-4 when moving from 32k to 2k context lengths. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.



Paperid:2445
Authors:Akash Ghosh, Arkadeep Acharya, Raghav Jain, Sriparna Saha, Aman Chadha, Setu Sinha
Department of Computer Science And Engineering, Indian Institute of Technology Patna, India, Department of Computer Science And Engineering, Indian Institute of Technology Patna, India, Department of Computer Science And Engineering, Indian Institute of Technology Patna, India, Department of Computer Science And Engineering, Indian Institute of Technology Patna, India, Stanford University Amazon AI., Indira Gandhi Institute of Medical Sciences
Abstract:
In the era of modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. Despite the increasing complexity and volume of medical data, existing studies have focused solely on text-based summarization, neglecting the integration of visual information. Recognizing the untapped potential of combining textual queries with visual representations of medical conditions, we introduce the Multimodal Medical Question Summarization (MMQS) Dataset. This dataset, a major contribution of our work, pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs. We also propose a framework that harnesses the power of Contrastive Language-Image Pretraining (CLIP), a multimodal foundation model, and various general-purpose Large Language Models (LLMs), comprising four main modules: a medical disorder identification module, a relevant context generation module, a context filtration module for distilling relevant medical concepts and knowledge, and finally a general-purpose LLM that generates visually aware medical question summaries. Leveraging our MMQS dataset, we showcase how visual cues from images enhance the generation of medically nuanced summaries. This multimodal approach not only enhances the decision-making process in healthcare but also fosters a more nuanced understanding of patient queries, laying the groundwork for future research in personalized and responsive medical care. Disclaimer: The article features graphic medical imagery, a result of the subject's inherent requirements.



Paperid:2446
Authors:Soumitra Ghosh, Gopendra Vikram Singh, Jashn Arora, Asif Ekbal
Fondazione Bruno Kessler- FBK, Trento, Indian Institute of Technology, Patna, International Institute of Information Technology, Hyderabad, Indian Institute of Technology, Patna
Abstract:
In the digital age, cybercrimes, particularly cyber harassment, have become pressing issues, targeting vulnerable individuals like children, teenagers, and women. Understanding the experiences and needs of the victims is crucial for effective support and intervention. Online conversations between victims and virtual harassment counselors (chatbots) offer valuable insights into cyber harassment manifestations (CHMs) and determinants (CHDs). However, the distinction between CHMs and CHDs remains unclear. This research is the first to introduce concrete definitions for CHMs and CHDs, investigating their distinction through automated methods to enable efficient cyber harassment dialogue comprehension. We present a novel dataset, Cyber-MaD, that contains Cyber harassment dialogues manually annotated with Manifestations and Determinants. Additionally, we design an Emotion-informed Contextual Dual attention Convolution Transformer (E-ConDuCT) framework to extract CHMs and CHDs from cyber harassment dialogues. The framework primarily: a) utilizes inherent emotion features through adjective-noun pairs modeled by an autoencoder, b) employs a unique Contextual Dual attention Convolution Transformer to learn contextual insights; and c) incorporates a demarcation module leveraging task-specific emotional knowledge and a discriminator loss function to differentiate manifestations and determinants. E-ConDuCT outperforms the state-of-the-art systems on the Cyber-MaD corpus, showcasing its potential in the extraction of CHMs and CHDs. Furthermore, its robustness is demonstrated on the emotion cause extraction task using the CARES_CEASE-v2.0 dataset of suicide notes, confirming its efficacy across diverse cause extraction objectives. Access the code and data at 1. https://www.iitp.ac.in/~ai-nlp-ml/resources.html#E-ConDuCT-on-Cyber-MaD, 2. https://github.com/Soumitra816/Manifestations-Determinants.



Paperid:2447
Authors:Shadan Golestan, Omid Ardakanian, Pierre Boulanger
University of Alberta, University of Alberta, University of Alberta
Abstract:
Optimizing the configuration and placement of sensors is crucial for reliable fall detection, indoor localization, and activity recognition in assisted living spaces. We propose a novel, sample-efficient approach to find a high-quality sensor placement in an arbitrary indoor space based on grey-box Bayesian optimization and simulation-based evaluation. Our key technical contribution lies in capturing domain-specific knowledge about the spatial distribution of activities and incorporating it into the iterative selection of query points in Bayesian optimization. Considering two simulated indoor environments and a real-world dataset containing human activities and sensor triggers, we show that our proposed method performs better compared to state-of-the-art black-box optimization techniques in identifying high-quality sensor placements, leading to an accurate activity recognition model in terms of F1-score, while also requiring a significantly lower (51.3% on average) number of expensive function queries.



Paperid:2448
Authors:Xuan Gong, Shanglin Li, Yuxiang Bao, Barry Yao, Yawen Huang, Ziyan Wu, Baochang Zhang, Yefeng Zheng, David Doermann
University at Buffalo, Buffalo, NY, USA Harvard Medical School, Boston, MA, USA, Institute of Artificial Intelligence, Hangzhou Research Institute, Beihang University, Beijing, China, Institute of Artificial Intelligence, Hangzhou Research Institute, Beihang University, Beijing, China, University at Buffalo, Buffalo, NY, USA Virginia Tech, Blacksburg, VA, USA, Jarvis Research Center, Tencent YouTu Lab, Shenzhen, China, United Imaging Intelligence, Burlington, MA, USA, Institute of Artificial Intelligence, Hangzhou Research Institute, Beihang University, Beijing, China Zhongguancun Laboratory, Beijing, China Nanchang Institute of Technology, Nanchang, China, Jarvis Research Center, Tencent YouTu Lab, Shenzhen, China, University at Buffalo, Buffalo, NY, USA
Abstract:
Federated learning (FL) is a machine learning paradigm in which distributed local nodes collaboratively train a central model without sharing individually held private data. Existing FL methods either iteratively share local model parameters or deploy co-distillation. However, the former is highly susceptible to private data leakage, and the latter design relies on the prerequisites of task-relevant real data. Instead, we propose a data-free FL framework based on local-to-central collaborative distillation with direct input and output space exploitation. Our design eliminates any requirement of recursive local parameter exchange or auxiliary task-relevant data to transfer knowledge, thereby giving direct privacy control to local users. In particular, to cope with the inherent data heterogeneity across locals, our technique learns to distill input on which each local model produces consensual yet unique results to represent each expertise. Our proposed FL framework achieves notable privacy-utility trade-offs with extensive experiments on image classification and segmentation tasks under various real-world heterogeneous federated learning settings on both natural and medical images. Code is available at https://github.com/lsl001006/FedIOD.



Paperid:2449
Authors:Marc Grimson, Rafael Almeida, Qinru Shi, Yiwei Bai, Héctor Angarita, Felipe Siqueira Pacheco, Rafael Schmitt, Alexander Flecker, Carla P. Gomes
Cornell University, University of Texas Rio Grande Valley, Cornell University, Cornell University, Stanford University, Cornell University, Stanford University, Cornell University, Cornell University
Abstract:
Sustainability challenges inherently involve the consideration of multiple competing objectives. The Pareto frontier – the set of all optimal solutions that cannot be improved with respect to one objective without negatively affecting another – is a crucial decision-making tool for navigating sustainability challenges as it highlights the inherent trade-offs among conflicting objectives. Our research is motivated by the strategic planning of hydropower in the Amazon basin, one of the Earth’s largest and most biodiverse river systems, where the need to increase energy production coincides with the pressing requirement of minimizing detrimental environmental impacts. We investigate an innovative strategy that pairs hydropower with Floating Photovoltaic Solar Panels (FPV). We provide a new extended multi-tree network formulation, which enables the consideration of multiple dam configurations. To address the computational challenge of scaling up the Pareto optimization framework to tackle multiple objectives across the entire Amazon basin, we further enhance the state-of-the-art algorithm for Pareto frontiers in tree-structured networks with two improvements. We introduce affine transformations induced by the sub-frontiers to compute Pareto dominance and provide strategies for merging sub-trees, significantly increasing the pruning of dominated solutions. Our experiments demonstrate considerable speedups, in some cases by more than an order of magnitude, while maintaining optimality guarantees, thus allowing us to more effectively approximate the Pareto frontiers. Moreover, our findings suggest significant shifts towards higher energy values in the Pareto frontier when pairing hybrid hydropower with FPV solutions, potentially amplifying energy production while mitigating adverse impacts.



Paperid:2450
Authors:Parian Haghighat, Denisa Gándara, Lulu Kang, Hadis Anahideh
University of Illinois at Chicago, The University of Texas at Austin, Department of Mathematics and Statistics, University of Massachusetts Amherst, University of Illinois Chicago
Abstract:
Predictive analytics has been widely used in various domains, including education, to inform decision-making and improve outcomes. However, many predictive models are proprietary and inaccessible for evaluation or modification by researchers and practitioners, limiting their accountability and ethical design. Moreover, predictive models are often opaque and incomprehensible to the officials who use them, reducing their trust and utility. Furthermore, predictive models may introduce or exacerbate bias and inequity, as they have done in many sectors of society. Therefore, there is a need for transparent, interpretable, and fair predictive models that can be easily adopted and adapted by different stakeholders. In this paper, we propose a fair predictive model based on multivariate adaptive regression splines (MARS) that incorporates fairness measures in the learning process. MARS is a non-parametric regression model that performs feature selection, handles non-linear relationships, generates interpretable decision rules, and derives optimal splitting criteria on the variables. Specifically, we integrate fairness into the knot optimization algorithm and provide theoretical and empirical evidence of how it results in a fair knot placement. We apply our fairMARS model to real-world data and demonstrate its effectiveness in terms of accuracy and equity. Our paper contributes to the advancement of responsible and ethical predictive analytics for social good.



Paperid:2451
Authors:Erhu He, Yiqun Xie, Alexander Sun, Jacob Zwart, Jie Yang, Zhenong Jin, Yang Wang, Hassan Karimi, Xiaowei Jia
University of Pittsburgh, The University of Maryland, The University of Texas at Austin, U.S. Geological Survey, University of Minnesota, University of Minnesota, University of Pittsburgh, University of Pittsburgh, University of Pittsburgh
Abstract:
Accurate prediction of water quality and quantity is crucial for sustainable development and human well-being. However, existing data-driven methods often suffer from spatial biases in model performance due to heterogeneous data, limited observations, and noisy sensor data. To overcome these challenges, we propose Fair-Graph, a novel graph-based recurrent neural network that leverages interrelated knowledge from multiple rivers to predict water flow and temperature within large-scale stream networks. Additionally, we introduce node-specific graph masks for information aggregation and adaptation to enhance prediction over heterogeneous river segments. To reduce performance disparities across river segments, we introduce a centralized coordination strategy that adjusts training priorities for segments. We evaluate the prediction of water temperature within the Delaware River Basin, and the prediction of streamflow using simulated data from the U.S. National Water Model in the Houston River network. The results showcase improvements in predictive performance and highlight the proposed model's ability to maintain spatial fairness over different river segments.



Paperid:2452
Authors:Liam Hebert, Gaurav Sahu, Yuxuan Guo, Nanda Kishore Sreenivas, Lukasz Golab, Robin Cohen
University of Waterloo, University of Waterloo, University of Waterloo, University of Waterloo, University of Waterloo, University of Waterloo
Abstract:
We present the MultiModal Discussion Transformer (mDT), a novel method for detecting hate speech on online social networks such as Reddit discussions. In contrast to traditional comment-only methods, our approach to labelling a comment as hate speech involves a holistic analysis of text and images grounded in the discussion context. This is done by leveraging graph transformers to capture the contextual relationships in the discussion surrounding a comment and grounding the interwoven fusion layers that combine text and image embeddings instead of processing modalities separately. To evaluate our work, we present a new dataset, HatefulDiscussions, comprising complete multi-modal discussions from multiple online communities on Reddit. We compare the performance of our model to baselines that only process individual comments and conduct extensive ablation studies.



Paperid:2453
Authors:Beizhe Hu, Qiang Sheng, Juan Cao, Yuhui Shi, Yang Li, Danding Wang, Peng Qi
Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Centre for Trusted Internet and Community, National University of Singapore
Abstract:
Detecting fake news requires both a delicate sense of diverse clues and a profound understanding of the real-world background, which remains challenging for detectors based on small language models (SLMs) due to their knowledge and capability limitations. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with fake news detection remains underexplored. In this paper, we investigate the potential of LLMs in fake news detection. First, we conduct an empirical study and find that a sophisticated LLM such as GPT-3.5 could generally expose fake news and provide desirable multi-perspective rationales but still underperforms the basic SLM, fine-tuned BERT. Our subsequent analysis attributes such a gap to the LLM's inability to select and integrate rationales properly to reach a conclusion. Based on these findings, we propose that current LLMs may not substitute fine-tuned SLMs in fake news detection but can be a good advisor for SLMs by providing multi-perspective instructive rationales. To instantiate this proposal, we design an adaptive rationale guidance network for fake news detection (ARG), in which SLMs selectively acquire insights on news analysis from the LLMs' rationales. We further derive a rationale-free version of ARG by distillation, namely ARG-D, which serves cost-sensitive scenarios without querying LLMs. Experiments on two real-world datasets demonstrate that ARG and ARG-D outperform three types of baseline methods, including SLM-based, LLM-based, and combinations of small and large language models.



Paperid:2454
Authors:Yaowei Hu, Yongkai Wu, Lu Zhang
University of Arkansas, Clemson University, University of Arkansas
Abstract:
This paper studies long-term fair machine learning which aims to mitigate group disparity over the long term in sequential decision-making systems. To define long-term fairness, we leverage the temporal causal graph and use the 1-Wasserstein distance between the interventional distributions of different demographic groups at a sufficiently large time step as the quantitative metric. Then, we propose a three-phase learning framework where the decision model is trained on high-fidelity data generated by a deep generative model. We formulate the optimization problem as a performative risk minimization and adopt the repeated gradient descent algorithm for learning. The empirical evaluation shows the efficacy of the proposed method using both synthetic and semi-synthetic datasets.
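The quantitative long-term fairness metric described above can be illustrated with an empirical 1-Wasserstein distance between samples from two groups' interventional distributions at a large time step. The sketch below is a minimal illustration, assuming such samples (here made up) are available, for example from the deep generative model mentioned in the abstract.

from scipy.stats import wasserstein_distance

# Hypothetical samples of the decision-relevant outcome at a large time step T,
# one array per demographic group, drawn under the same intervention.
group_a_outcomes = [0.62, 0.71, 0.55, 0.68, 0.74]
group_b_outcomes = [0.48, 0.52, 0.60, 0.45, 0.57]

# Empirical 1-Wasserstein distance between the two interventional distributions;
# long-term fairness corresponds to driving this disparity toward zero.
disparity = wasserstein_distance(group_a_outcomes, group_b_outcomes)
print(f"1-Wasserstein disparity at time T: {disparity:.3f}")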



Paperid:2455
Authors:Tianyuan Huang, Zejia Wu, Jiajun Wu, Jackelyn Hwang, Ram Rajagopal
Stanford University, University of California, San Diego, Stanford University, Stanford University, Stanford University
Abstract:
Urban transformations have profound societal impact on both individuals and communities at large. Accurately assessing these shifts is essential for understanding their underlying causes and ensuring sustainable urban planning. Traditional measurements often encounter constraints in spatial and temporal granularity, failing to capture real-time physical changes. Street view imagery, which captures the heartbeat of urban spaces from a pedestrian point of view, can serve as a high-definition, up-to-date, and on-the-ground visual proxy of urban change. We curate the largest street view time series dataset to date, and propose an end-to-end change detection model to effectively capture physical alterations in the built environment at scale. We demonstrate the effectiveness of our proposed method by benchmark comparisons with previous literature and implementing it at the city-wide level. Our approach has the potential to supplement existing datasets and serve as a fine-grained and accurate assessment of urban change.



Paperid:2456
Authors:Xu Huang, Chuyao Luo, Bowen Zhang, Huiwei Lin, Xutao Li, Yunming Ye
Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Shenzhen Technology University, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen, Harbin Institute of Technology, Shenzhen
Abstract:
Accurate prediction of meteorological elements, such as temperature and relative humidity, is important to human livelihood, early warning of extreme weather, and urban governance. Recently, neural network-based methods have shown impressive performance in this field. However, most of them are overcomplicated and impenetrable. In this paper, we propose a straightforward and interpretable differential framework, where the key lies in explicitly estimating the evolutionary trends. Specifically, three types of trends are exploited. (1) The proximity trend simply uses the most recent changes. It works well for approximately linear evolution. (2) The sequential trend explores the global information, aiming to capture the nonlinear dynamics. Here, we develop an attention-based trend unit to help memorize long-term features. (3) The flow trend is motivated by the nature of evolution, i.e., the heat or substance flows from one region to another. Here, we design a flow-aware attention unit. It can reflect the interactions by performing spatial attention over flow maps. Finally, we develop a trend fusion module to adaptively fuse the above three trends. Extensive experiments on two datasets demonstrate the effectiveness of our method.
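Of the three trends, the proximity trend is the most mechanical: it extrapolates the most recent change, which suits approximately linear evolution. The sketch below illustrates that idea and stands in for the learned fusion module with a fixed convex combination; the placeholder sequential and flow trends and all weights are assumptions for illustration only.

import numpy as np

def proximity_trend(frames):
    # frames: (T, H, W) recent observations; extrapolate linearly from the last change
    return frames[-1] + (frames[-1] - frames[-2])

def fuse_trends(trends, weights):
    # the paper learns adaptive fusion weights; a fixed convex combination stands in here
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * t for wi, t in zip(w, trends))

frames = np.random.rand(6, 32, 32)        # toy sequence of six past temperature fields
prox = proximity_trend(frames)
seq = frames.mean(axis=0)                 # placeholder for the attention-based sequential trend
flow = frames[-1]                         # placeholder for the flow-aware trend
forecast = fuse_trends([prox, seq, flow], weights=[0.5, 0.3, 0.2])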



Paperid:2457
Authors:Sameer Jain, Sedrick Scott Keh, Shova Chhetri, Karun Dewan, Pablo Izquierdo, Johanna Prussmann, Pooja Shrestha, César Suárez, Zheyuan Ryan Shi, Lei Li, Fei Fang
Carnegie Mellon University, Carnegie Mellon University, World Wide Fund for Nature, World Wide Fund for Nature, World Wide Fund for Nature, World Wide Fund for Nature, World Wide Fund for Nature, World Wide Fund for Nature, University of Pittsburgh, Carnegie Mellon University, Carnegie Mellon University
Abstract:
Environmental conservation organizations routinely monitor news content on conservation in protected areas to maintain situational awareness of developments that can have an environmental impact. Existing automated media monitoring systems require large amounts of data labeled by domain experts, which is only feasible at scale for high-resource languages like English. However, such tools are most needed in the global south where the news of interest is mainly in local low-resource languages, and far fewer experts are available to annotate datasets on a sustainable basis. In this paper, we propose NewsSerow, a method to automatically recognize environmental conservation content in low-resource languages. NewsSerow is a pipeline of summarization, in-context few-shot classification, and self-reflection using large language models (LLMs). Using at most 10 demonstration example news articles in Nepali, NewsSerow significantly outperforms other few-shot methods and can achieve comparable performance with models fully fine-tuned using thousands of examples. With NewsSerow, Organization X has been able to deploy the media monitoring tool in Nepal, significantly reducing their operational burden, and ensuring that AI tools for conservation actually reach the communities that need them the most. NewsSerow has also been deployed for countries with other languages like Colombia.



Paperid:2458
Authors:Doseok Jang, Larry Yan, Lucas Spangher, Costas J. Spanos
University of California, Berkeley, University of California, Berkeley, University of California, Berkeley, University of California, Berkeley
Abstract:
Reinforcement learning (RL) is a powerful tool for optimal control that has found great success in Atari games, the game of Go, robotic control, and building optimization. RL is also very brittle; agents often overfit to their training environment and fail to generalize to new settings. Unsupervised environment design (UED) has been proposed as a solution to this problem, in which the agent trains in environments that have been specially selected to help it learn. Previous UED algorithms focus on trying to train an RL agent that generalizes across a large distribution of environments. This is not necessarily desirable when we wish to prioritize performance in one environment over others. In this work, we will be examining the setting of robust RL building control, where we wish to train an RL agent that prioritizes performing well in normal weather while still being robust to extreme weather conditions. We demonstrate a novel UED algorithm, ActivePLR, that uses uncertainty-aware neural network architectures to generate new training environments at the limit of the RL agent's ability while being able to prioritize performance in a desired base environment. We show that ActivePLR is able to outperform state-of-the-art UED algorithms in minimizing energy usage while maximizing occupant comfort in the setting of building control.



Paperid:2459
Authors:Taeuk Jang, Xiaoqian Wang, Heng Huang
Purdue University, Purdue University, University of Maryland at College Park
Abstract:
Fairness is becoming a rising concern in machine learning. Recent research has discovered that state-of-the-art models are amplifying social bias by making biased predictions toward some population groups (characterized by sensitive features like race or gender). Such unfair prediction among groups raises trust issues and ethical concerns in machine learning, especially for sensitive fields such as employment, criminal justice, and trust score assessment. In this paper, we introduce a new framework to improve machine learning fairness. The goal of our model is to minimize the influence of the sensitive feature from the perspectives of both data input and predictive model. To achieve this goal, we reformulate the data input by eliminating the sensitive information and strengthen model fairness by minimizing the marginal contribution of the sensitive feature. We propose to learn the sensitive-irrelevant input via sampling among features and design an adversarial network to minimize the dependence between the reformulated input and the sensitive information. Empirical results validate that our model achieves comparable or better results than related state-of-the-art methods w.r.t. both fairness metrics and prediction performance.



Paperid:2460
Authors:Xinwei Ji, Xiaomin Chang, Wei Li, Albert Y. Zomaya
The University of Sydney, The University of Sydney, The University of Sydney, The University of Sydney
Abstract:
Pain, a primary reason for seeking medical help, requires essential pain assessment for effective management. Studies have recognized the potential of electrodermal activity (EDA) signals for automated pain assessment, but traditional algorithms often ignore the noise and uncertainty inherent in pain data. To address this, we propose a learning framework predicated on data uncertainty, introducing two forms: a) subject-level stimulation-reaction drift; b) ambiguity in self-reporting scores. We formulate an uncertainty assessment using Heart Rate Variability (HRV) features to guide the selection of responsive pain profiles and reweight subtask importance based on the vagueness of self-reported data. These methods are integrated within an end-to-end neural network learning paradigm, focusing the detector on more accurate insights within the uncertainty domain. Extensive experimentation on both the publicly available BioVid dataset and the proprietary Apon dataset demonstrates our approach's effectiveness. On the BioVid dataset, we achieved a 6% enhancement over the state-of-the-art methodology, and on the Apon dataset, our method outperformed baseline approaches by over 20%.



Paperid:2461
Authors:Ananya Joshi, Tina Townes, Nolan Gormley, Luke Neureiter, Roni Rosenfeld, Bryan Wilder
Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University
Abstract:
Disease control experts inspect public health data streams daily for outliers worth investigating, like those corresponding to data quality issues or disease outbreaks. However, they can only examine a few of the thousands of maximally tied outliers returned by univariate outlier detection methods applied to large-scale public health data streams. To help experts distinguish the most important outliers from these thousands of tied outliers, we propose a new task for algorithms to rank the outputs of any univariate method applied to each of many streams. Our novel algorithm for this task, which leverages hierarchical networks and extreme value analysis, performed the best across traditional outlier detection metrics in a human-expert evaluation using public health data streams. Most importantly, experts have used our open-source Python implementation since April 2023 and report identifying outliers worth investigating 9.1x faster than their prior baseline. Other organizations can readily adapt this implementation to create rankings from the outputs of their tailored univariate methods across large-scale streams.
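The abstract attributes the ranking to hierarchical networks and extreme value analysis. As a rough illustration of the extreme-value component only, the sketch below scores how surprising a flagged value is for one stream by fitting a generalized Pareto distribution to historical exceedances; the threshold, the fitting choices, and the overall scoring rule are assumptions and do not reproduce the paper's algorithm.

import numpy as np
from scipy.stats import genpareto

def evt_surprise(history, current_value, threshold_quantile=0.95):
    # Tail probability of the current value under a generalized Pareto model of
    # historical exceedances; smaller values indicate more surprising observations.
    x = np.asarray(history, dtype=float)
    u = np.quantile(x, threshold_quantile)
    exceedances = x[x > u] - u
    if current_value <= u or len(exceedances) < 10:
        return 1.0  # not extreme relative to this stream's history
    c, loc, scale = genpareto.fit(exceedances, floc=0)
    return float(genpareto.sf(current_value - u, c, loc=loc, scale=scale))

# Ranking idea: sort today's flagged points across streams by ascending surprise score,
# so the most extreme observations (smallest tail probabilities) are reviewed first.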



Paperid:2462
Authors:Opadele Kehinde, Ruth Abdul, Bose Afolabi, Parminder Vir, Corinne Namblard, Ayan Mukhopadhyay, Abiodun Adereni
HelpMum, HelpMum, HelpMum, HelpMum, HelpMum, Vanderbilt University, HelpMum
Abstract:
More than 5 million children under five years die from largely preventable or treatable medical conditions every year, with an overwhelmingly large proportion of deaths occurring in underdeveloped countries with low vaccination uptake. One of the United Nations' sustainable development goals (SDG 3) aims to end preventable deaths of newborns and children under five years of age. We focus on Nigeria, where the rate of infant mortality is appalling. In particular, low vaccination uptake in Nigeria is a major driver of more than 2,000 daily deaths of children under the age of five years. In this paper, we describe our collaboration with government partners in Nigeria to deploy ADVISER: AI-Driven Vaccination Intervention Optimiser. The framework, based on an integer linear program that seeks to maximize the cumulative probability of successful vaccination, is the first successful deployment of an AI-enabled toolchain for optimizing the allocation of health interventions in Nigeria. In this paper, we provide a background of the ADVISER framework and present results, lessons, and success stories of deploying ADVISER to more than 13,000 families in the state of Oyo, Nigeria.
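The abstract describes ADVISER's core as an integer linear program that maximizes the cumulative probability of successful vaccination. The toy sketch below shows that general shape using the PuLP modeling library; the success probabilities, costs, budget, and single-intervention setup are invented for illustration and are not the deployed formulation.

import pulp

# Hypothetical inputs: per-family probability that one candidate intervention leads to
# a completed vaccination, the intervention's cost, and a total budget.
success_prob = [0.8, 0.6, 0.9, 0.4, 0.7]
cost = [5.0, 2.0, 8.0, 1.0, 3.0]
budget = 10.0

model = pulp.LpProblem("vaccination_allocation", pulp.LpMaximize)
assign = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(len(cost))]
model += pulp.lpSum(p * x for p, x in zip(success_prob, assign))    # expected successes
model += pulp.lpSum(c * x for c, x in zip(cost, assign)) <= budget  # budget constraint
model.solve(pulp.PULP_CBC_CMD(msg=False))

chosen_families = [i for i, x in enumerate(assign) if x.value() == 1]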



Paperid:2463
Authors:Astrid Klipfel, Yaël Fregier, Adlane Sayede, Zied Bouraoui
Univ. Artois, UMR 8188, Centre de Recherche en Informatique de Lens (CRIL), F-62300 Lens, France. Univ. Artois, UR 2462, Laboratoire de Mathématiques de Lens (LML), F-62300 Lens, France. Univ. Artois, UMR 8181, Unité de Catalyse et de Chimie du Solide (UCCS), F-62300 Lens, France., Univ. Artois, UR 2462, Laboratoire de Mathématiques de Lens (LML), F-62300 Lens, France., Univ. Artois, UMR 8181, Unité de Catalyse et de Chimie du Solide (UCCS), F-62300 Lens, France., Univ. Artois, UMR 8188, Centre de Recherche en Informatique de Lens (CRIL), F-62300 Lens, France.
Abstract:
Discovering crystal structures with specific chemical properties has become an increasingly important focus in material science. However, current models are limited in their ability to generate new crystal lattices, as they only consider atomic positions or chemical composition. To address this issue, we propose a probabilistic diffusion model that utilizes a geometrically equivariant GNN to consider atomic positions and crystal lattices jointly. To evaluate the effectiveness of our model, we introduce a new generation metric inspired by Frechet Inception Distance, but based on GNN energy prediction rather than InceptionV3 used in computer vision. In addition to commonly used metrics like validity, which assesses the plausibility of a structure, this new metric offers a more comprehensive evaluation of our model's capabilities. Our experiments on existing benchmarks show the significance of our diffusion model. We also show that our method can effectively learn meaningful representations.



Paperid:2464
Authors:Jordi Laguarta Soler, Thomas Friedel, Sherrie Wang
Massachusetts Institute of Technology, Cambridge, MA, USA, PEAT GmbH, Berlin, Germany, Massachusetts Institute of Technology, Cambridge, MA, USA
Abstract:
Accurate crop type maps are an essential source of information for monitoring yield progress at scale, projecting global crop production, and planning effective policies. To date, however, crop type maps remain challenging to create in low- and middle-income countries due to a lack of ground truth labels for training machine learning models. Field surveys are the gold standard in terms of accuracy but require an often-prohibitively large amount of time, money, and statistical capacity. In recent years, street-level imagery, such as Google Street View, KartaView, and Mapillary, has become available around the world. Such imagery contains rich information about crop types grown at particular locations and times. In this work, we develop an automated system to generate crop type ground references using deep learning and Google Street View imagery. The method efficiently curates a set of street-view images containing crop fields, trains a model to predict crop types using either weakly-labeled images from disparate out-of-domain sources or zero-shot labeled street view images with GPT-4V, and combines the predicted labels with remote sensing time series to create a wall-to-wall crop type map. We show that, in Thailand, the resulting country-wide map of rice, cassava, maize, and sugarcane achieves an accuracy of 93%. We publicly release the first-ever crop type map for all of Thailand 2022 at 10m-resolution with no gaps. To our knowledge, this is the first time a 10m-resolution, multi-crop map has been created for any smallholder country. As the availability of roadside imagery expands, our pipeline provides a way to map crop types at scale around the globe, especially in underserved smallholder regions.



Paperid:2465
Authors:Yansheng Li, Bo Dang, Wanchun Li, Yongjun Zhang
Wuhan University, Wuhan University, Wuhan University, Wuhan University
Abstract:
Global surface water detection in very high-resolution (VHR) satellite imagery can directly serve major applications such as refined flood mapping and water resource assessment. Although achievements have been made in detecting surface water in small-size satellite images corresponding to local geographic scales, datasets and methods suitable for mapping and analyzing global surface water have yet to be explored. To encourage the development of this task and facilitate the implementation of relevant applications, we propose the GLH-water dataset that consists of 250 satellite images and 40.96 billion pixels of labeled surface water annotations that are distributed globally and contain water bodies exhibiting a wide variety of types (e.g., rivers, lakes, and ponds in forests, irrigated fields, bare areas, and urban areas). Each image is 12,800 × 12,800 pixels at 0.3-meter spatial resolution. To build a benchmark for GLH-water, we perform extensive experiments employing representative surface water detection models, popular semantic segmentation models, and ultra-high resolution segmentation models. Furthermore, we also design a strong baseline with the novel pyramid consistency loss (PCL) to initially explore this challenge, increasing IoU by 2.4% over the next best baseline. Finally, we implement the cross-dataset generalization and pilot area application experiments, and the superior performance illustrates the strong generalization and practical application value of the GLH-water dataset. Project page: https://jack-bo1220.github.io/project/GLH-water.html
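The pyramid consistency loss (PCL) is only named in the abstract, so the sketch below is one plausible reading, assuming it penalizes disagreement between full-resolution predictions and their coarser pyramid views; the pooling scales and the L1 penalty are assumptions rather than the paper's definition.

import torch
import torch.nn.functional as F

def pyramid_consistency_loss(logits, scales=(2, 4)):
    # logits: (B, C, H, W) segmentation predictions at full resolution
    probs = logits.softmax(dim=1)
    loss = 0.0
    for s in scales:
        coarse = F.avg_pool2d(probs, kernel_size=s)   # coarser pyramid level
        restored = F.interpolate(coarse, size=probs.shape[-2:],
                                 mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(restored, probs)      # penalize cross-scale disagreement
    return loss / len(scales)

# total_loss = segmentation_loss + lambda_pcl * pyramid_consistency_loss(logits)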



Paperid:2466
Authors:Bo Lin, Shoshanna Saxe, Timothy C. Y. Chan
University of Toronto, University of Toronto, University of Toronto
Abstract:
Cycling stress assessment, which quantifies cyclists' perceived stress imposed by the built environment and motor traffic, increasingly informs cycling infrastructure planning and cycling route recommendation. However, currently, calculating cycling stress is slow and data-intensive, which hinders its broader application. In this paper, we propose a deep learning framework to support accurate, fast, and large-scale cycling stress assessments for urban road networks based on street-view images. Our framework features i) a contrastive learning approach that leverages the ordinal relationship among cycling stress labels, and ii) a post-processing technique that enforces spatial smoothness into our predictions. On a dataset of 39,153 road segments collected in Toronto, Canada, our results demonstrate the effectiveness of our deep learning framework and the value of using image data for cycling stress assessment in the absence of high-quality road geometry and motor traffic data.



Paperid:2467
Authors:Han Liu, Changya Li, Xiaotong Zhang, Feng Zhang, Wei Wang, Fenglong Ma, Hongyang Chen, Hong Yu, Xianchao Zhang
Dalian University of Technology, Dalian University of Technology, Dalian University of Technology, Peking University, Shenzhen MSU-BIT University, The Pennsylvania State University, Zhejiang Lab, Dalian University of Technology, Dalian University of Technology
Abstract:
Depression detection is a challenging and crucial task in psychological illness diagnosis. Utilizing online user posts to predict whether a user suffers from depression seems an effective and promising direction. However, existing methods suffer from either poor interpretability brought by the black-box models or underwhelming performance caused by the completely separate two-stage model structure. To alleviate these limitations, we propose a novel capsule network integrated with contrastive learning for depression detection (DeCapsNet). The highlights of DeCapsNet can be summarized as follows. First, it extracts symptom capsules from user posts by leveraging meticulously designed symptom descriptions, and then distills them into class-indicative depression capsules. The overall workflow is in an explicit hierarchical reasoning manner and can be well interpreted by the Patient Health Questionnaire-9 (PHQ-9), which is one of the most widely adopted questionnaires for depression diagnosis. Second, it integrates with contrastive learning, which can facilitate the embeddings from the same class to be pulled closer, while simultaneously pushing the embeddings from different classes apart. In addition, by adopting the end-to-end training strategy, it does not necessitate additional data annotation, and mitigates the potential adverse effects from the upstream task to the downstream task. Extensive experiments on three widely-used datasets show that in both within-dataset and cross-dataset scenarios our proposed method outperforms other strong baselines significantly.
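The contrastive component described above pulls same-class embeddings together while pushing different classes apart. The sketch below is a generic supervised contrastive loss with that behavior, assuming L2-normalized post embeddings and integer class labels; it is not necessarily the exact loss used in DeCapsNet.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    # embeddings: (N, D) post/capsule embeddings; labels: (N,) class ids
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()        # numerical stability
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    exp_sim = sim.exp().masked_fill(self_mask, 0.0)
    log_prob = sim - exp_sim.sum(dim=1, keepdim=True).log()
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=1).div(pos_count).mean()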



Paperid:2468
Authors:Lydia T. Liu, Solon Barocas, Jon Kleinberg, Karen Levy
Cornell University Princeton University, Microsoft Research Cornell University, Cornell University, Cornell University
Abstract:
Predicting future outcomes is a prevalent application of machine learning in social impact domains. Examples range from predicting student success in education to predicting disease risk in healthcare. Practitioners recognize that the ultimate goal is not just to predict but to act effectively. Increasing evidence suggests that relying on outcome predictions for downstream interventions may not have desired results. In most domains there exists a multitude of possible interventions for each individual, making the challenge of taking effective action more acute. Even when causal mechanisms connecting the individual's latent states to outcomes are well understood, in any given instance (a specific student or patient), practitioners still need to infer, from budgeted measurements of latent states, which of many possible interventions will be most effective for this individual. With this in mind, we ask: when are accurate predictors of outcomes helpful for identifying the most suitable intervention? Through a simple model encompassing actions, latent states, and measurements, we demonstrate that pure outcome prediction rarely results in the most effective policy for taking actions, even when combined with other measurements. We find that except in cases where there is a single decisive action for improving the outcome, outcome prediction never maximizes "action value", the utility of taking actions. Making measurements of actionable latent states, where specific actions lead to desired outcomes, may considerably enhance the action value compared to outcome prediction, and the degree of improvement depends on action costs and the outcome model. This analysis emphasizes the need to go beyond generic outcome prediction in interventional settings by incorporating knowledge of plausible actions and latent states.



Paperid:2469
Authors:Xiangrui Liu, Xiaoou Liu, Shan Du, Julian Cheng
The University of British Columbia, The University of British Columbia, The University of British Columbia, The University of British Columbia
Abstract:
Marine mammals and their ecosystem face significant threats from, for example, military active sonar and marine transportation. To mitigate this harm, early detection and classification of marine mammals are essential. While recent efforts have utilized spectrogram analysis and machine learning techniques, there remain challenges in their efficiency. Therefore, we propose a novel knowledge distillation framework, named XCFSMN, for this problem. We construct a teacher model that fuses the features extracted from an X-vector extractor, a DenseNet, and a Cross-Covariance attended compact Feed-Forward Sequential Memory Network (cFSMN). The teacher model transfers knowledge to a simpler cFSMN model through a temperature-cooling strategy for efficient learning. Compared to multiple convolutional neural network backbones and transformers, the proposed framework achieves state-of-the-art efficiency and performance. The resulting model is approximately 20 times smaller and its inference time can be 10 times shorter without affecting the model’s accuracy.



Paperid:2470
Authors:Zhi Liu, Sarah Rankin, Nikhil Garg
Cornell Tech The New York Public Library, The New York Public Library, Cornell Tech
Abstract:
Public libraries are an essential public good. We ask: are urban library systems providing equitable service to all residents, in terms of the books they have access to and check out? If not, what causes disparities: heterogeneous book collections, resident behavior and access, and/or operational policies? Existing methods leverage only system-level outcome data (such as overall checkouts per branch), and so cannot distinguish between these factors. As a result, it is difficult to use their results to guide interventions to increase equitable access. We propose a Bayesian framework to characterize book checkout behavior across multiple branches of a library system, learning heterogeneous book popularity, overall branch demand, and usage of the online hold system, while controlling for book availability. In collaboration with the New York Public Library, we apply our framework to granular data consisting of over 400,000 checkouts during 2022. We first show that our model significantly outperforms baseline methods in predicting checkouts at the book-branch level. Next, we study spatial and socioeconomic disparities. We show that disparities are largely driven by disparate use of the online holds system, which allows library patrons to receive books from any other branch through an online portal. This system thus leads to a large outflow of popular books from branches in lower income neighborhoods to those in high income ones. Finally, we illustrate the use of our model and insights to quantify the impact of potential interventions, such as changing how books are internally routed between branches to fulfill hold requests.



Paperid:2471
Authors:Antoine Louis, Gijs van Dijck, Gerasimos Spanakis
Maastricht University, Maastricht University, Maastricht University
Abstract:
Many individuals are likely to face a legal dispute at some point in their lives, but their lack of understanding of how to navigate these complex issues often renders them vulnerable. The advancement of natural language processing opens new avenues for bridging this legal literacy gap through the development of automated legal aid systems. However, existing legal question answering (LQA) approaches often suffer from a narrow scope, being either confined to specific legal domains or limited to brief, uninformative responses. In this work, we propose an end-to-end methodology designed to generate long-form answers to any statutory law questions, utilizing a "retrieve-then-read" pipeline. To support this approach, we introduce and release the Long-form Legal Question Answering (LLeQA) dataset, comprising 1,868 expert-annotated legal questions in the French language, complete with detailed answers rooted in pertinent legal provisions. Our experimental results demonstrate promising performance on automatic evaluation metrics, but a qualitative analysis uncovers areas for refinement. As one of the only comprehensive, expert-annotated long-form LQA datasets, LLeQA has the potential to not only accelerate research towards resolving a significant real-world issue, but also act as a rigorous benchmark for evaluating NLP models in specialized domains. We publicly release our code, data, and models.



Paperid:2472
Authors:Pratheeksha Nair, Javin Liu, Catalina Vajiac, Andreas Olligschlaeger, Duen Horng Chau, Mirela Cazzolato, Cara Jones, Christos Faloutsos, Reihaneh Rabbany
McGill University Mila - Quebec AI Institute, Mila - Quebec AI Institute, Carnegie Mellon University, i3 LLC, Georgia Institute of Technology, University of São Paulo, Marinus Analytics, Carnegie Mellon University, McGill University Mila - Quebec AI Institute
Abstract:
Human trafficking (HT) for forced sexual exploitation, often described as modern-day slavery, is a pervasive problem that affects millions of people worldwide. Perpetrators of this crime post advertisements (ads) on behalf of their victims on adult service websites (ASW). These websites typically contain hundreds of thousands of ads including those posted by independent escorts, massage parlor agencies and spammers (fake ads). Detecting suspicious activity in these ads is difficult and developing data-driven methods is challenging due to the hard-to-label, complex and sensitive nature of the data. In this paper, we propose T-Net, which unlike previous solutions, formulates this problem as weakly supervised classification. Since it takes several months to years to investigate a case and obtain a single definitive label, we design domain-specific signals or indicators that provide weak labels. T-Net also looks into connections between ads and models the problem as a graph learning task instead of classifying ads independently. We show that T-Net outperforms all baselines on a real-world dataset of ads by 7% average weighted F1 score. Given that this data contains personally identifiable information, we also present a realistic data generator and provide the first publicly available dataset in this domain which may be leveraged by the wider research community.



Paperid:2473
Authors:Nicola Neophytou, Afaf Taik, Golnoosh Farnadi
Mila, Quebec AI Institute Université de Montréal, Mila, Quebec AI Institute Université de Montréal, Mila, Quebec AI Institute Université de Montréal McGill University
Abstract:
The aftermath of the Covid-19 pandemic saw more severe outcomes for racial minority groups and economically-deprived communities. Such disparities can be explained by several factors, including unequal access to healthcare, as well as the inability of low income groups to reduce their mobility due to work or social obligations. Moreover, senior citizens were found to be more susceptible to severe symptoms, largely due to age-related health reasons. Adapting vaccine distribution strategies to consider a range of demographics is therefore essential to address these disparities. In this study, we propose a novel approach that utilizes influence maximization (IM) on mobility networks to develop vaccination strategies which incorporate demographic fairness. By considering factors such as race, social status, age, and associated risk factors, we aim to optimize vaccine distribution to achieve various fairness definitions for one or more protected attributes at a time. Through extensive experiments conducted on Covid-19 spread in three major metropolitan areas across the United States, we demonstrate the effectiveness of our proposed approach in reducing disease transmission and promoting fairness in vaccination distribution.



Paperid:2474
Authors:Gustavo Perez, Subhransu Maji, Daniel Sheldon
University of Massachusetts, Amherst, University of Massachusetts, Amherst, University of Massachusetts, Amherst
Abstract:
Many applications use computer vision to detect and count objects in massive image collections. However, automated methods may fail to deliver accurate counts, especially when the task is very difficult or requires a fast response time. For example, during disaster response, aid organizations aim to quickly count damaged buildings in satellite images to plan relief missions, but pretrained building and damage detectors often perform poorly due to domain shifts. In such cases, there is a need for human-in-the-loop approaches to accurately count with minimal human effort. We propose DISCount -- a detector-based importance sampling framework for counting in large image collections. DISCount uses an imperfect detector and human screening to estimate low-variance unbiased counts. We propose techniques for counting over multiple spatial or temporal regions using a small amount of screening and estimate confidence intervals. This enables end-users to stop screening when estimates are sufficiently accurate, which is often the goal in real-world applications. We demonstrate our method with two applications: counting birds in radar imagery to understand responses to climate change, and counting damaged buildings in satellite imagery for damage assessment in regions struck by a natural disaster. On the technical side we develop variance reduction techniques based on control variates and prove the (conditional) unbiasedness of the estimators. DISCount leads to a 9-12x reduction in the labeling costs to obtain the same error rates compared to naive screening for tasks we consider, and surpasses alternative covariate-based screening approaches.
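DISCount's central idea is to combine a cheap, imperfect detector with a small amount of human screening to obtain unbiased counts and confidence intervals. The sketch below is a simplified difference (control-variate) estimator under simple random sampling of regions; the paper's importance-sampling design and variance-reduction details are not reproduced here.

import numpy as np

def screened_count_estimate(detector_counts, sampled_idx, human_counts):
    # detector_counts: per-region detector counts (length N)
    # sampled_idx: indices of a simple random sample of n regions screened by a human
    # human_counts: true counts for those sampled regions
    detector_counts = np.asarray(detector_counts, dtype=float)
    N, n = len(detector_counts), len(sampled_idx)
    correction = np.asarray(human_counts, dtype=float) - detector_counts[sampled_idx]
    estimate = detector_counts.sum() + N * correction.mean()
    se = N * correction.std(ddof=1) / np.sqrt(n)    # normal-approximation standard error
    return estimate, (estimate - 1.96 * se, estimate + 1.96 * se)

Screening can stop once the interval is tight enough for the application, which matches the usage pattern the abstract describes.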



Paperid:2475
Authors:Gaurab Pokharel, Sanmay Das, Patrick Fowler
George Mason University, George Mason University, Washington University in St. Louis
Abstract:
Street-level bureaucrats interact directly with people on behalf of government agencies to perform a wide range of functions, including, for example, administering social services and policing. A key feature of street-level bureaucracy is that the civil servants, while tasked with implementing agency policy, are also granted significant discretion in how they choose to apply that policy in individual cases. Using that discretion could be beneficial, as it allows for exceptions to policies based on human interactions and evaluations, but it could also allow biases and inequities to seep into important domains of societal resource allocation. In this paper, we use machine learning techniques to understand street-level bureaucrats' behavior. We leverage a rich dataset that combines demographic and other information on households with information on which homelessness interventions they were assigned during a period when assignments were not formulaic. We find that caseworker decisions in this time are highly predictable overall, and some, but not all of this predictivity can be captured by simple decision rules. We theorize that the decisions not captured by the simple decision rules can be considered applications of caseworker discretion. These discretionary decisions are far from random in both the characteristics of such households and in terms of the outcomes of the decisions. Caseworkers typically only apply discretion to households that would be considered less vulnerable. When they do apply discretion to assign households to more intensive interventions, the marginal benefits to those households are significantly higher than would be expected if the households were chosen at random; there is no similar reduction in marginal benefit to households that are discretionarily allocated less intensive interventions, suggesting that caseworkers are using their knowledge and experience to improve outcomes for households experiencing homelessness.



Paperid:2476
Authors:Nihar Ranja Sahoo, Gyana Prakash Beria, Pushpak Bhattacharyya
Indian Institute of Technology, Bombay, Indian Institute of Technology, Bombay, Indian Institute of Technology, Bombay
Abstract:
Hate speech (HS) is a growing concern in many parts of the world, including India, where it has led to numerous instances of violence and discrimination. The development of effective counter-narratives (CNs) is a critical step in combating hate speech, but there is a lack of research in this area, especially in non-English languages. In this paper, we introduce a new dataset, IndicCONAN, of counter-narratives against hate speech in Hindi and Indian English. We propose a scalable human-in-the-loop approach for generating counter-narratives by an auto-regressive language model through a machine generation - human correction cycle, where the model uses augmented data from previous cycles to generate new training samples. These newly generated samples are then reviewed and edited by annotators, leading to further model refinement. The dataset consists of over 2,500 examples of counter-narratives each in both English and Hindi, corresponding to various hate speeches in the Indian context. We also present a framework for generating CNs conditioned on a specific CN type, with a mean perplexity of 3.85 for English and 3.70 for Hindi, a mean toxicity score of 0.04 for English and 0.06 for Hindi, and a mean diversity of 0.08 for English and 0.14 for Hindi. Our dataset and framework provide valuable resources for researchers and practitioners working to combat hate speech in the Indian context.



Paperid:2477
Authors:Soumyendu Sarkar, Avisek Naug, Ricardo Luna, Antonio Guillen, Vineet Gundecha, Sahand Ghorbanpour, Sajad Mousavi, Dejan Markovikj, Ashwin Ramesh Babu
Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise
Abstract:
As machine learning workloads are significantly increasing energy consumption, sustainable data centers with low carbon emissions are becoming a top priority for governments and corporations worldwide. This requires a paradigm shift in optimizing power consumption in cooling and IT loads, shifting flexible loads based on the availability of renewable energy in the power grid, and leveraging battery storage from the uninterrupted power supply in data centers, using collaborative agents. The complex association between these optimization strategies and their dependencies on variable external factors like weather and the power grid carbon intensity makes this a hard problem. Currently, a real-time controller to optimize all these goals simultaneously in a dynamic real-world setting is lacking. We propose a Data Center Carbon Footprint Reduction (DC-CFR) multi-agent Reinforcement Learning (MARL) framework that optimizes data centers for the multiple objectives of carbon footprint reduction, energy consumption, and energy cost. The results show that the DC-CFR MARL agents effectively resolved the complex interdependencies in optimizing cooling, load shifting, and energy storage in real-time for various locations under real-world dynamic weather and grid carbon intensity conditions. DC-CFR significantly outperformed the industry-standard ASHRAE controller with a considerable reduction in carbon emissions (14.5%), energy usage (14.4%), and energy cost (13.7%) when evaluated over one year across multiple geographical regions.



Paperid:2478
Authors:Travis Seale-Carlisle, Saksham Jain, Courtney Lee, Caroline Levenson, Swathi Ramprasad, Brandon Garrett, Sudeepa Roy, Cynthia Rudin, Alexander Volfovsky
University of Aberdeen, University of Washington, Duke University, Duke University, Duke University, Duke University, Duke University, Duke University, Duke University
Abstract:
After a person is arrested and charged with a crime, they may be released on bail and required to participate in a community supervision program while awaiting trial. These 'pre-trial programs' are common throughout the United States, but very little research has demonstrated their effectiveness. Researchers have emphasized the need for more rigorous program evaluation methods, which we introduce in this article. We describe a program evaluation pipeline that uses recent interpretable machine learning techniques for observational causal inference, and demonstrate these techniques in a study of a pre-trial program in Durham, North Carolina. Our findings show no evidence that the program either significantly increased or decreased the probability of new criminal charges. If these findings replicate, the criminal-legal system needs to either improve pre-trial programs or consider alternatives to them. The simplest option is to release low-risk individuals back into the community without subjecting them to any restrictions or conditions. Another option is to assign individuals to pre-trial programs that incentivize pro-social behavior. We believe that the techniques introduced here can provide researchers the rigorous tools they need to evaluate these programs.



Paperid:2479
Authors:Eunseon Seong, Harim Lee, Dong-Kyu Chae
Hanyang University, Hanyang University, Hanyang University
Abstract:
With the widespread adoption of IoT, wearable devices, and sensors, time series data from human subjects are significantly increasing in the healthcare domain. Due to the laborious nature of manual annotation in time series data and the requirement for human experts, self-supervised learning methods have been attempted to alleviate limited-label situations. While existing self-supervised methods have been successful in achieving performance comparable to fully supervised methods, there are still some limitations that need to be addressed, considering the nature of time series data from human subjects: In real-world clinical settings, data labels (e.g., sleep stages) are usually annotated at the subject level, and there is a substantial variation in patterns between subjects. Thus, a model should be designed to deal with not only label scarcity but also the subject-wise nature of the data to ensure high performance in real-world scenarios. To mitigate these issues, we propose a novel self-supervised learning framework for human subject time series data: Subject-Aware Time Series Clustering (SA-TSC). In the unsupervised representation learning phase, SA-TSC adopts a subject-wise learning strategy rather than instance-wise learning, which randomly samples data instances from different subjects within the batch during training. Specifically, we generate subject-graphs with our graph construction method based on Gumbel-Softmax and perform graph spectral clustering on each subject-graph. In addition, we utilize graph neural networks to capture dependencies between channels and design our own graph learning module motivated by a self-supervised loss. Experimental results show the outstanding performance of our SA-TSC with the limited & subject-wise label setting, leading to its high applicability to the healthcare industry. The code is available at: https://github.com/DILAB-HYU/SA-TSC
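One concrete step named above is graph spectral clustering on each subject-graph. The sketch below runs spectral clustering on a precomputed affinity matrix for a single subject; the random affinity matrix, its interpretation (e.g., channel-to-channel similarity), and the number of clusters are assumptions, and the paper's actual graphs come from a learned Gumbel-Softmax construction rather than this shortcut.

import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical symmetric, non-negative affinity matrix for one subject (8 nodes).
affinity = np.random.rand(8, 8)
affinity = (affinity + affinity.T) / 2
np.fill_diagonal(affinity, 1.0)

clusters = SpectralClustering(n_clusters=3, affinity="precomputed",
                              random_state=0).fit_predict(affinity)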



Paperid:2480
Authors:Omar Sharif, Madhusudan Basak, Tanzia Parvin, Ava Scharfstein, Alphonso Bradham, Jacob T. Borodovsky, Sarah E. Lord, Sarah M. Preum
Department of Computer Science, Dartmouth College, Department of Computer Science, Dartmouth College, Department of Computer Science and Engineering, Chittagong University of Engineering and Technology (CUET), Bangladesh, Department of Computer Science, Dartmouth College, Department of Computer Science, Dartmouth College, Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College Department of Psychiatry, Dartmouth Health, Department of Computer Science, Dartmouth College Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College
Abstract:
Social media sites have become a popular platform for individuals to seek and share health information. Despite the progress in natural language processing for social media mining, a gap remains in analyzing health-related texts in social discourse in the context of events. Event-driven analysis can offer insights into different facets of healthcare at an individual and collective level, including treatment options, misconceptions, knowledge gaps, etc. This paper presents a paradigm to characterize health-related information-seeking in social discourse through the lens of events. Events here are broad categories defined with domain experts that capture the trajectory of treatment/medication. To illustrate the value of this approach, we analyze Reddit posts regarding medications for Opioid Use Disorder (OUD), a critical global health concern. To the best of our knowledge, this is the first attempt to define event categories for characterizing information-seeking in OUD social discourse. Guided by domain experts, we develop TREAT-ISE, a novel multilabel treatment information-seeking event dataset for analyzing online discourse within an event-based framework. This dataset contains Reddit posts on information-seeking events related to recovery from OUD, where each post is annotated based on the type of events. We also establish a strong performance benchmark (77.4% F1 score) for the task by employing several machine learning and deep learning classifiers. Finally, we thoroughly investigate the performance and errors of ChatGPT on this task, providing valuable insights into the LLM's capabilities and ongoing characterization efforts.



Paperid:2481
Authors:Ajitesh Srivastava, Juan Marcos Ramirez, Sergio Díaz-Aranda, Jose Aguilar, Antonio Fernández Anta, Antonio Ortega, Rosa Elvira Lillo
University of Southern California, IMDEA Networks Institute, IMDEA Networks Institute Universidad Carlos III, IMDEA Networks Institute, IMDEA Networks Institute, University of Southern California, Universidad Carlos III
Abstract:
Indirect surveys, in which respondents provide information about other people they know, have been proposed for estimating (nowcasting) the size of a hidden population where privacy is important or the hidden population is hard to reach. Examples include estimating casualties in an earthquake, conditions among female sex workers, and the prevalence of drug use and infectious diseases. The Network Scale-up Method (NSUM) is the classical approach to developing estimates from indirect surveys, but it was designed for one-shot surveys. Further, it requires certain assumptions and asking for or estimating the number of individuals in each respondent's network. In recent years, surveys have been increasingly deployed online and can collect data continuously (e.g., COVID-19 surveys on Facebook during much of the pandemic). Conventional NSUM can be applied to these scenarios by analyzing the data independently at each point in time, but this misses the opportunity of leveraging the temporal dimension. We propose to use the responses from indirect surveys collected over time and develop analytical tools (i) to prove that indirect surveys can provide better estimates for the trends of the hidden population over time, as compared to direct surveys, and (ii) to identify appropriate temporal aggregations to improve the estimates. We demonstrate through extensive simulations that our approach outperforms traditional NSUM and direct surveying methods. We also empirically demonstrate the superiority of our approach on a real indirect survey dataset of COVID-19 cases.
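As background, the classical NSUM point estimate that the paper improves on can be written as the population size N multiplied by the ratio of reported hidden-population contacts to reported network sizes. A minimal sketch with illustrative variable names follows; the temporal aggregation tools proposed in the paper are not shown.

```python
# Minimal sketch of the classical Network Scale-up Method (NSUM) estimator
# for a single survey wave; variable names are illustrative.
import numpy as np

def nsum_estimate(hidden_counts, network_sizes, population_size):
    """hidden_counts: per-respondent count of known hidden-population members.
    network_sizes: per-respondent estimated personal network size.
    population_size: total population size N."""
    hidden_counts = np.asarray(hidden_counts, dtype=float)
    network_sizes = np.asarray(network_sizes, dtype=float)
    return population_size * hidden_counts.sum() / network_sizes.sum()

# toy example: 3 respondents, total population of 10,000
print(nsum_estimate([2, 0, 1], [150, 120, 200], 10_000))  # ~63.8
```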



Paperid:2482
Authors:Errikos Streviniotis, Athina Georgara, Filippo Bistaffa, Georgios Chalkiadakis
Technical University of Crete Athena Research Center, Artificial Intelligence Research Institute (IIIA), CSIC, Artificial Intelligence Research Institute (IIIA), CSIC, Technical University of Crete
Abstract:
In recent years, popular tourist destinations have faced overtourism, and local communities suffer from its consequences in several ways; among others, overpricing and profiteering deeply harm local societies and economies. In this paper we focus on the problem of determining fair hotel room prices. Specifically, we put forward a dynamic pricing policy in which the price of a room depends not only on the demand of the hotel it belongs to but also on the demand of (i) similar rooms in the area and (ii) their hotels. To this end, we model our setting as a cooperative game and exploit an appropriate game-theoretic solution concept that promotes fairness on both the customers' and the providers' side. Our simulation results, involving price adjustments across real-world hotel datasets, confirm that ours is a fair dynamic pricing policy that avoids both over- and under-pricing hotel rooms.



Paperid:2483
Authors:Zhaohong Sun, Naoyuki Yamada, Yoshihiro Takenami, Daisuke Moriwaki, Makoto Yokoo
Kyushu University CyberAgent Inc., CyberAgent Inc., CyberAgent Inc., CyberAgent Inc., Kyushu University
Abstract:
We study a practical two-sided matching problem of allocating children to daycare centers, which has significant social implications. We are cooperating with several municipalities in Japan, and our goal is to devise a reliable and trustworthy clearing algorithm to deal with the problem. In this paper, we describe the design of our new algorithm that minimizes the number of unmatched children while ensuring stability. We evaluate our algorithm using real-life data sets, and experimental results demonstrate that our algorithm surpasses the commercial software that currently dominates the market in terms of both the number of matched children and the number of blocking coalitions (measuring stability). Our findings have been reported to local governments, and some are considering adopting our proposed algorithm in the near future, instead of the existing solution. Moreover, our model and algorithm have broader applicability to other important matching markets, such as hospital-doctor matching with couples and school choice with siblings.



Paperid:2484
Authors:Nicolas Troquard, Martina De Sanctis, Paola Inverardi, Patrizio Pelliccione, Gian Luca Scoccia
Gran Sasso Science Institute (GSSI), L'Aquila, Gran Sasso Science Institute (GSSI), L'Aquila, Gran Sasso Science Institute (GSSI), L'Aquila, Gran Sasso Science Institute (GSSI), L'Aquila, Gran Sasso Science Institute (GSSI), L'Aquila
Abstract:
The rise of AI-based and autonomous systems is raising concerns and apprehension due to potential negative repercussions arising from their behavior or decisions. These systems must be designed to comply with the human contexts in which they will operate. To this extent, Townsend et al. (2022) introduce the concept of SLEEC (social, legal, ethical, empathetic, or cultural) rules that aim to facilitate the formulation, verification, and enforcement of the rules AI-based and autonomous systems should obey. They lay out a methodology to elicit them and to let philosophers, lawyers, domain experts, and others formulate them in natural language. To enable their effective use in AI systems, it is necessary to translate these rules systematically into a formal language that supports automated reasoning. In this study, we first conduct a linguistic analysis of the SLEEC rules pattern, which justifies the translation of SLEEC rules into classical logic. Then we investigate the computational complexity of reasoning about SLEEC rules and show how logic programming frameworks can be employed to implement SLEEC rules in practical scenarios. The result is a readily applicable strategy for implementing AI systems that conform to norms expressed as SLEEC rules.



Paperid:2485
Authors:Catalina Vajiac, Arun Frey, Joachim Baumann, Abigail Smith, Kasun Amarasinghe, Alice Lai, Kit T. Rodolfa, Rayid Ghani
Carnegie Mellon University, Stanford University, University of Zurich Zurich University of Applied Sciences, NORC at the University of Chicago, Carnegie Mellon University, Carnegie Mellon University, Stanford University, Carnegie Mellon University
Abstract:
Rental assistance programs provide individuals with financial assistance to prevent housing instability caused by evictions and to avert homelessness. Since these programs operate under resource constraints, they must decide whom to prioritize. Typically, funding is distributed through a reactive allocation process that does not systematically consider the risk of future homelessness. We partnered with Anonymous County (PA) to explore a proactive and preventative allocation approach that prioritizes individuals facing eviction based on their risk of future homelessness. Our ML models, trained on state and county administrative data, accurately identify at-risk individuals, outperforming simpler prioritization approaches by at least 20% while meeting our equity and fairness goals across race and gender. Furthermore, our approach would reach 28% of individuals who are overlooked by the current process and end up homeless. Beyond improvements to the rental assistance program in Anonymous County, this study can inform the development of evidence-based decision support tools in similar contexts, including lessons about data needs, model design, evaluation, and field validation.



Paperid:2486
Authors:Tanvi Verma, Linh Le Dinh, Nicholas Tan, Xinxing Xu, Chingyu Cheng, Yong Liu
Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore, Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore
Abstract:
Visual perimetry is an important eye examination that helps detect vision problems caused by ocular or neurological conditions. During the test, a patient's gaze is fixed at a specific location while light stimuli of varying intensities are presented in central and peripheral vision. Based on the patient's responses to the stimuli, the visual field mapping and sensitivity are determined. However, maintaining high levels of concentration throughout the test can be challenging for patients, leading to increased examination times and decreased accuracy. In this work, we present RLPeri, a reinforcement learning-based approach to optimize visual perimetry testing. By determining the optimal sequence of locations and initial stimulus values, we aim to reduce the examination time without compromising accuracy. Additionally, we incorporate reward shaping techniques to further improve the testing performance. To monitor the patient's responses over time during testing, we represent the test's state as a pair of 3D matrices. We apply two different convolutional kernels to extract spatial features across locations as well as features across different stimulus values for each location. Through experiments, we demonstrate that our approach results in a 10-20% reduction in examination time while maintaining accuracy, as compared to state-of-the-art methods. With the presented approach, we aim to make visual perimetry testing more efficient and patient-friendly, while still providing accurate results.
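To make the state representation concrete, here is a hedged sketch of one plausible reading of the description above: a pair of 3D matrices (location grid by stimulus values) processed by two convolutional kernels, one across locations and one across stimulus values. The module name, tensor shapes, and kernel sizes are assumptions, not the RLPeri architecture.

```python
# Illustrative toy encoder (not the RLPeri model): one kernel convolves across
# the location grid, the other across the stimulus-value axis.
import torch
import torch.nn as nn

class PerimetryStateEncoder(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        # input: (batch, 2, grid_h, grid_w, n_stimulus_values) -- pair of 3D matrices
        self.spatial = nn.Conv3d(2, hidden, kernel_size=(3, 3, 1), padding=(1, 1, 0))
        self.value = nn.Conv3d(2, hidden, kernel_size=(1, 1, 3), padding=(0, 0, 1))

    def forward(self, x):
        # concatenate location-wise and stimulus-value-wise feature maps
        return torch.cat([self.spatial(x), self.value(x)], dim=1)

enc = PerimetryStateEncoder()
out = enc(torch.zeros(1, 2, 8, 9, 5))  # toy input shape
```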



Paperid:2487
Authors:Yifan Wang, Qining Zhang, Lei Ying, Chuan Zhou
University of Michigan, University of Michigan, University of Michigan, University of Michigan
Abstract:
Lung cancer remains the leading cause of cancer-related death worldwide, and early diagnosis of lung cancer is critical for improving the survival rate of patients. Performing annual low-dose computed tomography (LDCT) screening among high-risk populations is the primary approach for early diagnosis. However, after each screening, whether to continue monitoring (with follow-up screenings) or to order a biopsy for diagnosis remains a challenging decision to make. Continuing with follow-up screenings may lead to delayed diagnosis, but ordering a biopsy without sufficient evidence incurs unnecessary risk and cost. In this paper, we tackle the problem with an optimal stopping approach. Our proposed algorithm, called EarlyStop-RL, utilizes the structure of the Snell envelope for optimal stopping and model-free deep reinforcement learning for making diagnosis decisions. Through evaluating our algorithm on a commonly used clinical trial dataset (the National Lung Screening Trial), we demonstrate that EarlyStop-RL has the potential to greatly enhance risk assessment and early diagnosis of lung cancer, surpassing the performance of two widely adopted clinical models, namely the Lung-RADS and the Brock model.
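For reference, the Snell envelope mentioned above can be computed by backward induction on a small finite-horizon model; the sketch below illustrates only that textbook recursion (U_T = Z_T, U_t = max(Z_t, E[U_{t+1}])) and not the EarlyStop-RL algorithm itself.

```python
# Backward induction for the Snell envelope on a finite-horizon Markov model.
# Purely illustrative; states, rewards, and transitions are placeholders.
import numpy as np

def snell_envelope(reward, transition):
    """reward: (T+1, S) stopping reward per time and state.
    transition: (S, S) row-stochastic matrix, assumed time-homogeneous.
    Returns the envelope U and a boolean stop-region mask."""
    U = np.zeros_like(reward)
    U[-1] = reward[-1]                       # U_T = Z_T
    for t in range(reward.shape[0] - 2, -1, -1):
        cont = transition @ U[t + 1]         # E[U_{t+1} | current state]
        U[t] = np.maximum(reward[t], cont)   # U_t = max(Z_t, E[U_{t+1}])
    stop = U <= reward + 1e-12               # stop where immediate reward attains U
    return U, stop
```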



Paperid:2488
Authors:Zhihao Wang, Yiqun Xie, Zhili Li, Xiaowei Jia, Zhe Jiang, Aolin Jia, Shuo Xu
University of Maryland, University of Maryland, University of Maryland, University of Pittsburgh, University of Florida, University of Maryland, University of Maryland
Abstract:
Fairness-awareness has emerged as an essential building block for the responsible use of artificial intelligence in real applications. In many cases, inequity in performance is due to the change in distribution over different regions. While techniques have been developed to improve the transferability of fairness, a solution to the problem is not always feasible when no samples from the new regions are available, which is a bottleneck for purely data-driven attempts. Fortunately, physics-based mechanistic models have been studied for many problems with major social impacts. We propose SimFair, a physics-guided fairness-aware learning framework, which bridges the data limitation by integrating physical-rule-based simulation and inverse modeling into the training design. Using temperature prediction as an example, we demonstrate the effectiveness of the proposed SimFair in fairness preservation.



Paperid:2489
Authors:R. Teal Witter, Lucas Rosenblatt
New York University, New York University
Abstract:
The open streets initiative "opens" streets to pedestrians and bicyclists by closing them to cars and trucks. The initiative, adopted by many cities across North America, increases community space in urban environments. But could open streets also make cities safer and less congested? We study this question by framing the choice of which streets to open as a reinforcement learning problem. In order to simulate the impact of opening streets, we first compare models for predicting vehicle collisions given network and temporal data. We find that a recurrent graph neural network, leveraging the graph structure and the short-term temporal dependence of the data, gives the best predictive performance. Then, with the ability to simulate collisions and traffic, we frame a reinforcement learning problem to find which streets to open. We compare the streets in the open streets initiative to those proposed by a Q-learning algorithm. We find that the streets proposed by the Q-learning algorithm have reliably better outcomes, while streets already selected by the open streets initiative have outcomes similar to randomly selected streets. We present our work as a step toward choosing, in a principled way, which streets to open for safer and less congested cities.
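A minimal sketch of the tabular Q-learning loop implied above is given below; the collision/traffic simulator is abstracted into an assumed `simulate_step` function, and the state encoding (the set of streets opened so far) is an illustrative choice rather than the authors' formulation.

```python
# Hedged sketch of tabular Q-learning for sequentially choosing streets to open.
# `simulate_step(state, street)` is assumed to return (reward, next_state),
# where next_state is a frozenset of opened streets.
import random
from collections import defaultdict

def q_learning(streets, simulate_step, episodes=500, horizon=10,
               alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)                       # Q[(state, street)]
    for _ in range(episodes):
        state = frozenset()                      # no streets opened yet
        for _ in range(horizon):
            candidates = [s for s in streets if s not in state]
            if not candidates:
                break
            if random.random() < eps:            # epsilon-greedy exploration
                action = random.choice(candidates)
            else:
                action = max(candidates, key=lambda s: Q[(state, s)])
            reward, next_state = simulate_step(state, action)
            best_next = max((Q[(next_state, s)] for s in streets
                             if s not in next_state), default=0.0)
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```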



Paperid:2490
Authors:Jonathan Xu, Amna Elmustafa, Liya Weldegebriel, Emnet Negash, Richard Lee, Chenlin Meng, Stefano Ermon, David Lobell
University of Waterloo, Stanford University, Stanford University, Ghent University Mekelle University, Stanford University, Stanford University, Stanford University, Stanford University
Abstract:
Small farms contribute a large share of the productive land in developing countries. In regions such as sub-Saharan Africa, where 80% of farms are small (under 2 ha in size), the task of mapping smallholder cropland is an important part of tracking sustainability measures such as crop productivity. However, the visually diverse and nuanced appearance of small farms has limited the effectiveness of traditional approaches to cropland mapping. Here we introduce a new approach based on the detection of harvest piles characteristic of many smallholder systems throughout the world. We present HarvestNet, a dataset for mapping the presence of farms in the Ethiopian regions of Tigray and Amhara during 2020-2023, collected using expert knowledge and satellite images, totaling 7k hand-labeled images and 2k ground-collected labels. We also benchmark a set of baselines, including SOTA models in remote sensing, with our best models having around 80% classification performance on hand-labeled data and 90% and 98% accuracy on ground truth data for Tigray and Amhara, respectively. We also perform a visual comparison with a widely used pre-existing coverage map and show that our model detects an extra 56,621 hectares of cropland in Tigray. We conclude that remote sensing of harvest piles can contribute to more timely and accurate cropland assessments in food-insecure regions. The dataset can be accessed through https://figshare.com/s/45a7b45556b90a9a11d2, while the code for the dataset and benchmarks is publicly available at https://github.com/jonxuxu/harvest-piles



Paperid:2491
Authors:Xiaofei Xu, Ke Deng, Michael Dann, Xiuzhen Zhang
RMIT University, RMIT University, RMIT University, RMIT University
Abstract:
This study aims to minimize the influence of fake news on social networks by deploying debunkers to propagate true news. This is framed as a reinforcement learning problem, where, at each stage, one user is selected to propagate true news. A challenging issue is episodic reward, where the "net" effect of selecting individual debunkers cannot be discerned from the interleaving information propagation on social networks, and only the collective effect of mitigation efforts can be observed. Existing Self-Imitation Learning (SIL) methods have shown promise in learning from episodic rewards, but are ill-suited to the real-world application of fake news mitigation because of their poor sample efficiency. To learn a more effective debunker selection policy for fake news mitigation, this study proposes NAGASIL - Negative sampling and state Augmented Generative Adversarial Self-Imitation Learning, which consists of two improvements geared towards fake news mitigation: learning from negative samples, and an augmented state representation to capture the "real" environment state by integrating the current observed state with the previous state-action pairs from the same campaign. Experiments on two social networks show that NAGASIL yields superior performance to standard GASIL and state-of-the-art fake news mitigation models.



Paperid:2492
Authors:Zelin Xu, Tingsong Xiao, Wenchong He, Yu Wang, Zhe Jiang, Shigang Chen, Yiqun Xie, Xiaowei Jia, Da Yan, Yang Zhou
University of Florida, Gainesville, FL, USA, University of Florida, Gainesville, FL, USA, University of Florida, Gainesville, FL, USA, University of Florida, Gainesville, FL, USA, University of Florida, Gainesville, FL, USA, University of Florida, Gainesville, FL, USA, The University of Maryland, College Park, MD, USA, University of Pittsburgh, Pittsburgh, PA, USA, Indiana University Bloomington, Bloomington, IN, USA, Auburn University, Auburn, AL, USA
Abstract:
Flood mapping on Earth imagery is crucial for disaster management, but its efficacy is hampered by the lack of high-quality training labels. Given high-resolution Earth imagery with coarse and noisy training labels, a base deep neural network model, and a spatial knowledge base with label constraints, our problem is to infer the true high-resolution labels while training neural network parameters. Traditional methods are largely based on specific physical properties and thus fall short of capturing the rich domain constraints expressed by symbolic logic. Neural-symbolic models can capture rich domain knowledge, but existing methods do not address the unique spatial challenges inherent in flood mapping on high-resolution imagery. To fill this gap, we propose a spatial-logic-aware weakly supervised learning framework. Our framework integrates symbolic spatial logic inference into probabilistic learning in a weakly supervised setting. To reduce the time costs of logic inference on vast high-resolution pixels, we propose a multi-resolution spatial reasoning algorithm to infer true labels while training neural network parameters. Evaluations on real-world flood datasets show that our model outperforms several baselines in prediction accuracy. The code is available at https://github.com/spatialdatasciencegroup/SLWSL.



Paperid:2493
Authors:Kaixun Yang, Mladen Raković, Yuyang Li, Quanlong Guan, Dragan Gašević, Guangliang Chen
Monash University, Monash University, Monash University, Jinan University, Monash University, Monash University
Abstract:
Automatic Essay Scoring (AES) is a well-established educational pursuit that employs machine learning to evaluate student-authored essays. While much effort has been made in this area, current research primarily focuses on either (i) boosting the predictive accuracy of an AES model for a specific prompt (i.e., developing prompt-specific models), which often heavily relies on the use of labeled data from the same target prompt; or (ii) assessing the applicability of AES models developed on non-target prompts to the intended target prompt (i.e., developing AES models in a cross-prompt setting). Given the inherent bias in machine learning and its potential impact on marginalized groups, it is imperative to investigate whether such bias exists in current AES methods and, if identified, how it intervenes with an AES model's accuracy and generalizability. Thus, our study aimed to uncover the intricate relationship between an AES model's accuracy, fairness, and generalizability, contributing practical insights for developing effective AES models in real-world education. To this end, we meticulously selected nine prominent AES methods and evaluated their performance using seven distinct metrics on an open-sourced dataset, which contains over 25,000 essays and demographic information about students, such as gender, English language learner status, and economic status. Through extensive evaluations, we demonstrated that: (1) prompt-specific models tend to outperform their cross-prompt counterparts in terms of predictive accuracy; (2) prompt-specific models frequently exhibit a greater bias towards students of different economic statuses compared to cross-prompt models; (3) in the pursuit of generalizability, traditional machine learning models (e.g., SVM) coupled with carefully engineered features hold greater potential for achieving both high accuracy and fairness than complex neural network models.



Paperid:2494
Authors:Zirui Yuan, Minglai Shao, Zhiqian Chen
Tianjin University, Tianjin University, Mississippi State University
Abstract:
Influence maximization (IM) is the problem of identifying a limited number of initial influential users within a social network to maximize the number of influenced users. However, previous research has mostly focused on individual information propagation, neglecting the simultaneous and interactive dissemination of multiple information items. In reality, when users encounter a piece of information, such as a smartphone product, they often associate it with related products in their minds, such as earphones or computers from the same brand. Additionally, information platforms frequently recommend related content to users, amplifying this cascading effect and leading to multiplex influence diffusion. This paper first formulates the Multiplex Influence Maximization (Multi-IM) problem using multiplex diffusion models with an information association mechanism. In this problem, the seed set is a combination of influential users and information. To effectively manage the combinatorial complexity, we propose Graph Bayesian Optimization for Multi-IM (GBIM). The multiplex diffusion process is thoroughly investigated using a highly effective global kernelized attention message-passing module. This module, in conjunction with Bayesian linear regression (BLR), produces a scalable surrogate model. A data acquisition module incorporating the exploration-exploitation trade-off is developed to optimize the seed set further. Extensive experiments on synthetic and real-world datasets have proven our proposed framework effective. The code is available at https://github.com/zirui-yuan/GBIM.
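To make the Bayesian linear regression surrogate concrete, here is a hedged sketch of a standard BLR posterior with Thompson-style acquisition over candidate seed-set features; it omits the kernelized attention message-passing module and is not the GBIM implementation.

```python
# Standard Bayesian linear regression surrogate with Thompson sampling,
# shown only to illustrate the BLR ingredient; feature construction is assumed.
import numpy as np

def blr_posterior(Phi, y, alpha=1.0, noise=0.1):
    """Phi: (n, d) features of already-evaluated seed sets; y: observed influence."""
    A = alpha * np.eye(Phi.shape[1]) + (Phi.T @ Phi) / noise**2
    cov = np.linalg.inv(A)
    mean = cov @ Phi.T @ y / noise**2
    return mean, cov

def thompson_pick(candidate_feats, mean, cov, rng=np.random.default_rng(0)):
    """Sample one weight vector and pick the candidate seed set that maximizes it."""
    w = rng.multivariate_normal(mean, cov)
    scores = candidate_feats @ w
    return int(np.argmax(scores))
```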



Paperid:2495
Authors:Abdelrahman Zayed, Gonçalo Mordido, Samira Shabanian, Ioana Baldini, Sarath Chandar
Mila Polytechnique Montreal, Mila Polytechnique Montreal, Independent Researcher, IBM Research, Mila Polytechnique Montreal Canada CIFAR AI Chair
Abstract:
The increasing size of large language models (LLMs) has introduced challenges in their training and inference. Removing model components is perceived as a solution to tackle the large model sizes; however, existing pruning methods solely focus on performance, without considering an essential aspect of the responsible use of LLMs: model fairness. It is crucial to address the fairness of LLMs towards diverse groups, such as women, Black people, LGBTQ+ people, and Jewish communities, among others, as these models are deployed and made available to a wide audience. In this work, we first investigate how attention heads impact fairness and performance in pre-trained transformer-based language models. We then propose a novel method to prune the attention heads that negatively impact fairness while retaining the heads critical for performance, i.e., language modeling capabilities. Our approach is practical in terms of time and resources, as it does not require fine-tuning the final pruned, and fairer, model. Our findings demonstrate a reduction in gender bias by 19%, 19.5%, 39.5%, 34.7%, 23%, and 8% for DistilGPT-2, GPT-2, GPT-Neo of two different sizes, GPT-J, and Llama 2 models, respectively, in comparison to the biased model, with only a slight decrease in performance. WARNING: This work uses language that is offensive in nature.
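A simplified sketch of the head-selection idea follows: given externally measured per-head bias and performance impacts (both assumed to be available), keep heads critical for language modeling and prune those that mainly contribute to bias. The thresholds and names are illustrative, not the paper's procedure.

```python
# Illustrative head-selection rule; the bias/performance impact matrices are
# assumed to come from separate measurement passes over the model.
import numpy as np

def select_heads_to_prune(bias_impact, perf_impact, bias_thresh=0.0, keep_top=0.9):
    """bias_impact[l, h] > 0 means removing head (l, h) reduces measured bias.
    perf_impact[l, h] is the drop in LM performance when that head is removed."""
    n_layers, n_heads = bias_impact.shape
    perf_cutoff = np.quantile(perf_impact, keep_top)   # protect performance-critical heads
    prune = {}
    for l in range(n_layers):
        heads = [h for h in range(n_heads)
                 if bias_impact[l, h] > bias_thresh and perf_impact[l, h] < perf_cutoff]
        if heads:
            prune[l] = heads
    return prune  # e.g. pass to a framework's head-pruning utility
```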



Paperid:2496
Authors:Jinwei Zeng, Yu Liu, Jingtao Ding, Jian Yuan, Yong Li
Beijing National Research Center for Information Science and Technology (BNRist) Department of Electronic Engineering, Tsinghua University, China, Beijing National Research Center for Information Science and Technology (BNRist) Department of Electronic Engineering, Tsinghua University, China, Beijing National Research Center for Information Science and Technology (BNRist) Department of Electronic Engineering, Tsinghua University, China, Beijing National Research Center for Information Science and Technology (BNRist) Department of Electronic Engineering, Tsinghua University, China, Beijing National Research Center for Information Science and Technology (BNRist) Department of Electronic Engineering, Tsinghua University, China
Abstract:
On-road transportation accounts for over 20% of total carbon emissions, so the precise estimation of its carbon emissions is crucial for carbon emission monitoring and efficient mitigation policy formulation. However, existing estimation methods typically depend on hard-to-collect individual statistics of vehicle miles traveled to calculate emissions, and thereby suffer from high data collection difficulty. To relieve this issue by utilizing the strong pattern recognition of artificial intelligence, we incorporate two sources of open data representative of the transportation demand and capacity factors, the origin-destination (OD) flow data and the road network data, to build a hierarchical heterogeneous graph learning method for on-road carbon emission estimation (HENCE). Specifically, a hierarchical graph consisting of the road network level, community level, and region level is constructed to model the multi-scale road network-based connectivity and travel connections between spatial areas. Heterogeneous graphs consisting of OD links and spatial links are further built at both the community level and region level to capture the intrinsic interactions between travel demand and road network accessibility. Extensive experiments on two large-scale real-world datasets demonstrate HENCE's effectiveness and superiority, with R-squared exceeding 0.75 and outperforming baselines by 9.60% on average, validating its success in pioneering the use of artificial intelligence to empower carbon emission management and sustainability development. The implementation codes are available at this link: https://github.com/tsinghua-fib-lab/HENCE.



Paperid:2497
Authors:Zijie Zeng, Lele Sha, Yuheng Li, Kaixun Yang, Dragan Gašević, Guangliang Chen
Monash University, Monash University, Monash University, Monash University, Monash University, Monash University
Abstract:
Recent large language models (LLMs), e.g., ChatGPT, can generate human-like and fluent responses when provided with specific instructions. While acknowledging the convenience brought by this technological advancement, educators are concerned that students might leverage LLMs to complete their writing assignments and pass them off as their original work. Although many AI content detection studies have been conducted as a result of such concerns, most of these prior studies modeled AI content detection as a classification problem, assuming that a text is either entirely human-written or entirely AI-generated. In this study, we investigated AI content detection in a rarely explored yet realistic setting where the text to be detected is collaboratively written by humans and generative LLMs (termed hybrid text for simplicity). We first formalized the detection task as identifying the transition points between human-written content and AI-generated content from a given hybrid text (boundary detection). We constructed a hybrid essay dataset by partially and randomly removing sentences from original student-written essays and then instructing ChatGPT to fill in the incomplete essays. We then proposed a two-step detection approach where we (1) separated AI-generated content from human-written content during the encoder training process; and (2) calculated the distances between every two adjacent prototypes (a prototype is the mean of a set of consecutive sentences from the hybrid text in the embedding space) and assumed that the boundaries lie between the two adjacent prototypes that are furthest from each other. Through extensive experiments, we observed the following main findings: (1) the proposed approach consistently outperformed the baseline methods across different experiment settings; (2) the encoder training process (i.e., step 1 of the above two-step approach) can significantly boost the performance of the proposed approach; (3) when detecting boundaries for single-boundary hybrid essays, the proposed approach can be further enhanced by adopting a relatively large prototype size (i.e., the number of sentences needed to calculate a prototype), leading to a 22% improvement (over the best baseline method) in the In-Domain evaluation and an 18% improvement in the Out-of-Domain evaluation.
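The prototype-distance step can be illustrated as follows, assuming sentence embeddings from the trained encoder; the prototype size and distance metric are placeholders, not the paper's exact configuration.

```python
# Sketch of the prototype-distance idea: average consecutive sentence embeddings
# into prototypes and place the boundary between the two adjacent prototypes
# that are furthest apart. Purely illustrative.
import numpy as np

def detect_boundary(sentence_embs, proto_size=3):
    """sentence_embs: (n_sentences, dim) array produced by the encoder."""
    protos = [sentence_embs[i:i + proto_size].mean(axis=0)
              for i in range(0, len(sentence_embs) - proto_size + 1, proto_size)]
    protos = np.stack(protos)
    gaps = np.linalg.norm(protos[1:] - protos[:-1], axis=1)
    k = int(np.argmax(gaps))                 # largest jump between adjacent prototypes
    return (k + 1) * proto_size              # sentence index where the style may switch
```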



Paperid:2498
Authors:Rui Zhang, Dawei Cheng, Jie Yang, Yi Ouyang, Xian Wu, Yefeng Zheng, Changjun Jiang
Tongji University Shanghai Artificial Intelligence Laboratory, Tongji University Key Laboratory of Artificial Intelligence, Ministry of Education Shanghai Artificial Intelligence Laboratory, Tongji University, Tencent YouTu Lab, Tencent YouTu Lab, Tencent YouTu Lab, Tongji University Shanghai Artificial Intelligence Laboratory
Abstract:
Medical insurance fraud has always been a crucial challenge in the healthcare industry. Existing fraud detection models mostly focus on offline learning scenarios. However, fraud patterns constantly evolve, making it difficult for models trained on past data to detect newly emerging fraud patterns, which poses a severe challenge for medical fraud detection. Moreover, current incremental learning models are mostly designed to address catastrophic forgetting, but often exhibit suboptimal performance in fraud detection. To address this challenge, this paper proposes an innovative online learning method for medical insurance fraud detection, named POCL. This method combines contrastive learning pre-training with online updating strategies. In the pre-training stage, we leverage contrastive learning pre-training to learn from historical data, enabling deep feature learning and obtaining rich risk representations. In the online learning stage, we adopt a Temporal Memory Aware Synapses online updating strategy, allowing the model to perform incremental learning and optimization based on continuously emerging new data. This ensures timely adaptation to fraud patterns and reduces forgetting of past knowledge. Our model undergoes extensive experiments and evaluations on real-world insurance fraud datasets. The results demonstrate that our model has significant advantages in accuracy compared to the state-of-the-art baseline methods, while also exhibiting lower running time and space consumption. Our source code is released at https://github.com/finint/POCL.
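For context, the updating strategy builds on Memory Aware Synapses (MAS); the sketch below shows the generic MAS recipe (importance weights from the gradient of the squared output norm plus a quadratic penalty on parameter drift), not the temporal variant proposed in POCL.

```python
# Generic Memory Aware Synapses sketch: importance estimation and penalty term.
# Model, data loader, and hyperparameters are placeholders.
import torch

def mas_importance(model, data_loader):
    omega = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x in data_loader:
        model.zero_grad()
        out = model(x)
        out.pow(2).sum().backward()          # gradient of the squared L2 output norm
        for n, p in model.named_parameters():
            if p.grad is not None:
                omega[n] += p.grad.abs()
        n_batches += 1
    return {n: w / max(n_batches, 1) for n, w in omega.items()}

def mas_penalty(model, omega, old_params, lam=1.0):
    """Quadratic penalty that discourages drifting from previously learned values."""
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (omega[n] * (p - old_params[n]) ** 2).sum()
    return lam * loss
```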



Paperid:2499
Authors:Xin Zhang, Yu Liu, Yuming Lin, Qingmin Liao, Yong Li
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, Department of Electronic Engineering, Tsinghua University, Beijing, China, Department of Electronic Engineering, Tsinghua University, Beijing, China, Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, Department of Electronic Engineering, Tsinghua University, Beijing, China
Abstract:
Urban villages, defined as informal residential areas in or around urban centers, are characterized by inadequate infrastructure and poor living conditions, and are closely related to the Sustainable Development Goals (SDGs) on poverty, adequate housing, and sustainable cities. Traditionally, governments depend heavily on field survey methods to monitor urban villages, which are, however, time-consuming, labor-intensive, and possibly delayed. Thanks to widely available and timely updated satellite images, recent studies have developed computer vision techniques to detect urban villages efficiently. However, existing studies either focus on simple urban village image classification or fail to provide accurate boundary information. To accurately identify urban village boundaries from satellite images, we harness the power of the vision foundation model and adapt the Segment Anything Model (SAM) to urban village segmentation, named UV-SAM. Specifically, UV-SAM first leverages a small-sized semantic segmentation model to produce mixed prompts for urban villages, including mask, bounding box, and image representations, which are then fed into SAM for fine-grained boundary identification. Extensive experimental results on two datasets in China demonstrate that UV-SAM outperforms existing baselines, and identification results over multiple years show that both the number and area of urban villages are decreasing over time, providing deeper insights into the development trends of urban villages and shedding light on vision foundation models for sustainable cities. The dataset and codes of this study are available at https://github.com/tsinghua-fib-lab/UV-SAM.



Paperid:2500
Authors:Yuyao Zhang, Ke Guo, Xiao Zhou
Renmin University of China, Renmin University of China, Renmin University of China
Abstract:
Artificial light plays an integral role in modern cities, significantly enhancing human productivity and the efficiency of civilization. However, excessive illumination can lead to light pollution, which imposes non-negligible economic burdens and poses threats to ecosystems and human health. Despite its critical importance, the exploration of its causes remains relatively limited within the field of artificial intelligence, leaving our understanding of the factors contributing to light pollution incomplete and sustainable illumination planning distant. To address this gap, we introduce a novel framework named Causally Aware Generative Adversarial Networks (CAGAN). This innovative approach aims to uncover the fundamental drivers of light pollution within cities and offer intelligent solutions for optimal illumination resource allocation in the context of sustainable urban development. We commence by examining light pollution across 33,593 residential areas in seven global metropolises. Our findings reveal substantial influences on light pollution levels from various area and building types, with grasslands, commercial centers, and residential buildings as notable contributors. These discovered causal relationships are seamlessly integrated into the generative modeling framework, guiding the process of generating light pollution maps for diverse residential areas. Extensive experiments showcase CAGAN's potential to inform and guide the implementation of effective strategies to mitigate light pollution. Our code and data are publicly available at https://github.com/zhangyuuao/Light_Pollution_CAGAN.



Paperid:2501
Authors:Zonghan Zhang, Zijian Zhang, Zhiqian Chen
Department of Computer Science and Engineering, Mississippi State University, Department of Computer Science and Engineering, Mississippi State University, Department of Computer Science and Engineering, Mississippi State University
Abstract:
Due to the significance of its various applications, source localization has garnered considerable attention as one of the most important means of confronting diffusion hazards. Multi-source localization from a single-snapshot observation is especially relevant due to its prevalence. However, the inherent complexities of this problem, such as limited information, interactions among sources, and dependence on diffusion models, pose challenges to its resolution. Current methods typically utilize heuristics and greedy selection, and they are usually tied to one diffusion model; consequently, their effectiveness is constrained. To address these limitations, we propose a simulation-based method termed BOSouL. Bayesian optimization (BO) is adopted to approximate the results for its sample efficiency. A surrogate function models uncertainty from the limited information. It takes sets of nodes as input instead of individual nodes. BOSouL can incorporate any diffusion model in the data acquisition process through simulations. Empirical studies demonstrate that its performance is robust across graph structures and diffusion models. The code is available at https://github.com/XGraph-Team/BOSouL.
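A greatly simplified illustration of the simulation-based ingredient is shown below: candidate source sets are scored by comparing simulated snapshots against the observed one. The Bayesian-optimization surrogate of BOSouL is omitted, `simulate` is an assumed user-supplied diffusion simulator, and brute-force enumeration is only viable on toy graphs.

```python
# Toy simulation-based scoring of candidate source sets; not the BOSouL method.
import itertools
import numpy as np

def score_source_set(sources, observed_infected, simulate, runs=20):
    """Average Jaccard similarity between simulated and observed snapshots."""
    scores = []
    for _ in range(runs):
        infected = simulate(sources)                     # set of infected nodes
        inter = len(infected & observed_infected)
        union = len(infected | observed_infected)
        scores.append(inter / union if union else 0.0)
    return float(np.mean(scores))

def best_source_set(nodes, k, observed_infected, simulate):
    candidates = itertools.combinations(nodes, k)        # brute force for tiny graphs only
    return max(candidates,
               key=lambda s: score_source_set(set(s), observed_infected, simulate))
```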



Paperid:2502
Authors:Yuying Zhao, Yu Wang, Yi Zhang, Pamela Wisniewski, Charu Aggarwal, Tyler Derr
Vanderbilt University, Vanderbilt University, Vanderbilt University, Vanderbilt University, IBM T. J. Watson Research Center, Vanderbilt University
Abstract:
Online dating platforms have gained widespread popularity as a means for individuals to seek potential romantic relationships. While recommender systems have been designed to improve the user experience in dating platforms by providing personalized recommendations, increasing concerns about fairness have encouraged the development of fairness-aware recommender systems from various perspectives (e.g., gender and race). However, sexual orientation, which plays a significant role in finding a satisfying relationship, is under-investigated. To fill this crucial gap, we propose a novel metric, Opposite Gender Interaction Ratio (OGIR), as a way to investigate potential unfairness for users with varying preferences towards the opposite gender. We empirically analyze a real online dating dataset and observe that existing recommender algorithms could suffer from group unfairness according to OGIR. We further investigate the potential causes for such gaps in recommendation quality, which lead to the challenges of group quantity imbalance and group calibration imbalance. Ultimately, we propose a fair recommender system based on re-weighting and re-ranking strategies to respectively mitigate these associated imbalance challenges. Experimental results demonstrate that both strategies improve fairness, while their combination achieves the best performance in maintaining model utility while improving fairness.
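One plausible way to compute a per-user Opposite Gender Interaction Ratio is sketched below; the exact definition used in the paper may differ, so this is illustrative only.

```python
# Illustrative per-user OGIR computation from (user, partner) interaction pairs.
from collections import defaultdict

def ogir(interactions, gender):
    """interactions: iterable of (user, partner) pairs; gender: dict user -> 'M'/'F'.
    Returns user -> fraction of interactions with the opposite gender."""
    total = defaultdict(int)
    opposite = defaultdict(int)
    for u, v in interactions:
        total[u] += 1
        if gender.get(u) != gender.get(v):
            opposite[u] += 1
    return {u: opposite[u] / total[u] for u in total}
```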



Paperid:2503
Authors:Chengyuan Zhu, Yiyuan Yang, Kaixiang Yang, Haifeng Zhang, Qinmin Yang, C. L. Philip Chen
Zhejiang University, Hangzhou, China, University of Oxford, Oxfordshire, United Kingdom, South China University of Technology, Guangzhou, China, Research Institute of Tsinghua University, Pearl River Delta, Guangzhou, China, Zhejiang University, Hangzhou, China, South China University of Technology, Guangzhou, China
Abstract:
The application of artificial intelligence technology has greatly enhanced and fortified the safety of energy pipelines, particularly in safeguarding against external threats. The predominant methods involve the integration of intelligent sensors to detect external vibration, enabling the identification of event types and locations and thereby replacing manual detection methods. However, practical implementation has exposed a limitation of current methods: their constrained ability to accurately discern the spatial dimensions of external signals, which complicates the authentication of threat events. Our research endeavors to overcome the above issues by harnessing deep learning techniques to achieve a more fine-grained recognition and localization process. This refinement is crucial for effectively identifying genuine threats to pipelines, thus enhancing the safety of energy transportation. This paper proposes a radial threat estimation method for energy pipelines based on distributed optical fiber sensing technology. Specifically, we introduce a continuous multi-view and multi-domain feature fusion methodology to extract comprehensive signal features and construct a threat estimation and recognition network. The utilization of collected acoustic signal data is optimized, and the underlying principle is elucidated. Moreover, we incorporate the concept of transfer learning through a pre-trained model, enhancing both recognition accuracy and training efficiency. Empirical evidence gathered from real-world scenarios underscores the efficacy of our method, notably in its substantial reduction of false alarms and remarkable gains in recognition accuracy. More generally, our method exhibits versatility and can be extrapolated to a broader spectrum of recognition tasks and scenarios.



Paperid:2504
Authors:Zhuang Zhuang, Tianxin Wei, Lingbo Liu, Heng Qi, Yanming Shen, Baocai Yin
Dalian University of Technology Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education, Dalian, China, University of Illinois Urbana Champaign, Pengcheng Laboratory, Dalian University of Technology Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education, Dalian, China, Dalian University of Technology Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education, Dalian, China, Dalian University of Technology Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education, Dalian, China
Abstract:
Next Point-of-Interest (POI) recommendation has proven effective at utilizing sparse, intricate spatial-temporal trajectory data to recommend subsequent POIs to users. While existing methods commonly alleviate the problem of data sparsity by integrating spatial-temporal context information, POI category features, and social relationships, they largely overlook the fact that the trajectory sequences collected in the datasets are often incomplete. This oversight limits the model's potential to fully leverage historical context. In light of this background, we propose Trajectory Data Augmentation with Uncertainty (TAU) for Next POI Recommendation. TAU is a general graph-based trajectory data augmentation method designed to complete user mobility patterns by marrying uncertainty estimation into the next POI recommendation task. More precisely, TAU taps into the global transition pattern graph to identify sets of intermediate nodes located between every pair of locations, effectively leveraging edge weights as transition probabilities. During trajectory sequence construction, TAU selectively inserts intermediate nodes, chosen based on their likelihood of occurrence, as pseudo-labels to establish comprehensive trajectory sequences. Furthermore, to gauge the certainty and impact of pseudo-labels on the target location, we introduce a novel confidence-aware calibration strategy using evidential deep learning (EDL) for improved performance and reliability. The experimental results clearly indicate that our TAU method achieves consistent performance improvements over existing techniques across two real-world datasets, verifying its effectiveness as the state-of-the-art approach to the task.
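The augmentation step can be pictured with the following hedged sketch: for a consecutive check-in pair, rank shared neighbors on the global transition graph by the product of transition probabilities and insert the most likely intermediate node as a pseudo-label. Graph construction and the EDL-based calibration are omitted, and the function name and threshold are illustrative.

```python
# Illustrative pseudo-label insertion between two consecutive check-ins (a, b)
# using a transition-probability graph; not the TAU implementation.
def insert_intermediate(a, b, trans_prob, min_prob=0.01):
    """trans_prob: dict of dicts, trans_prob[x][y] = P(next=y | current=x)."""
    # candidates m with observed transitions a -> m and m -> b
    candidates = set(trans_prob.get(a, {})) & {x for x in trans_prob if b in trans_prob[x]}
    scored = [(m, trans_prob[a][m] * trans_prob[m][b]) for m in candidates]
    scored = [(m, p) for m, p in scored if p >= min_prob]
    if not scored:
        return [a, b]                        # nothing confident enough to insert
    m, _ = max(scored, key=lambda t: t[1])
    return [a, m, b]                         # augmented sub-trajectory
```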



Paperid:2505
Authors:Craig Boutilier, Martin Mladenov, Guy Tennenholtz
Google Research, Google Research, Google Research
Abstract:
Modern recommender systems lie at the heart of complex recommender ecosystems that couple the behavior of users, content providers, vendors, advertisers, and other actors. Despite this, the focus of much recommender systems research and deployment is on the local, myopic optimization of the recommendations made to individual users. This comes at a significant cost to the long-term utility that recommender systems generate for their users. We argue that modeling the incentives and behaviors of these actors, and the interactions among them induced by the recommender system, is needed to maximize value and improve overall ecosystem health. Moreover, we propose the use of economic mechanism design, an area largely overlooked in recommender systems research, as a framework for developing such models. That said, one cannot apply "vanilla" mechanism design to recommender ecosystem modeling and optimization out of the box; the use of mechanism design raises a number of subtle and interesting research challenges. We outline a number of these in this talk (and paper), emphasizing the need to develop nonstandard approaches to mechanism design that intersect with numerous areas of research, including preference modeling, reinforcement learning and exploration, behavioral economics, and generative AI, among others.



Paperid:2506
Authors:Pin-Yu Chen
IBM Research
Abstract:
In data-rich domains such as vision, language, and speech, deep learning prevails in delivering high-performance task-specific models and can even learn general task-agnostic representations for efficient finetuning to downstream tasks. However, deep learning in resource-limited domains still faces multiple challenges, including (i) limited data, (ii) constrained model development cost, and (iii) a lack of adequate pre-trained models for effective finetuning. This paper provides an overview of model reprogramming to bridge this gap. Model reprogramming enables resource-efficient cross-domain machine learning by repurposing and reusing a well-developed pre-trained model from a source domain to solve tasks in a target domain without model finetuning, where the source and target domains can be vastly different. In many applications, model reprogramming outperforms transfer learning and training from scratch. This paper elucidates the methodology of model reprogramming, summarizes existing use cases, provides a theoretical explanation of the success of model reprogramming, and concludes with a discussion on open-ended research questions and opportunities.



Paperid:2507
Authors:Eugene C. Freuder
Insight Centre for Data Analytics, University College Cork, Cork, Ireland
Abstract:
Many problems, from Sudoku to factory scheduling, can be regarded as constraint satisfaction problems. A key component of real world problem solving is a conversation between a constraint programming expert and a problem domain expert to specify the problem to be solved. This presentation argues that the time is ripe for progress in automating the constraint programmer side of this conversation and suggests promising avenues for this pursuit.



Paperid:2508
Authors:Pat Langley
Institute for the Study of Learning and Expertise
Abstract:
This paper poses the challenge of developing and evaluating integrated systems for computational scientific discovery. We note some distinguishing characteristics of discovery tasks, examine eight component abilities, review previous successes at partial integration, and consider hurdles the AI research community must leap to transform the vision for integrated discovery into reality. In closing, we discuss promising scientific domains in which to test such computational artifacts.



Paperid:2509
Authors:Omer Lev
Ben-Gurion University of the Negev
Abstract:
In the last few years, much of the activity of the computational social choice community has focused on novel mechanisms for reaching decisions by large groups of people. While this research makes meaningful scientific contributions, many of these mechanisms are not quite useful in realistic decision-making settings. Moreover, their radicalism ignores the centuries-old experience we have with large-scale human decision-making, and what it teaches us about what works. We believe it is important for the community to engage with mechanisms that are widely used in the real world, as they may hold a key to a deeper understanding of how people reach decisions and of what helps them do so productively. Moreover, letting the community bring its analysis and understanding to these mechanisms will allow for algorithmic suggestions that have some chance of being implemented (and, thus, can contribute to the public debate on these topics). In particular, we highlight the relatively under-investigated role of parties and of groupings of voters and candidates, and the role of executive capacity in analyzing decision-making structures.



Paperid:2510
Authors:Xu Tan, Tao Qin, Jiang Bian, Tie-Yan Liu, Yoshua Bengio
Microsoft Research, Microsoft Research, Microsoft Research, Microsoft Research, Mila University of Montreal
Abstract:
Machine learning methods for conditional data generation usually build a mapping from source conditional data X to target data Y. The target Y (e.g., text, speech, music, image, video) is usually high-dimensional and complex, and contains information that does not exist in the source data, which hinders effective and efficient learning of the source-target mapping. In this paper, we present a learning paradigm called regeneration learning for data generation, which first generates Y' (an abstraction/representation of Y) from X and then generates Y from Y'. During training, Y' is obtained from Y through either handcrafted rules or self-supervised learning and is used to learn X-->Y' and Y'-->Y. Regeneration learning extends the concept of representation learning to data generation tasks, and can be regarded as a counterpart of traditional representation learning, since 1) regeneration learning handles the abstraction (Y') of the target data Y for data generation, while traditional representation learning handles the abstraction (X') of source data X for data understanding; 2) both the process of Y'-->Y in regeneration learning and that of X-->X' in representation learning can be learned in a self-supervised way (e.g., pre-training); 3) both the mapping from X to Y' in regeneration learning and the mapping from X' to Y in representation learning are simpler than the direct mapping from X to Y. We show that regeneration learning can be a widely used paradigm for data generation (e.g., text generation, speech recognition, speech synthesis, music composition, image generation, and video generation) and can provide valuable insights into developing data generation methods.
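As a toy illustration of the data flow (and nothing more), the sketch below trains two placeholder models, one for X to Y' and one for Y' to Y, with a hand-crafted abstraction; the choice of models, data, and abstraction rule are assumptions made only to show the two-stage structure.

```python
# Two-stage data flow of regeneration learning with placeholder models and data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # source conditional data
Y = rng.normal(size=(200, 20))         # high-dimensional target data
Y_abs = Y[:, :4]                       # hand-crafted abstraction Y' of Y (placeholder rule)

stage1 = LinearRegression().fit(X, Y_abs)      # learn X --> Y'
stage2 = LinearRegression().fit(Y_abs, Y)      # learn Y' --> Y

Y_hat = stage2.predict(stage1.predict(X))      # generation path: X --> Y' --> Y
```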



Paperid:2511
Authors:Hadi Hosseini
Pennsylvania State University
Abstract:
Fairness is one of the most desirable societal principles in collective decision-making. It has been extensively studied in the past decades for its axiomatic properties and has received substantial attention from the multiagent systems community in recent years for its theoretical and computational aspects in algorithmic decision-making. However, these studies are often not sufficiently rich to capture the intricacies of human perception of fairness, given the ambivalent nature of real-world problems. We argue that fair solutions should not only be deemed desirable by social planners (designers); they should also be governed by human and societal cognition, consider perceived outcomes based on human judgement, and be verifiable. We discuss how achieving this goal requires a broad transdisciplinary approach ranging from computing and AI to behavioral economics and human-AI interaction. In doing so, we identify shortcomings and long-term challenges of the current literature on fair division, describe recent efforts to address them, and, more importantly, highlight a series of open research directions.



Paperid:2512
Authors:Edith Elkind, Svetlana Obraztsova, Nicholas Teh
University of Oxford Alan Turing Institute, Carleton University, University of Oxford
Abstract:
Multiwinner voting captures a wide variety of settings, from parliamentary elections in democratic systems to product placement on online shopping platforms. There is a large body of work dealing with axiomatic characterizations, computational complexity, and algorithmic analysis of multiwinner voting rules. Although many challenges remain, significant progress has been made in showing the existence of fair and representative outcomes as well as efficient algorithmic solutions for many commonly studied settings. However, much of this work focuses on single-shot elections, even though in numerous real-world settings elections are held periodically and repeatedly. Hence, it is imperative to extend the study of multiwinner voting to temporal settings. Recently, there have been several efforts to address this challenge. However, these works are difficult to compare, as they model multi-period voting in very different ways. We propose a unified framework for studying temporal fairness in this domain, drawing connections with various existing bodies of work and consolidating them within a general framework. We also identify gaps in the existing literature, outline multiple opportunities for future work, and put forward a vision for the future of multiwinner voting in temporal settings.



Paperid:2513
Authors:Shengxin Liu, Xinhang Lu, Mashbat Suzuki, Toby Walsh
Harbin Institute of Technology, Shenzhen, UNSW Sydney, UNSW Sydney, UNSW Sydney
Abstract:
The fair allocation of resources to agents is a fundamental problem in society and has received significant attention and rapid developments from the game theory and artificial intelligence communities in recent years. The majority of the fair division literature can be divided along at least two orthogonal directions: goods versus chores, and divisible versus indivisible resources. In this survey, besides describing the state of the art, we outline a number of interesting open questions in three mixed fair division settings: (i) indivisible goods and chores, (ii) divisible and indivisible goods (i.e., mixed goods), and (iii) fair division of indivisible goods with subsidy.



Paperid:2514
Authors:Mayank Vatsa, Anubhooti Jain, Richa Singh
IIT Jodhpur, India, IIT Jodhpur, India, IIT Jodhpur, India
Abstract:
Recently, transformers have become incredibly popular in computer vision and vision-language tasks. This notable rise in their usage can be primarily attributed to the capabilities offered by attention mechanisms and the outstanding ability of transformers to adapt and apply themselves to a variety of tasks and domains. Their versatility and state-of-the-art performance have established them as indispensable tools for a wide array of applications. However, in the constantly changing landscape of machine learning, the assurance of the trustworthiness of transformers holds utmost importance. This paper conducts a thorough examination of vision-language transformers, employing three fundamental principles of responsible AI: Bias, Robustness, and Interpretability. The primary objective of this paper is to delve into the intricacies and complexities associated with the practical use of transformers, with the overarching goal of advancing our comprehension of how to enhance their reliability and accountability.



Paperid:2515
Authors:Mohammad Abdulaziz
King's College London
Abstract:
Interactive theorem provers (ITPs) are computer programs in which axioms and a conjecture are stated in a formal language, and a user provides the ITP with relatively high-level steps of a formal proof for the conjecture. Then, by invoking automated theorem provers, the ITP tries to generate low-level steps that fill the gaps between the steps provided by the user, thus forming a complete formal proof of the conjecture. The ITP also checks the entire formal proof against the axioms, thus confirming the soundness of all derivations in the formal proof. In this talk, I will discuss the existing opportunities and potential benefits of applying ITPs to reason about and verify AI concepts, algorithms, and software. I will also discuss the challenges we must overcome to apply ITPs in AI and reap those benefits. I will do so by discussing a number of my previous projects on the application of ITPs to different AI concepts, algorithms, and software systems. These projects span different areas of planning (classical planning, temporal planning, and planning under uncertainty) as well as algorithms with applications in algorithmic game theory, like general graph matching and online matching.



Paperid:2516
Authors:Gregor Behnke
University of Amsterdam
Abstract:
Planning is the act of deliberative thinking before acting. It is based on a symbolic model of the world and the options to act in it, usually defined in function-free first-order logic. The task is to find a sequence of actions (a plan) that leads from a given current state to a desired goal state. The basic, purely physical description may be augmented with a partially ordered grammar-like structure (a Hierarchical Task Network or HTN), which can describe expert knowledge, or practical, legal, or operational requirements. In this talk, I will survey a variety of methods for automatically deriving plans using symbolic methods for planning -- from both my past and future research. These symbolic methods -- in some sense -- translate planning problems into other, simpler symbolic representations and reason over them to find plans. As a basis for these methods, I will first introduce relevant theoretical results on planning. First, I will discuss the expressive power of planning formalisms (ECAI'14, ICAPS'16) and second, the computational complexity of HTN planning and related tasks such as HTN plan verification, plan modification, and plan recognition (ICAPS'15, ICAPS'16). Based on these theoretical results, I will explain why SAT-based HTN planning is possible and how it can be implemented. To this end, I will survey several of my publications at top-tier conferences, including papers at ICAPS'17, AAAI'18, AAAI'19, IJCAI'19, AAAI'20, and ICAPS'21 -- in which I developed a highly efficient SAT-based planner for HTN problems, including the ability to find optimal plans, as well as the grounding as a preprocessing step. Here I will also give an outlook on future developments and new ideas that I propose for SAT-based planning -- including the exploitation of structures in plans (e.g., landmarks or operator-counting constraints). Next, I will present the idea of expressing lifted classical planning as SAT (ICAPS'22). The resulting planner LiSAT was the first lifted SAT-based planner -- and proved highly efficient, outperforming all other lifted planners at the time of publication. Notably, LiSAT was the first planner (lifted or grounded), and still is the only one, to solve the challenging OrganicSynthesis benchmark -- and could even prove optimality for all plans. I will also outline future ideas to further improve the efficiency of LiSAT. Lastly, I introduce the notion of planning with symbolic representations (AAAI'21 and ICAPS'23). Here one uses Binary Decision Diagrams to encode large sets of states efficiently. For expressing the additional structure encoded by HTNs, I show how BDDs can be suitably integrated into finite automata. Based on this representation, an efficient and optimal planning algorithm can be derived. Additionally, I show how this algorithm can be extended to also cover oversubscription planning.



Paperid:2517
Authors:Lu Cheng
University of Illinois Chicago
Abstract:
Significant progress in the field of fair machine learning (ML) has been made to counteract algorithmic discrimination against marginalized groups. However, fairness remains an active research area that is far from settled. One key bottleneck is the implicit assumption that the environments in which ML is developed and deployed are certain and reliable. In a world characterized by volatility, uncertainty, complexity, and ambiguity, whether what has been developed in algorithmic fairness can still serve its purpose is far from obvious. In this talk, I will first discuss how to improve algorithmic fairness under two kinds of predictive uncertainty: aleatoric uncertainty (randomness and ambiguity in the data) and epistemic uncertainty (a lack of data or knowledge). The former regards historical bias reflected in the data, and the latter corresponds to bias perpetuated or amplified during model training due to a lack of data or knowledge. In particular, the first work studies pushing the fairness-utility trade-off through aleatoric uncertainty, and the second work investigates fair few-shot learning. The last work introduces coverage-based fairness, which ensures that different groups enjoy identical treatment and receive equal coverage.



Paperid:2518
Authors:Kaize Ding
Northwestern University
Abstract:
My research strives to develop fundamental graph-centric learning algorithms to reduce the need for human supervision in low-resource scenarios. The focus is on achieving effective and reliable data-efficient learning on graphs, which can be summarized into three facets: (1) graph weakly-supervised learning; (2) graph few-shot learning; and (3) graph self-supervised learning.



Paperid:2519
Authors:Xinya Du
University of Texas at Dallas
Abstract:
Neural models, including large language models (LLMs), achieve superior performance on logical reasoning tasks such as question answering. To elicit reasoning capabilities from LLMs, recent works propose using the chain-of-thought (CoT) mechanism to generate both the reasoning chain and the answer, which enhances the model’s capabilities in conducting reasoning. However, due to LLMs’ uninterpretable nature and the extreme flexibility of free-form explanations, several challenges remain, such as inaccurate reasoning, hallucinations, and misalignment with human preferences. In this talk, we will focus on (1) our design of leveraging structured information (that is grounded to the context) for explainable complex question answering and reasoning; and (2) our multi-module interpretable framework for inductive reasoning, which conducts step-wise faithful reasoning with iterative feedback.



Paperid:2520
Authors:Tejas Gokhale
University of Maryland, Baltimore County
Abstract:
Models that learn from data are widely and rapidly being deployed today for real-world use, but they suffer from unforeseen failures due to distribution shift, adversarial attacks, noise and corruption, and data scarcity. But many failures also occur because modern AI tasks require reasoning beyond pattern matching -- and such reasoning abilities are difficult to formulate as data-based input-output function fitting. The reliability problem has become increasingly important under the new paradigm of semantic ``multimodal'' learning. My research provides avenues to develop robust and reliable computer vision systems, particularly by leveraging the interactions between vision and language. In this AAAI New Faculty Highlights talk, I will cover three thematic areas of my research, ranging from robustness in computer vision and open-domain reliability in visual reasoning to challenges and opportunities in the evaluation of generative models. Readers are encouraged to refer to my website (www.tejasgokhale.com) for more details and updates from my lab's activities towards the goal of robust visual understanding.



Paperid:2521
Authors:Yunhui Guo
The University of Texas at Dallas
Abstract:
Building autonomous agents that can process massive amounts of real-time sensor-captured data is essential for many real-world applications, including autonomous vehicles, robotics, and AI in medicine. As the agent often needs to explore a dynamic environment, enabling it to learn over time without performance degradation is a desirable as well as challenging goal. Continual learning aims to build a continual learner which can learn new concepts over the data stream while preserving previously learnt concepts. In this talk, I will survey three pieces of my recent research on continual learning: (i) supervised continual learning, (ii) unsupervised continual learning, and (iii) multi-modal continual learning. In the first work, I will discuss a supervised continual learning algorithm called MEGA which dynamically balances the old tasks and the new task. In the second work, I will discuss unsupervised continual learning algorithms which learn representations continually without access to the labels. In the third work, I will elaborate on an efficient continual learning algorithm that can learn multiple modalities continually without forgetting.



Paperid:2522
Authors:Josiah P. Hanna
The University of Wisconsin -- Madison
Abstract:
A critical challenge for the wide-scale adoption of reinforcement learning (RL) is the need to give domain experts assurance that learned policies will improve decision-making -- and not lead to unacceptable behavior. To meet this challenge, my work aims to develop new methods for offline policy evaluation in real-world RL domains. There has been much recent interest in offline evaluation, and many advances have been made. However, recent benchmarking efforts have also shown that there remains a substantial gap between current state-of-the-art methods and real-world domains such as robotics. Towards scalable offline evaluation, my group is investigating the use of methods for abstraction and representation learning. In this New Faculty Highlights talk, I will present our recent results that show the promise of this direction for scaling offline evaluation in RL domains. I will then describe future directions in this line of work that will further realize the promise of offline policy evaluation for increasing confidence in deployed RL.



Paperid:2523
Authors:Trong Nghia Hoang
Washington State University
Abstract:
The increasingly decentralized and private nature of data in our digital society has motivated the development of personalized, collaborative intelligent systems that enable knowledge aggregation across multiple data owners while accommodating their data privacy and system constraints. However, collaborative learning has only been investigated in simple and limited settings: isolated task scenarios where learning begins from scratch and does not build on prior expertise; the learned model is represented in task-specific forms which are not generalizable to unseen, emerging scenarios; and, more often, a universal model representation is assumed across collaborators, ignoring their local compute constraints or input representations. This restricts its practicality in continual learning scenarios with limited task data, which demand continuous adaptation and knowledge transfer across different information silos, tasks, and learning models, as well as the utilization of prior solution expertise. To overcome these limitations, my research has been focused on developing effective and scalable resource-aware collaborative learning frameworks across heterogeneous systems.



Paperid:2524
Authors:Wei Hu
University of Michigan
Abstract:
Deep learning has exhibited a number of surprising generalization phenomena that are not captured by classical statistical learning theory. This talk will survey some of my work on the theoretical characterizations of several such intriguing phenomena: (1) Implicit regularization: A major mystery in deep learning is that deep neural networks can often generalize well despite their excessive expressive capacity. Towards explaining this mystery, it has been suggested that commonly used gradient-based optimization algorithms enforce certain implicit regularization which effectively constrains the model capacity. (2) Benign overfitting: In certain scenarios, a model can perfectly fit noisily labeled training data, but still achieve near-optimal test error at the same time, which is very different from the classical notion of overfitting. (3) Grokking: In certain scenarios, a model initially achieves perfect training accuracy but no generalization (i.e., no better than a random predictor), and upon further training, transitions to almost perfect generalization. Theoretically establishing these properties often involves making appropriate high-dimensional assumptions on the problem as well as a careful analysis of the training dynamics.



Paperid:2525
Authors:Mengdi Huai
Iowa State University
Abstract:
Recent years have seen a surge in research that develops and applies machine learning algorithms to create intelligent learning systems. However, traditional machine learning algorithms have primarily focused on optimizing accuracy and efficiency, and they often fail to consider how to foster trustworthiness in their design. As a result, machine learning models usually face a trust crisis in real-world applications. Driven by these urgent concerns about trustworthiness, in this talk, I will introduce my research efforts towards the goal of making machine learning trustworthy. Specifically, I will delve into the following key research topics: security vulnerabilities and robustness, model explanations, and privacy-preserving mechanisms.



Paperid:2526
Authors:Wei Jin
Emory University
Abstract:
Many learning tasks in Artificial Intelligence (AI) require dealing with graph data, ranging from biology and chemistry to finance and education. As powerful deep learning tools for graphs, graph neural networks (GNNs) have demonstrated remarkable performance in various graph-related applications. Despite the significant accomplishments of GNNs, recent studies have highlighted that their efficiency and effectiveness face significant challenges such as adversarial robustness and scalability, which are fundamentally linked to data. While major attention has been devoted to improving GNNs from the model perspective, the potential of directly enhancing the data has often been overlooked. This underscores a critical gap in GNN research: while model improvements are undoubtedly important, we also need to recognize and address the data-related factors contributing to the challenges. Hence, my research investigates solutions to these challenges from the data perspective, employing strategies such as data characterization, reduction, augmentation, transformation, and detection.



Paperid:2527
Authors:Ashiqur R. KhudaBukhsh
Rochester Institute of Technology
Abstract:
This talk surveys three related research contributions that shed light on the current US political divide: 1. a novel machine-translation-based framework to quantify political polarization; 2. an analysis of disparate media portrayals of US policing in major cable news outlets; and 3. a novel perspective of vicarious offense that examines a timely and important question -- how well do Democratic-leaning users perceive what content would be deemed offensive by their Republican-leaning counterparts, and vice versa?



Paperid:2528
Authors:Yen-Ling Kuo
University of Virginia
Abstract:
For robots to robustly and flexibly interact with humans, they need to acquire skills that can be used across scenarios. One way to enable the generalization of skills is to learn representations that are useful for downstream tasks. Learning a representation for interactions requires an understanding of both what to interact with (e.g., objects) and how to interact (e.g., actions, controls, and manners). However, most existing language or visual representations mainly focus on objects. To enable robust human-robot interactions, we need a representation that is not just grounded at the object level but also supports reasoning at the action level. The ability to reason about an agent’s own actions and others’ actions will be crucial for long-tail interactions. My research focuses on leveraging the compositional nature of language and reward functions to learn representations that generalize to novel scenarios. Together with information from multiple modalities, the learned representation can reason about task progress, future behaviors, and the goals/beliefs of an agent. The above ideas have been demonstrated in my research on building robots to understand language and engage in social interactions.



Paperid:2529
Authors:Fanghui Liu
University of Warwick
Abstract:
The conventional wisdom favoring simple models in machine learning misses the bigger picture, especially for over-parameterized neural networks (NNs), where the number of parameters is much larger than the number of training data points. Our goal is to explore the mystery behind over-parameterized models from a theoretical side. In this talk, I will discuss the role of over-parameterization in neural networks, to theoretically understand why they can perform well. First, I will discuss the role of over-parameterization in neural networks from the perspective of models, to theoretically understand why they can generalize well. Second, the effects of over-parameterization on robustness and privacy are discussed. Third, I will talk about over-parameterization from kernel methods to neural networks from a function-space-theory view. Moving from classical statistical learning to sequential decision-making, I will also talk about the benefits of over-parameterization in how deep reinforcement learning works well for function approximation. Potential future directions in the theory of over-parameterized ML will also be discussed.



Paperid:2530
Authors:Mingrui Liu
George Mason University
Abstract:
The current analysis of federated optimization algorithms for training deep neural networks assumes that the data is non-sequential (e.g., images), which incurs a smooth loss objective. In contrast, edge devices generate lots of sequential data every day, and these sequences exhibit significant sequential correlation across different time stamps (e.g., text messages). In order to learn from such sequential data, people typically use a class of neural networks that is inherently non-smooth, with a potentially unbounded smoothness parameter. Examples include recurrent neural networks, long short-term memory networks, and transformers. It remains unclear how to design provably efficient algorithms for training these neural networks to learn from sequential data. My goal is to lay the algorithmic foundation of federated learning with sequential data, contributing novel algorithms for learning from a range of real-world sequential data (e.g., natural language, electronic health records, transportation, time series, etc.) using state-of-the-art deep neural networks. In this talk, I will first motivate the problem by showing that the transformer, which is widely used for sequential data learning, has a loss landscape with unbounded smoothness. Then, I will introduce provably efficient federated deep learning algorithms in the presence of unbounded smoothness. In particular, I will introduce a few efficient algorithms for various settings of federated learning, including homogeneous data, heterogeneous data, and partial client participation. The main result is twofold. First, we show that the designed algorithms provably achieve small computational and communication complexities. Second, we establish fundamental hardness results in the unbounded smoothness setting. Ultimately, I will discuss the future challenges of extending our research framework from small-scale neural networks to large language models.



Paperid:2531
Authors:Jing Ma
Case Western Reserve University
Abstract:
Graphs (i.e., networks) are ubiquitous in daily life, as they can effectively model a plethora of real-world systems with connected units, such as social networks and biological networks. Recent years have witnessed rapid development in graph-based machine learning (GML) in various high-impact domains. Currently, the mainstream GML methods are based on statistical learning, e.g., utilizing the statistical correlations between node features, graph structure, and labels for node classification. However, statistical learning has been widely criticized for only capturing the superficial relations between variables in the data system, and consequently, resulting in a lack of trustworthiness in real-world applications. Therefore, it is crucial to understand the causality in the data system and the learning process. Causal inference is the discipline that investigates the causality inside a system, for example, to identify and estimate the causal effect of a certain treatment (e.g., wearing a face mask) on an important outcome (e.g., COVID-19 infection). Involving the concepts and philosophy of causal inference in ML methods is often considered significant for human-level intelligence and can serve as a foundation of artificial intelligence (AI). However, most traditional causal inference studies rely on strong assumptions and focus on independent and identically distributed (i.i.d.) data, while causal inference on graphs faces many barriers. Therefore, we aim to bridge the gap between causal inference and GML.



Paperid:2532
Authors:Pranava Madhyastha
City, University of London
Abstract:
Language acquisition and utilization transcend the mere exchange of lexical units. Visual cues, prosody, gestures, body movements, and context play an undeniably crucial role. Humans naturally communicate multimodally, employing multiple channels and synthesizing information from diverse modalities. My research delves into the characterization and construction of multimodal models that seamlessly integrate data from multiple independent modalities. I will cover recent work that highlights the challenges, achievements, and opportunities towards developing capable multimodal discursive models.



Paperid:2533
Authors:Giuseppe Marra
KU Leuven
Abstract:
The integration of learning and reasoning is one of the key challenges in artificial intelligence and machine learning today. The area of NeuroSymbolic AI (NeSy) tackles this challenge by integrating symbolic reasoning with neural networks. In our recent work, we provided an introduction to NeSy by drawing several parallels to another field that has a rich tradition in integrating learning and reasoning, namely Statistical Relational Artificial Intelligence (StarAI).



Paperid:2534
Authors:Christoforos Mavrogiannis
University of Michigan
Abstract:
The integration of advances from machine learning and computer vision with the classical autonomy stack has brought successful robot deployments in fulfilment, manufacturing, and transportation. However, unstructured and dynamic environments such as pedestrian spaces and streets, workplaces, and homes pose additional challenges such as modeling human behavior, understanding user perceptions, and ensuring human safety and comfort. My work addresses such challenges to enable robots to fluently work with and around people to increase productivity and assist users.



Paperid:2535
Authors:Alberto Maria Metelli
Politecnico di Milano
Abstract:
Inverse reinforcement learning (IRL) has seen significant advancements in recent years. This class of approaches aims to efficiently learn the underlying reward function that rationalizes the behavior exhibited by expert agents, often represented by humans. In contrast to mere behavioral cloning, the reconstruction of a reward function yields appealing implications, as it allows for more effective interpretability of the expert’s decisions and provides a transferable specification of the expert’s objectives for application in even different environments. Unlike the well-understood field of reinforcement learning (RL) from a theoretical perspective, IRL still grapples with limited understanding, significantly constraining its applicability. A fundamental challenge in IRL is the inherent ambiguity in selecting a reward function, given the existence of multiple candidate functions, all explaining the expert’s behavior. In this talk, I will survey three of my papers that have made notable contributions to the IRL field: “Provably Efficient Learning of Transferable Rewards”, “Towards Theoretical Understanding of Inverse Reinforcement Learning”, and “Inverse Reinforcement Learning with Sub-optimal Experts”. The central innovation introduced by the first paper is a novel formulation of the IRL problem that overcomes the issue of ambiguity. IRL is reframed as the problem of learning the feasible reward set, which is the set of all rewards that can explain the expert’s behavior. This approach postpones the selection of the reward function, thereby circumventing the ambiguity issues. Furthermore, the feasible reward set exhibits convenient geometric properties that enable the development of efficient algorithms for its computation. Building on this novel formulation of IRL, the second paper addresses the problem of efficiently learning the feasible reward set when the environment and the expert’s policy are not known in advance. It introduces a novel way to assess the dissimilarity between feasible reward sets based on the Hausdorff distance and presents a new PAC (probably approximately correct) framework. The most significant contribution of this paper is the introduction of the first sample complexity lower bound, which highlights the challenges inherent in the IRL problem. Deriving this lower bound necessitated the development of novel technical tools. The paper also demonstrates that when a generative model of the environment is available, a uniform sampling strategy achieves a sample complexity that matches the lower bound, up to logarithmic factors. Finally, in the third paper, the IRL problem in the presence of sub-optimal experts is investigated. Specifically, the paper assumes the availability of multiple sub-optimal experts, in addition to the expert agent, which provide additional demonstrations associated with a known quantification of the maximum amount of sub-optimality. The paper shows that this richer information mitigates the ambiguity problem, significantly reducing the size of the feasible reward set while retaining its favorable geometric properties. Furthermore, the paper explores the associated statistical problem and derives novel lower bounds for sample complexity, along with almost matching algorithms. These selected papers represent notable advancements in IRL, contributing to the establishment of a solid theoretical foundation for IRL and extending the framework to accommodate scenarios with sub-optimal experts.
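For readers unfamiliar with the notion invoked above: the abstract uses the Hausdorff distance to compare feasible reward sets but does not state it. A standard formulation, with \mathcal{R} and \hat{\mathcal{R}} denoting the exact and estimated feasible reward sets and d a base metric on reward functions (all purely notational assumptions here, not necessarily the paper's exact definitions), is

    d_H(\mathcal{R}, \hat{\mathcal{R}}) = \max\Big\{ \sup_{r \in \mathcal{R}} \inf_{\hat{r} \in \hat{\mathcal{R}}} d(r, \hat{r}), \; \sup_{\hat{r} \in \hat{\mathcal{R}}} \inf_{r \in \mathcal{R}} d(r, \hat{r}) \Big\}.

Under a dissimilarity of this kind, PAC guarantees such as those described above bound the distance between the learned and true feasible sets with high probability.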



Paperid:2536
Authors:Melanie Weber
Harvard University
Abstract:
A key challenge in Machine Learning (ML) is the identification of geometric structure in high-dimensional data. Most algorithms assume that data lives in a high-dimensional vector space; however, many applications involve non-Euclidean data, such as graphs, strings, and matrices, or data whose structure is determined by symmetries in the underlying system. Here, we discuss methods for identifying geometric structure in data and how leveraging data geometry can give rise to efficient ML algorithms with provable guarantees.



Paperid:2537
Authors:Tsui-Wei (Lily) Weng
UCSD
Abstract:
Deep neural networks (DNNs) have achieved unprecedented success across many scientific and engineering fields over the last few decades. Despite this empirical success, unfortunately, recent studies have shown that there are various failure modes and blindspots in DNN models which may result in unexpected and serious failures and potential harms, e.g., the existence of adversarial examples and small perturbations. This is not acceptable, especially for safety-critical and high-stakes applications in the real world, including healthcare, self-driving cars, aircraft control systems, hiring, and malware detection protocols. Moreover, it has been challenging to understand why and when DNNs will fail due to their complicated structures and black-box behaviors. The lack of interpretability is one critical issue that may seriously hinder the deployment of DNNs in high-stakes applications, which need interpretability to trust the prediction, to understand potential failures, and to be able to mitigate harms and eliminate biases in the model. To make DNNs trustworthy and reliable for deployment, it is necessary and urgent to develop methods and tools that can (i) quantify and improve their robustness against adversarial and natural perturbations, and (ii) understand their underlying behaviors and further correct errors to prevent injuries and damages. These are the important first steps to enable Trustworthy AI and Trustworthy Machine Learning. In this talk, I will survey a series of research efforts in my lab contributing to tackling the grand challenges in (i) and (ii). In the first part of my talk, I will overview our research effort in Robust Machine Learning since 2017, where we have proposed the first attack-agnostic robustness evaluation metric, the first efficient robustness certification algorithms for various types of perturbations, and efficient robust learning algorithms from supervised learning to deep reinforcement learning. In the second part of my talk, I will survey a series of exciting results in my lab on accelerating interpretable machine learning and explainable AI. Specifically, I will show how we could bring interpretability into deep learning by leveraging recent advances in multi-modal models. I'll present recent works in our group on automatically dissecting neural networks with open-vocabulary concepts, designing interpretable neural networks without concept labels, and briefly overview our recent efforts on demystifying the black-box DNN training process, automated neuron explanations for Large Language Models, and the first robustness evaluation of a family of neuron-level interpretation techniques.



Paperid:2538
Authors:Huaxiu Yao
University of North Carolina at Chapel Hill
Abstract:
The real-world deployment of machine learning algorithms often poses challenges due to shifts in data distributions and tasks. These shifts can lead to a degradation in model performance, as the model may not have encountered such changes during training. Additionally, they can make it difficult for the model to generalize to new scenarios and can result in poor performance in real-world applications. In this talk, I will present our research on building machine learning models that are highly generalizable and easily adaptable to different shifts. Specifically, I will first discuss our approach to improving out-of-distribution robustness and mitigating spurious correlations by training environment-invariant models through selective augmentation and post-hoc rectification. Second, I will present our techniques for continuous and rapid adaptation of models to new tasks and environments. This includes methods to facilitate compositional generalization and adaptation by extracting relationships from historical observations and to enhance reliable adaptation even in the face of imperfect observations. Additionally, I will showcase our successful practices for addressing shifts in real-world applications, such as in the healthcare, e-commerce, and transportation industries. The talk will also touch upon the remaining challenges and outline future research directions in this area.



Paperid:2539
Authors:Quanming Yao
Tsinghua University
Abstract:
Relational structured data is a way of representing knowledge using nodes and edges, while also capturing the meaning of that knowledge in a structured form that can be used for machine learning. Compared with vision and natural language data, relational structured data represents and manipulates structured knowledge, which can be beneficial for tasks that involve reasoning or inference. On the other hand, vision and NLP deal more with unstructured data (like images and text), and they often require different types of models and algorithms to extract useful information or features from the data. Human-like Learning develops methods that can harness relational structures and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. With Human-like Learning, the learning algorithm is efficient and can adapt to new or unseen situations, which is crucial in real-world applications where environments may change unpredictably. Moreover, the models are easier for humans to understand and interpret, which is important for transparency and trust in AI systems. In this talk, we present our recent attempts towards human-like learning from relational structured data.



Paperid:2540
Authors:Wenbin Zhang
Florida International University
Abstract:
Recent works in artificial intelligence fairness attempt to mitigate discrimination by proposing constrained optimization programs that achieve parity for some fairness statistics. Most assume the availability of class labels, which is impractical in many real-world applications such as precision medicine, actuarial analysis, and recidivism prediction. To this end, this talk revisits fairness and reveals idiosyncrasies of the existing fairness literature that assumes the availability of class labels, limiting its real-world utility. The primary artifacts are a formulation of fairness with censorship to account for scenarios where the class label is not guaranteed, and a suite of corresponding new fairness notions, algorithms, and theoretical constructs to bridge the gap between the design of a ``fair'' model in the lab and its deployment in the real world.



Paperid:2541
Authors:Han Zhao
University of Illinois Urbana-Champaign
Abstract:
In this talk I will discuss our recent work on characterizing the inherent tradeoff between fairness and accuracy in both classification and regression problems. I will also present a postprocessing algorithm that derives optimal fair predictors from Bayes score functions.
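The abstract does not spell out the form of the post-processed predictor. One common shape such a derivation takes, with \eta_a(x) denoting the group-conditional Bayes score and t_a a group-dependent threshold (both notational assumptions here, not necessarily the paper's construction), is

    \hat{Y}(x, a) = \mathbb{1}\{\eta_a(x) \ge t_a\},

where the thresholds t_a are chosen to equalize the targeted fairness statistic across groups while minimizing the resulting loss in accuracy.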



Paperid:2542
Authors:Yue Zhao
University of Southern California
Abstract:
Anomaly detection (AD), often termed outlier detection, is a key machine learning (ML) task, aiming to identify uncommon yet crucial patterns in data. With the increasing complexity of the modern world, the applications of AD span wide—from NASA's spacecraft monitoring to early patient prioritization at University of Pittsburgh Medical Center. Technology giants like Google and Amazon also leverage AD for service disruption identification. Here, I will traverse my AD works with promising new directions, particularly emphasizing reproducible benchmarks (Part 1), automated algorithms (Part 2), and scalable systems (Part 3).



Paperid:2543
Authors:Dawei Zhou
Virginia Tech, VA, USA
Abstract:
Recent years have witnessed a dramatic increase in a class of security threats known as "insider threats". These threats occur when individuals with authorized access to an organization's network engage in harmful activities, potentially leading to the disclosure of vital information or adversely affecting the organization's systems (e.g., financial loss, system crashes, and national security challenges). Distinct from other types of terror attacks, combating insider threats exhibits several unique challenges, including (1) rarity, (2) non-separability, (3) label scarcity, (4) dynamics, and (5) heterogeneity, making them extremely difficult to identify and mitigate. We target the challenging problem of combating insider threats in open-world environments by leveraging a variety of data sources (e.g., internal system logs, employee networks, human trafficking, and smuggling networks). To effectively combat these intricate threats, we introduce an interactive learning mechanism that is composed of three mutually beneficial learning modules: insider identification, insider monitoring, and data augmentation. Each module plays a crucial role in enhancing our ability to detect and mitigate insider threats, thereby contributing to a more secure and resilient organizational environment.



Paperid:2544
Authors:Zainab Akhtar, Umair Qazi, Aya El-Sakka, Rizwan Sadiq, Ferda Ofli, Muhammad Imran
Qatar Computing Research Institute, HBKU, Doha, Qatar, Qatar Computing Research Institute, HBKU, Doha, Qatar, Qatar Computing Research Institute, HBKU, Doha, Qatar, Central Asian University, Tashkent, Uzbekistan, Qatar Computing Research Institute, HBKU, Doha, Qatar, Qatar Computing Research Institute, HBKU, Doha, Qatar
Abstract:
The absence of comprehensive situational awareness information poses a significant challenge for humanitarian organizations during their response efforts. We present Flood Insights, an end-to-end system that ingests data from multiple non-traditional data sources such as remote sensing, social sensing, and geospatial data. We employ state-of-the-art natural language processing and computer vision models to identify flood exposure, ground-level damage and flood reports, and, most importantly, urgent needs of affected people. We deployed and tested the system during a recent real-world catastrophe, the 2022 Pakistan floods, to surface critical situational and damage information at the district level. We validated the system's effectiveness through geographic regression analysis using official ground-truth data, showcasing its strong performance and explanatory power. Moreover, the system was commended by the United Nations Development Programme stationed in Pakistan, as well as local authorities, for pinpointing hard-hit districts and enhancing disaster response.



Paperid:2545
Authors:Jayachandu Bandlamudi, Kushal Mukherjee, Prerna Agarwal, Ritwik Chaudhuri, Rakesh Pimplikar, Sampath Dechu, Alex Straley, Anbumunee Ponniah, Renuka Sindhgatta
IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM Software, IBM Software, IBM Research
Abstract:
In the realm of business automation, digital assistants/chatbots are emerging as the primary method for making automation software accessible to users in various business sectors. Access to automation primarily occurs through APIs and RPAs. To effectively convert APIs and RPAs into chatbots on a larger scale, it is crucial to establish an automated process for generating data and training models that can recognize user intentions, identify questions for conversational slot filling, and provide recommendations for subsequent actions. In this paper, we present a technique for enhancing and generating natural language conversational artifacts from API specifications using large language models (LLMs). The goal is to utilize LLMs in the "build" phase to assist humans in creating skills for digital assistants. As a result, the system doesn't need to rely on LLMs during conversations with business users, leading to efficient deployment. Experimental results highlight the effectiveness of our proposed approach. Our system is deployed in the IBM Watson Orchestrate product for general availability.



Paperid:2546
Authors:Jiří Bednář, Jakub Náplava, Petra Barančíková, Ondřej Lisický
Seznam.cz, Seznam.cz, Seznam.cz, Seznam.cz
Abstract:
This article focuses on the development and evaluation of small-sized Czech sentence embedding models. Small models are important components for real-time industry applications in resource-constrained environments. Given the limited availability of labeled Czech data, alternative approaches, including pre-training, knowledge distillation, and unsupervised contrastive fine-tuning, are investigated. Comprehensive intrinsic and extrinsic analyses are conducted, showcasing the competitive performance of our models compared to significantly larger counterparts, with approximately 8 times smaller size and 5 times faster speed than conventional Base-sized models. To promote cooperation and reproducibility, both the models and the evaluation pipeline are made publicly accessible. Ultimately, this article presents practical applications of the developed sentence embedding models in Seznam.cz, the Czech search engine. These models have effectively replaced previous counterparts, enhancing the overall search experience, for instance, in organic search, featured snippets, and image search. This transition has yielded improved performance.
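To make the knowledge-distillation component concrete, the sketch below shows one training step in which a small student encoder is fitted to reproduce a frozen teacher's sentence embeddings with an MSE objective. The encoders and dimensions are stand-ins only (the paper's actual architectures, and its pre-training and contrastive fine-tuning stages, are not reproduced here).

    import torch

    # Minimal distillation step (hypothetical stand-in encoders, not the paper's models).
    feature_dim, emb_dim = 512, 256
    student = torch.nn.Linear(feature_dim, emb_dim)        # small student encoder (stand-in)
    teacher_head = torch.nn.Linear(feature_dim, emb_dim)   # teacher encoder (stand-in, never updated)
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

    def distillation_step(features):
        with torch.no_grad():
            target = teacher_head(features)                # teacher embeddings, no gradient
        loss = torch.nn.functional.mse_loss(student(features), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # One toy step on random features standing in for tokenised sentences.
    print(distillation_step(torch.randn(8, feature_dim)))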



Paperid:2547
Authors:Thibault Falque, Gilles Audemard, Christophe Lecoutre, Bertrand Mazure
Exakis Nelite CRIL, CNRS & Université d'Artois, France, CRIL, CNRS & Université d'Artois, France, CRIL, CNRS & Université d'Artois, France, CRIL, CNRS & Université d'Artois, France
Abstract:
More than ever, air transport players (i.e., airline and airport companies) in an intensely competitive climate need to benefit from a carefully optimized management of airport resources to improve the quality of service and control the induced costs. In this paper, we investigate the Airport Check-in Desk Assignment Problem. We propose a Constraint Programming (CP) model for this problem and present some promising experimental results on data from ADP (Aéroport de Paris). Our work has been deployed in a pre-production environment for one year.



Paperid:2548
Authors:Kyoung Jun Lee, Baek Jeong, Suhyeon Kim, Dam Kim, Dongju Park
Kyung Hee University Harex InfoTech, Kyung Hee University, Kyung Hee University, Harex InfoTech, Harex InfoTech
Abstract:
One of the most crucial capabilities in the commercial sector is the personalized prediction of a customer's next purchase. We present a novel method of creating a commerce intelligence engine that caters to multiple merchants, intended for the UB Platform managed by the e-payment company Harex InfoTech. To cultivate this intelligence, we utilized payment receipt data and created a Natural Language Processing (NLP)-based commerce model using a Transformer to accommodate multinational and merchant trade. Our model, called General Commerce Intelligence (GCI), provides a range of services for merchants, including product recommendations, product brainstorming, product bundling, event promotions, collaborative marketing, target marketing, and demand forecasting, etc. To bolster user privacy and foster sustainable business collaboration, especially among micro-, small-, and medium-sized enterprises (MSMEs), the GCI model was trained through federated learning, especially with glocalization. This study delves into the structure, development, and assessment of GCI, showcasing its transformative capacity to implement User Centric AI and reshape the global commerce landscape to benefit MSMEs.



Paperid:2549
Authors:Kyungsik Lee, Hana Yoo, Sumin Shin, Wooyoung Kim, Yeonung Baek, Hyunjin Kang, Jaehyun Kim, Kee-Eung Kim
Hyundai Capital, Hyundai Capital, Hyundai Capital, Hyundai Capital, Hyundai Capital, Hyundai Capital, Hyundai Capital, KAIST
Abstract:
In the field of finance, the underwriting process is an essential step in evaluating every loan application. During this stage, the borrowers' creditworthiness and ability to repay the loan are assessed to ultimately decide whether to approve the loan application. One of the core components of underwriting is credit scoring, in which the probability of default is estimated. As such, there has been significant progress in enhancing the predictive accuracy of credit scoring models through the use of machine learning, but there still exists a need to ultimately construct an approval rule that takes into consideration additional criteria beyond the score itself. This construction process is traditionally done manually to ensure that the approval rule remains interpretable to humans. In this paper, we outline an automated system for optimizing a rule-based system for approving loan applications, which has been deployed at Hyundai Capital Services (HCS). The main challenge lay in creating a high-quality rule base that is simultaneously simple enough to be interpretable by risk analysts as well as customers, since the approval decision should be accountable. We addressed this challenge through principled submodular optimization. The deployment of our system has led to a 14% annual growth in the volume of loan services at HCS, while maintaining the target bad rate, and has resulted in the approval of customers who might have otherwise been rejected.
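The abstract attributes the rule-base construction to principled submodular optimization without giving details. The sketch below is only an illustrative greedy selection of rules under a cardinality budget; the coverage objective, candidate rules, and data structures are all hypothetical, not HCS's actual system.

    # Greedy selection of an interpretable rule base (hypothetical data structures).
    # Each candidate rule is mapped to the set of historical applications it decides correctly;
    # set coverage is monotone submodular, so greedy selection gives a (1 - 1/e) approximation.

    def greedy_rule_base(candidate_rules, budget):
        """candidate_rules: dict mapping rule name -> set of application ids it handles correctly."""
        selected, covered = [], set()
        for _ in range(budget):
            best_rule, best_gain = None, 0
            for name, cover in candidate_rules.items():
                if name in selected:
                    continue
                gain = len(cover - covered)          # marginal coverage gain of adding this rule
                if gain > best_gain:
                    best_rule, best_gain = name, gain
            if best_rule is None:                    # no remaining rule adds coverage
                break
            selected.append(best_rule)
            covered |= candidate_rules[best_rule]
        return selected

    rules = {
        "income_above_threshold": {1, 2, 3, 5},
        "no_recent_default": {2, 3, 4},
        "employment_over_2_years": {5, 6},
    }
    print(greedy_rule_base(rules, budget=2))   # ['income_above_threshold', 'no_recent_default']

Keeping the budget small is what preserves interpretability: the selected rules remain few enough for risk analysts and customers to read directly.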



Paperid:2550
Authors:Chang Liu, Peng Hou, Anxiang Zeng, Han Yu
Shopee Pte. Ltd., Singapore School of Computer Science and Engineering, Nanyang Technological University, Singapore, Shopee Pte. Ltd., Singapore, School of Computer Science and Engineering, Nanyang Technological University, Singapore, School of Computer Science and Engineering, Nanyang Technological University, Singapore
Abstract:
Over the past decade, significant advances have been made in the field of image search for e-commerce applications. Traditional image-to-image retrieval models, which focus solely on image details such as texture, tend to overlook useful semantic information contained within the images. As a result, the retrieved products might possess similar image details but fail to fulfil the user's search goals. Moreover, the use of image-to-image retrieval models for products containing multiple images results in significant online product feature storage overhead and complex mapping implementations. In this paper, we report the design and deployment of the proposed Multi-modal Item Embedding Model (MIEM) to address these limitations. It is capable of utilizing both textual information and multiple images about a product to construct meaningful product features. By leveraging semantic information from images, MIEM effectively supplements the image search process, improving the overall accuracy of retrieval results. MIEM has become an integral part of the Shopee image search platform. Since its deployment in March 2023, it has achieved a remarkable 9.90% increase in terms of clicks per user and a 4.23% boost in terms of orders per user for the image search feature on the Shopee e-commerce platform.
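As a minimal illustration of the idea of collapsing a product's text and multiple images into a single retrieval vector, the sketch below pools image embeddings and concatenates them with a text embedding. The fusion scheme, names, and dimensions are assumptions for illustration; MIEM's actual architecture is not described here.

    import numpy as np

    # Illustrative late fusion of one product's text embedding with several image embeddings
    # (hypothetical dimensions; MIEM's real encoders and fusion layers are not reproduced).

    def fuse_item_embedding(text_emb, image_embs):
        img = np.mean(image_embs, axis=0)                 # pool multiple images into one vector
        fused = np.concatenate([text_emb, img])           # simple fusion by concatenation
        return fused / (np.linalg.norm(fused) + 1e-8)     # L2-normalise for cosine retrieval

    text = np.random.rand(128)
    images = [np.random.rand(128) for _ in range(3)]
    item_vec = fuse_item_embedding(text, images)          # single vector stored per product

Storing one fused vector per product, rather than one vector per image, is what removes the storage and mapping overhead mentioned in the abstract.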



Paperid:2551
Authors:Anqi Lu, Zifeng Wu, Zheng Jiang, Wei Wang, Eerdun Hasi, Yi Wang
Beijing University of Posts and Telecommunications, Beijing Normal University, Beijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications, Beijing Normal University, Beijing University of Posts and Telecommunications
Abstract:
Visual interpretation is extremely important in human geography as the primary technique for geographers to use photograph data in identifying, classifying, and quantifying geographic and topological objects or regions. However, it is also time-consuming and requires overwhelming manual effort from professional geographers. This paper describes our interdisciplinary team's efforts to integrate computer vision models with geographers' visual image interpretation process to reduce their workload in interpreting images. Focusing on the dune segmentation task, we proposed an approach featuring a deep dune segmentation model to identify dunes and label their ranges in an automated way. By developing a tool to connect our model with ArcGIS, one of the most popular workbenches for visual interpretation, geographers can further refine the automatically generated dune segmentation on images without learning any CV or deep learning techniques. Our approach thus realizes a non-invasive change to geographers' visual interpretation routines, reducing their manual effort while incurring minimal interruption to the work routines and tools they are familiar with. Deployment with a leading Chinese geography research institution demonstrated the potential of our approach in supporting geographers in researching and solving dryland desertification.



Paperid:2552
Authors:Sheng Jie Lui, Cheng Xiang, Shonali Krishnaswamy
National University of Singapore AiDA Technologies, National University of Singapore, AiDA Technologies
Abstract:
Automating the processing of health insurance claims to achieve "Straight-Through Processing" is one of the holy grails that all insurance companies aim to achieve. One of the major impediments to this automation is the difficulty in establishing the relationship between the underwriting exclusions that a policy has and the incoming claim's diagnosis information. Typically, policy underwriting exclusions are captured in free text such as "Respiratory illnesses are excluded due to a pre-existing asthma condition". A medical claim coming from a hospital would have the diagnosis represented using the International Classification of Disease (ICD) codes from the World Health Organization. The complex and labour-intensive task of establishing the relationship between free-text underwriting exclusions in health insurance policies and medical diagnosis codes from health insurance claims is critical to determining whether a claim should be rejected due to underwriting exclusions. In this work, we present a novel framework that leverages both explicit and implicit domain knowledge, present in medical ontologies and pre-trained language models respectively, to effectively establish the relationship between free text describing medical conditions in underwriting exclusions and the ICD-10-CM diagnosis codes in health insurance claims. Termed KAMEL (Knowledge Aware Medical Entity Linkage), our proposed framework addresses the limitations faced by prior approaches when evaluated on real-world health insurance claims data. Our proposed framework has been deployed at several multinational health insurance providers to automate their health insurance claims processing.
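To make the implicit-knowledge side of the linkage concrete, the sketch below scores a free-text exclusion against candidate ICD-10-CM code descriptions purely by pre-trained sentence-embedding similarity. This is only the language-model half of the idea: KAMEL additionally incorporates explicit medical-ontology knowledge, which is omitted here, and the model name and candidate codes are placeholders chosen for illustration.

    from sentence_transformers import SentenceTransformer, util

    # Illustrative embedding-similarity linkage (not KAMEL's actual model or scoring).
    model = SentenceTransformer("all-MiniLM-L6-v2")
    exclusion = "Respiratory illnesses are excluded due to a pre-existing asthma condition"
    codes = {
        "J45.909": "Unspecified asthma, uncomplicated",
        "I10": "Essential (primary) hypertension",
    }

    exc_emb = model.encode(exclusion, convert_to_tensor=True)
    for code, description in codes.items():
        score = util.cos_sim(exc_emb, model.encode(description, convert_to_tensor=True)).item()
        print(code, round(score, 3))   # a higher score suggests the diagnosis falls under the exclusion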



Paperid:2553
Authors:Johannes Rehm, Irina Reshodko, Stian Zimmermann Børresen, Odd Erik Gundersen
Norwegian University of Science and Technology, Way AS, Way AS, Norwegian University of Science and Technology
Abstract:
This paper introduces the design, development, and deployment of a Virtual Driving Instructor (VDI) for enhanced driver education. The VDI provides personalized, real-time feedback to students in a driving simulator, addressing some of the limitations of traditional driver instruction. Employing a hybrid AI system, the VDI combines rule-based agents, learning-based agents, knowledge graphs, and Bayesian networks to assess and monitor student performance in a comprehensive manner. Implemented in multiple simulators at a driving school in Norway, the system aims to leverage AI and driving simulation to improve both the learning experience and the efficiency of instruction. Initial feedback from students has been largely positive, highlighting the effectiveness of this integration while also pointing to areas for further improvement. This work marks a significant stride in infusing technology into driver education, offering a scalable and efficient approach to instruction.



Paperid:2554
Authors:Yuliang Shi, Lin Cheng, Cheng Jiang, Hui Zhang, Guifeng Li, Xiaoli Tang, Han Yu, Zhiqi Shen, Cyril Leung
School of Software, Shandong University (SDU), Jinan, China Dareway Software Co. Ltd, Jinan, China Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University (SDU), Jinan, China, School of Software, Shandong University (SDU), Jinan, China, Dareway Software Co. Ltd, Jinan, China, School of Software, Shandong University (SDU), Jinan, China Dareway Software Co. Ltd, Jinan, China, Dareway Software Co. Ltd, Jinan, China, Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University (SDU), Jinan, China School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore, Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University (SDU), Jinan, China School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore, Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University (SDU), Jinan, China School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore, Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University (SDU), Jinan, China Department of Electrical and Computer Engineering, The University of British Columbia (UBC), Vancouver, BC, Canada
Abstract:
Social insurance benefits qualification assessment is an important task to ensure that retirees enjoy their benefits according to the regulations. It also plays a key role in curbing social security fraud. In this paper, we report the deployment of the Intelligent Benefit Certification and Analysis (IBCA) platform, an AI-empowered platform for verifying the status of retirees to ensure the proper disbursement of funds in Shandong province, China. Based on an improved Gated Recurrent Unit (GRU) neural network, IBCA combines missing-value interpolation, temporal information, and global and local feature extraction to perform accurate retiree survival rate prediction. Based on the predicted results, a reliability assessment mechanism based on a Variational Auto-Encoder (VAE) and Monte-Carlo Dropout (MC Dropout) is then applied to assess the reliability of the predictions. Deployed since November 2019, the IBCA platform has been adopted by 12 cities across the Shandong province, handling over 50 terabytes of data. It has empowered human resources and social services, civil affairs, and health care institutions to collaboratively provide high-quality public services. Under the IBCA platform, the efficiency of resource utilization as well as the accuracy of benefit qualification assessment have been significantly improved. It has helped Dareway Software Co. Ltd earn over RMB 50 million in revenue.
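For readers unfamiliar with the MC Dropout component, the sketch below shows the generic idea: keep dropout active at inference and treat the spread of repeated predictions as an uncertainty estimate. The network is a small stand-in, not IBCA's GRU-based survival model, and the VAE part of the reliability mechanism is not shown.

    import torch

    # Generic Monte-Carlo Dropout reliability check (stand-in model, not IBCA's architecture).
    class TinyPredictor(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(8, 16), torch.nn.ReLU(),
                torch.nn.Dropout(p=0.2),
                torch.nn.Linear(16, 1), torch.nn.Sigmoid(),
            )

        def forward(self, x):
            return self.net(x)

    def mc_dropout_predict(model, x, passes=50):
        model.train()                          # keeps dropout layers active at inference time
        with torch.no_grad():
            preds = torch.stack([model(x) for _ in range(passes)])
        return preds.mean(dim=0), preds.std(dim=0)   # mean prediction and its uncertainty

    model = TinyPredictor()
    x = torch.randn(1, 8)
    mean, std = mc_dropout_predict(model, x)
    print(float(mean), float(std))             # a large std flags a case for manual review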



Paperid:2555
Authors:Hao Sun, Xiaoli Tang, Chengyi Yang, Zhenpeng Yu, Xiuli Wang, Qijie Ding, Zengxiang Li, Han Yu
ENN Group, Nanyang Technological University, ENN Group, ENN Group, ENN Group, ENN Group, ENN Group, Nanyang Technological University (NTU)
Abstract:
Gas usage estimation plays a critical role in various aspects of the power generation and delivery business, including budgeting, resource planning, and environmental preservation. Federated Learning (FL) has demonstrated its potential in enhancing the accuracy and reliability of gas usage estimation by enabling distributedly owned data to be leveraged, while ensuring privacy and confidentiality. However, to effectively motivate stakeholders to contribute their high-quality local data and computational resources for this purpose, incentive mechanism design is key. In this paper, we report our experience designing and deploying the Hierarchical FL Incentive mechanism for Gas usage estimation (HiFi-Gas) system. It is designed to cater to the unique structure of gas companies and their affiliated heating stations. HiFi-Gas provides effective incentivization in a hierarchical federated learning framework that consists of a horizontal federated learning (HFL) component for effective collaboration among gas companies and multiple vertical federated learning (VFL) components for each gas company and its affiliated heating stations. To motivate active participation and ensure fairness among gas companies and heating stations, we incorporate a multi-dimensional contribution-aware reward distribution function that considers both data quality and model contributions. Since its deployment in the ENN Group in December 2022, HiFi-Gas has successfully provided incentives for gas companies and heating stations to actively participate in FL training, resulting in more than 12% higher average gas usage estimation accuracy and substantial gas procurement cost savings. This implementation marks the first successful deployment of a hierarchical FL incentive approach in the energy industry.
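As a minimal illustration of a contribution-aware reward split, the sketch below combines a data-quality score and a model-contribution score per participant and distributes a reward budget proportionally. The weighting, scores, and names are hypothetical; this is not HiFi-Gas's actual reward distribution function.

    # Illustrative contribution-aware reward split (hypothetical scores and weighting).
    def distribute_rewards(total_reward, participants, alpha=0.5):
        """participants: dict name -> (data_quality, model_contribution), both non-negative."""
        combined = {
            name: alpha * quality + (1 - alpha) * contribution
            for name, (quality, contribution) in participants.items()
        }
        norm = sum(combined.values()) or 1.0
        return {name: total_reward * score / norm for name, score in combined.items()}

    shares = distribute_rewards(
        total_reward=1000.0,
        participants={"gas_co_A": (0.9, 0.7), "station_B": (0.6, 0.8), "station_C": (0.3, 0.2)},
    )
    print(shares)   # participants with better data and larger contributions receive larger shares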



Paperid:2556
Authors:Siva Likitha Valluru, Biplav Srivastava, Sai Teja Paladi, Siwen Yan, Sriraam Natarajan
University of South Carolina, University of South Carolina, University of South Carolina, University of Texas at Dallas, University of Texas at Dallas
Abstract:
Building teams and promoting collaboration are two very common business activities. An example of these is seen in the TeamingForFunding problem, where research institutions and researchers are interested in identifying collaborative opportunities when applying to funding agencies in response to the latter's calls for proposals. We describe a novel deployed system to recommend teams using a variety of AI methods, such that (1) each team achieves the highest possible skill coverage that is demanded by the opportunity, and (2) the workload of distributing the opportunities is balanced amongst the candidate members. We address these questions by extracting skills latent in open data of proposal calls (demand) and researcher profiles (supply), normalizing them using taxonomies, and creating efficient algorithms that match demand to supply. We create teams to maximize goodness along a novel metric balancing short- and long-term objectives. We validate the success of our algorithms (1) quantitatively, by evaluating the recommended teams using a goodness score, finding that more informed methods lead to recommendations of a smaller number of teams but with higher goodness, and (2) qualitatively, by conducting a large-scale user study at a college-wide level, which demonstrates that users overall found the tool very useful and relevant. Lastly, we evaluate our system in two diverse settings in the US and India (of researchers and proposal calls) to establish the generality of our approach, and deploy it at a major US university for routine use.
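Skill coverage with workload balance is the core matching problem here. Below is a minimal greedy sketch of that kind of selection, not the paper's actual goodness metric or algorithm; the researcher profiles, skills, and workload cap are illustrative assumptions.

```python
# Sketch: greedy team selection for skill coverage with a simple workload cap.
def build_team(required_skills, researchers, max_load, load):
    """Greedily add the researcher covering the most still-uncovered skills."""
    team, uncovered = [], set(required_skills)
    while uncovered:
        best = max(
            (r for r in researchers if load[r] < max_load),
            key=lambda r: len(uncovered & researchers[r]),
            default=None,
        )
        if best is None or not (uncovered & researchers[best]):
            break                                   # no one left who adds coverage
        team.append(best)
        uncovered -= researchers[best]
        load[best] += 1                             # balances workload across repeated calls
    return team, uncovered

researchers = {"alice": {"nlp", "hci"}, "bob": {"optimization"}, "eve": {"nlp", "vision"}}
team, missing = build_team({"nlp", "vision", "optimization"}, researchers,
                           max_load=2, load={r: 0 for r in researchers})
```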



Paperid:2557
Authors:Prerna Agarwal, Harshit Dave, Jayachandu Bandlamudi, Renuka Sindhgatta, Kushal Mukherjee
IBM Research, ABV-IIITM Gwalior, IBM Research, IBM Research, IBM Research
Abstract:
Traditional business processes such as loan processing, order processing, or procurement have a series of steps that are predefined at design time and executed by enterprise systems. Recent advancements in new-age businesses, however, focus on having adaptive and ad-hoc processes by stitching together a set of functions or steps enabled through autonomous agents. Further, to enable business users to execute a flexible set of steps, there have been works on providing a conversational interface to interact with and execute automation. Often, it is necessary to guide the user through the set of possible steps in the process (or workflow). Existing work on recommending the next agent to run relies on historical data. However, with changing workflows and new automation constantly getting added, it is important to provide recommendations without historical data. Additionally, hand-crafted recommendation rules do not scale. The adaptive workflow, being a combination of structured and unstructured information, is harder to mine. Hence, in this work, we leverage Large Language Models (LLMs) to combine process knowledge with the meta-data of agents to discover next best agents (NBAs), specifically at cold start. We propose a multi-stage approach that uses existing process knowledge and agent meta-data information to prompt an LLM and recommend meaningful NBAs based on user utterances.



Paperid:2558
Authors:J. Fredrik R. Bjørnland, Yrjar Gedde, Johannes Rehm, Irina Reshodko, Odd Erik Gundersen
Norwegian University of Science and Technology, Norwegian University of Science and Technology, Norwegian University of Science and Technology Way AS, Norwegian University of Science and Technology Way AS, Norwegian University of Science and Technology
Abstract:
Currently, students acquire driving skills by practicing in actual traffic conditions and through direct interactions with an instructor. While one-on-one interactions can be tailored to a student’s learning style and skill level, making them effective for learning, they are also inefficient, potentially costly, and not standardized, with limitations on which traffic situations can be safely taught. For these exact reasons, Way AS has developed and commercially deployed a virtual driving instructor that educates students in high-fidelity simulators. In this paper, we present a module, the Lesson generator, that extends the virtual driving instructor to generate personalized lessons for individual students, with the goal of practicing, in a focused and deliberate fashion, the skills students need to become proficient drivers. A case study is presented, and the path to deployment is discussed.



Paperid:2559
Authors:Marc W. Brittain, Luis E. Alvarez, Kara Breeden
MIT Lincoln Laboratory, MIT Lincoln Laboratory, MIT Lincoln Laboratory
Abstract:
Advanced Air Mobility (AAM) introduces a new, efficient mode of transportation with the use of vehicle autonomy and electrified aircraft to provide increasingly autonomous transportation between previously underserved markets. Safe and efficient navigation of low altitude aircraft through highly dense environments requires the integration of a multitude of complex observations, such as surveillance, knowledge of vehicle dynamics, and weather. The processing and reasoning on these observations pose challenges due to the various sources of uncertainty in the information while ensuring cooperation with a variable number of aircraft in the airspace. These challenges, coupled with the requirement to make safety-critical decisions in real time, rule out the use of conventional separation assurance techniques. We present a decentralized reinforcement learning framework to provide autonomous self-separation capabilities within AAM corridors with the use of speed and vertical maneuvers. The problem is formulated as a Markov Decision Process and solved by developing a novel extension to the sample-efficient, off-policy soft actor-critic (SAC) algorithm. We introduce the use of attention networks for variable-length observation processing and a distributed computing architecture to achieve high training sample throughput as compared to existing approaches. A comprehensive numerical study shows that the proposed framework can ensure safe and efficient separation of aircraft in high density, dynamic environments with various sources of uncertainty.
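The abstract pairs SAC with attention over a variable number of nearby aircraft. Below is a minimal sketch of attention pooling over variable-length observations, the general idea rather than the paper's architecture; observation dimensions, head counts, and the padding mask are illustrative assumptions.

```python
# Sketch: attention pooling over a variable number of intruder observations.
import torch
import torch.nn as nn

class IntruderAttentionEncoder(nn.Module):
    def __init__(self, obs_dim=7, embed_dim=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, embed_dim)
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)  # learned pooling query
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, intruders, pad_mask):
        # intruders: (batch, max_intruders, obs_dim); pad_mask: True where padded
        keys = self.embed(intruders)
        q = self.query.expand(intruders.size(0), -1, -1)
        pooled, _ = self.attn(q, keys, keys, key_padding_mask=pad_mask)
        return pooled.squeeze(1)           # fixed-size state vector for the policy

obs = torch.randn(2, 5, 7)                 # 2 ownships, up to 5 intruders (assumed shapes)
mask = torch.tensor([[False] * 3 + [True] * 2, [False] * 5])
state = IntruderAttentionEncoder()(obs, mask)   # shape: (2, 64)
```

The fixed-size output lets a standard actor-critic head consume scenes with any number of surrounding aircraft.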



Paperid:2560
Authors:Glenn Bruns, Michael Haidar
California State University, Monterey Bay, California State University, Monterey Bay
Abstract:
In neural memory decoding, a concept being mentally recalled is identified using brain data. Recently, the feasibility of neural memory decoding with EEG data has been demonstrated. Here we propose a new application – neural information retrieval – that uses neural memory decoding to allow a document to be retrieved merely by thinking about it. In this paper we describe neural memory decoding, define the application of neural information retrieval, present experimental results related to the practicality of the application, and discuss issues of deployment and data privacy.



Paperid:2561
Authors:Maria Chang, Achille Fokoue, Rosario Uceda-Sosa, Parul Awasthy, Ken Barker, Sadhana Kumaravel, Oktie Hassanzadeh, Elton Soares, Tian Gao, Debarun Bhattacharjya, Radu Florian, Salim Roukos
IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research
Abstract:
Chronological and Hierarchical Reasoning Over Naturally Occurring Schemas (CHRONOS) is a system that combines language model-based natural language processing with symbolic knowledge representations to analyze and make predictions about newsworthy events. CHRONOS consists of an event-centric information extraction pipeline and a complex event schema instantiation and prediction system. Resulting predictions are detailed with arguments, event types from Wikidata, schema-based justifications, and source document provenance. We evaluate our system by its ability to capture the structure of unseen events described in news articles and make plausible predictions as judged by human annotators.



Paperid:2562
Authors:Muntabir Hasan Choudhury, Lamia Salsabil, William A. Ingram, Edward A. Fox, Jian Wu
Old Dominion University, Old Dominion University, Virginia Polytechnic Institute and State University, Virginia Polytechnic Institute and State University, Old Dominion University
Abstract:
Electronic theses and dissertations (ETDs) have been proposed, advocated, and generated for more than 25 years. Although ETDs are hosted by commercial or institutional digital library repositories, they are still an understudied type of scholarly big data, partially because they are usually longer than conference and journal papers. Segmenting ETDs will allow researchers to study sectional content. Readers can navigate to particular pages of interest, to discover and explore the content buried in these long documents. Most existing frameworks on document page classification are designed for classifying general documents and perform poorly on ETDs. In this paper, we propose ETDPC. Its backbone is a two-stream multimodal model with a cross-attention network to classify ETD pages into 13 categories. To overcome the challenge of imbalanced labeled samples, we augmented data for minority categories and employed a hierarchical classifier. ETDPC outperforms the state-of-the-art models in all categories, achieving an F1 of 0.84 -- 0.96 for 9 out of 13 categories. We also demonstrated its data efficiency. The code and data can be found on GitHub (https://github.com/lamps-lab/ETDMiner/tree/master/etd_segmentation).



Paperid:2563
Authors:Rupasree Dey, Alan Fern
Oregon State University, Oregon State University
Abstract:
Fire departments conduct inspections to prevent fires, but it is unclear how to best allocate their limited inspection resources across the properties in a city. Currently, they use their intuition and experience to decide which properties to inspect and lack a data-driven approach that could lead to a more principled use of inspection resources. The main contribution of this paper is to investigate such an approach, based on machine learning for predicting a fire risk score for properties in a city from historical fire-incident data. These scores can then be used to help prioritize inspection resources toward higher-risk properties. We present a case study using data from a South Dakota fire department which contains information about properties in a city along with records of fire incidents. We use this data, covering more than 72,000 properties, to train a machine learning model to predict fire risk and evaluate its ability to rank the fire risk of properties in the city. We conduct and analyze experiments with variations of XGBoost, an algorithm well-suited to the challenges in this application, including missing data and a highly skewed class distribution. Our evaluation of the model-generated rankings, based on ranking metrics, shows that the model significantly outperforms random rankings and other natural baselines. We also analyze the feature importance computed for the models, which provides further insight into model behavior. This model has been integrated into an interface for displaying the rankings across a city and is ready for beta testing.
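XGBoost is explicitly named as the model, with missing data and a highly skewed class distribution as the main challenges. Below is a minimal sketch of that setup on synthetic data; the feature matrix, positive rate, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# Sketch: fire-risk ranking with XGBoost on synthetic, imbalanced data with missing values.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 12))
X[rng.random(X.shape) < 0.1] = np.nan           # missing property attributes (handled natively)
y = (rng.random(5000) < 0.03).astype(int)       # ~3% of properties had a fire incident

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=4,
    scale_pos_weight=(y == 0).sum() / max((y == 1).sum(), 1),  # rebalance skewed classes
)
model.fit(X, y)

risk_scores = model.predict_proba(X)[:, 1]      # higher score = inspect sooner
ranking = np.argsort(-risk_scores)              # properties ordered by predicted risk
```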



Paperid:2564
Authors:Bhanu Teja Gullapalli, Stephanie Carreiro, Brittany P Chapman, Eric L Garland, Tauhidur Rahman
University of California San Diego, UMass Chan Medical School, UMass Chan Medical School, University of Utah, University of California San Diego
Abstract:
Long-term and high-dose prescription opioid use places individuals at risk for opioid misuse, opioid use disorder (OUD), and overdose. Existing methods for monitoring opioid use and detecting misuse rely on self-reports, which are prone to reporting bias, and toxicology testing, which may be infeasible in outpatient settings. Although wearable technologies for monitoring day-to-day health metrics have gained significant traction in recent years due to their ease of use, flexibility, and advancements in sensor technology, their application within the opioid use space remains underexplored. In the current work, we demonstrate that oral opioid administrations can be detected using physiological signals collected from a wrist sensor. More importantly, we show that models informed by opioid pharmacokinetics increase reliability in predicting the timing of opioid administrations. Forty-two individuals who were prescribed opioids as a part of their medical treatment in-hospital and after discharge were enrolled. Participants wore a wrist sensor throughout the study, while opioid administrations were tracked using electronic medical records and self-reports. We collected 1,983 hours of sensor data containing 187 opioid administrations from the inpatient setting and 927 hours of sensor data containing 40 opioid administrations from the outpatient setting. We demonstrate that a self-supervised pre-trained model, capable of learning the canonical time series of plasma concentration of the drug derived from opioid pharmacokinetics, can reliably detect opioid administration in both settings. Our work suggests the potential of pharmacokinetic-informed, data-driven models to objectively detect opioid use in daily life.
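The "canonical time series of plasma concentration" that informs the model comes from standard pharmacokinetics. Below is a minimal sketch of the one-compartment, first-order-absorption curve (the Bateman equation) often used for oral dosing; all rate constants, dose, and volume values are illustrative, not drug-specific values from the study.

```python
# Sketch: canonical plasma-concentration curve after a single oral dose (Bateman equation).
import numpy as np

def plasma_concentration(t_hours, dose_mg=10.0, ka=1.2, ke=0.3,
                         volume_l=40.0, bioavailability=0.9):
    """One-compartment model with first-order absorption (illustrative constants)."""
    coef = bioavailability * dose_mg * ka / (volume_l * (ka - ke))
    return coef * (np.exp(-ke * t_hours) - np.exp(-ka * t_hours))

t = np.linspace(0, 12, 145)          # 5-minute grid over 12 hours
curve = plasma_concentration(t)      # rises to a peak, then decays
```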



Paperid:2565
Authors:Sawinder Kaur, Yi Xiao, Asif Salekin
Syracuse University, Syracuse University, Syracuse University
Abstract:
AI's widespread integration has led to neural network (NN) deployment on edge and similar limited-resource platforms for safety-critical scenarios. Yet, NNs' fragility raises concerns about reliable inference. Moreover, constrained platforms demand compact networks. This study introduces VeriCompress, a tool that automates the search and training of compressed models with robustness guarantees. These models are well-suited for safety-critical applications and adhere to predefined architecture and size limitations, making them deployable on resource-restricted platforms. The method trains models 2-3 times faster than the state-of-the-art approaches, surpassing them by average accuracy and robustness gains of 15.1 and 9.8 percentage points, respectively. When deployed on a resource-restricted generic platform, these models require 5-8 times less memory and 2-4 times less inference time than models used in the verified robustness literature. Our comprehensive evaluation across various model architectures and datasets, including MNIST, CIFAR, SVHN, and a relevant pedestrian detection dataset, showcases VeriCompress's capacity to identify compressed verified robust models with reduced computation overhead compared to current standards. This underscores its potential as a valuable tool for end users, such as developers of safety-critical applications on edge or Internet of Things platforms, empowering them to create suitable models for safety-critical, resource-constrained platforms in their respective domains.



Paperid:2566
Authors:Harsh Kumar, Tong Li, Jiakai Shi, Ilya Musabirov, Rachel Kornfield, Jonah Meyerhoff, Ananya Bhattacharjee, Chris Karr, Theresa Nguyen, David Mohr, Anna Rafferty, Sofia Villar, Nina Deliu, Joseph Jay Williams
University of Toronto, University of Toronto, University of Toronto, University of Toronto, Northwestern University, Northwestern University, University of Toronto, Audacious Software, Mental Health America, Northwestern University, Carleton College, University of Cambridge, University of Cambridge Sapienza University of Rome, University of Toronto
Abstract:
Digital mental health (DMH) interventions, such as text-message-based lessons and activities, offer immense potential for accessible mental health support. While these interventions can be effective, real-world experimental testing can further enhance their design and impact. Adaptive experimentation, utilizing algorithms like Thompson Sampling for (contextual) multi-armed bandit (MAB) problems, can lead to continuous improvement and personalization. However, it remains unclear when these algorithms can simultaneously increase user experience rewards and facilitate appropriate data collection for social-behavioral scientists to analyze with sufficient statistical confidence. Although a growing body of research addresses the practical and statistical aspects of MAB and other adaptive algorithms, further exploration is needed to assess their impact across diverse real-world contexts. This paper presents a software system developed over two years that allows text-messaging intervention components to be adapted using bandit and other algorithms while collecting data for side-by-side comparison with traditional uniform random non-adaptive experiments. We evaluate the system by deploying a text-message-based DMH intervention to 1100 users, recruited through a large mental health non-profit organization, and share the path forward for deploying this system at scale. This system not only enables applications in mental health but could also serve as a model testbed for adaptive experimentation algorithms in other domains.
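Thompson Sampling for multi-armed bandits is the adaptive algorithm named in the abstract. Below is a minimal Beta-Bernoulli Thompson Sampling sketch for choosing among message variants; the arm names, engagement rates, and reward simulation are illustrative assumptions, not the deployed system.

```python
# Sketch: Beta-Bernoulli Thompson Sampling over message variants (illustrative).
import numpy as np

rng = np.random.default_rng(42)
arms = ["reflection_prompt", "coping_tip", "gratitude_exercise"]   # hypothetical variants
successes = np.ones(len(arms))            # Beta(1, 1) uniform priors
failures = np.ones(len(arms))
true_rates = np.array([0.30, 0.45, 0.25]) # hidden engagement rates (simulation only)

for _ in range(1000):
    samples = rng.beta(successes, failures)      # one posterior draw per arm
    arm = int(np.argmax(samples))                # send the variant that looks best
    reward = rng.random() < true_rates[arm]      # did the user engage?
    successes[arm] += reward
    failures[arm] += 1 - reward

print(dict(zip(arms, successes / (successes + failures))))   # posterior mean per arm
```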



Paperid:2567
Authors:Arshika Lalan, Shresth Verma, Paula Rodriguez Diaz, Panayiotis Danassis, Amrita Mahale, Kumar Madhu Sudan, Aparna Hegde, Milind Tambe, Aparna Taneja
Google Research India, Google Research India, Harvard University (SEAS), Harvard University (SEAS), ARMMAN, ARMMAN, ARMMAN, Google Research India, Google Research India
Abstract:
Harnessing the widespread availability of cell phones, many nonprofits have launched mobile health (mHealth) programs to deliver information via voice or text to beneficiaries in underserved communities, with maternal and infant health being a key area of such mHealth programs. Unfortunately, dwindling listenership is a major challenge, requiring targeted interventions using limited resources. This paper focuses on Kilkari, the world's largest mHealth program for maternal and child care -- with over 3 million active subscribers at a time -- launched by India's Ministry of Health and Family Welfare (MoHFW) and run by the non-profit ARMMAN. We present a system called CHAHAK that aims to reduce automated dropouts as well as boost engagement with the program through the strategic allocation of interventions to beneficiaries. Past work in a similar domain has focused on a much smaller scale mHealth program and used Markovian restless multi-armed bandits to optimize a single limited intervention resource. However, this paper demonstrates the challenges in adopting a Markovian approach in Kilkari; therefore, CHAHAK instead relies on non-Markovian time-series restless bandits, and optimizes a layered set of multiple interventions to improve listenership. We use real Kilkari data from the Odisha state in India to show CHAHAK's effectiveness in harnessing multiple interventions to boost listenership, benefiting marginalized communities. When deployed, CHAHAK will assist the largest maternal mHealth program to date.



Paperid:2568
Authors:Bingxuan Li, Antonio Castellanos, Pengyi Shi, Amy Ward
Purdue University, University of Chicago, Purdue University, University of Chicago
Abstract:
Incarceration-diversion programs have proven effective in reducing recidivism. Accurate prediction of the number of individuals with different characteristics in the program, and of their program outcomes, based on given eligibility criteria is crucial for successful implementation, because this prediction serves as the foundation for determining the appropriate program size and the consequent staffing requirements. However, this task poses challenges due to the complexities arising from varied outcomes and lengths-of-stay for the diverse individuals in incarceration-diversion programs. In collaboration with an Illinois government agency, we develop a framework to address these issues. Our framework combines ML and queueing model simulation, providing accurate predictions for the program census and interpretable insights into program dynamics and the impact of different decisions in counterfactual scenarios. Additionally, we deploy a beta version of a user-friendly web app that allows program managers to visualize census data by counties and race groups. We showcase two decision support use cases: changing program admission criteria and launching similar programs in new counties.



Paperid:2569
Authors:ChungYi Lin, Shen-Lung Tung, Hung-Ting Su, Winston H. Hsu
Internet of Things Laboratory, Chunghwa Telecom Laboratories National Taiwan University, Internet of Things Laboratory, Chunghwa Telecom Laboratories, National Taiwan University, National Taiwan University Mobile Drive Technology
Abstract:
To address the limitations of traffic prediction from location-bound detectors, we present Geographical Cellular Traffic (GCT) flow, a novel data source that leverages the extensive coverage of cellular traffic to capture mobility patterns. Our extensive analysis validates its potential for transportation. Focusing on vehicle-related GCT flow prediction, we propose a graph neural network that integrates multivariate, temporal, and spatial facets for improved accuracy. Experiments reveal our model's superiority over baselines, especially in long-term predictions. We also highlight the potential for GCT flow integration into transportation systems.



Paperid:2570
Authors:Karol Lynch, Bradley Eck, Joern Ploennigs
IBM Research Europe, Dublin, Ireland, IBM Research Europe, Dublin, Ireland, IBM Research Europe, Dublin, Ireland University of Rostock, Rostock, Germany
Abstract:
Mathematical formulas give concise representations of a document's key ideas in many natural sciences and engineering domains. The symbols that make up formulas carry semantic meaning that may differ by document or equation. What does a particular symbol mean in a given paper? Interpreting the symbols that comprise formulas requires identifying descriptions from the surrounding text. We approach this task of symbol description reading as an application of current AI technologies targeting the tuning of large language models for particular domains and the automation of machine learning. Our pipeline integrates AI question answering and natural language processing to read symbol descriptions. We consider extractive and generative AI model variations and apply our pipeline to two example tasks of symbol description reading. Promising results provide motivation for wider deployment, for which we describe a microservice architecture and related challenges.
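The pipeline frames symbol description reading as question answering over surrounding text. Below is a minimal sketch of the extractive variant using an off-the-shelf question-answering model; the default pipeline model, the context paragraph, and the symbol are illustrative assumptions rather than the system described in the paper.

```python
# Sketch: extractive symbol-description reading via off-the-shelf QA (illustrative).
from transformers import pipeline

qa = pipeline("question-answering")   # downloads a default SQuAD-style model

context = ("The heat flux q is computed from the thermal conductivity k "
           "and the temperature gradient across the plate.")
result = qa(question="What does the symbol q denote?", context=context)
print(result["answer"])               # expected to point at the description of q
```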



Paperid:2571
Authors:Prateeti Mohapatra, Gargi Dasgupta
IBM Research, IBM Software
Abstract:
Technical support services get several thousand voice calls every year. These calls vary across a range of technical issues or maintenance requests for a suite of hardware and software products. On receiving the call, a support agent creates a service request artifact that contains her interpretation of the customer’s problem. This service request goes through the life cycle of the problem remediation process, with the resolution also being recorded as part of the service request. It has been empirically observed that the actual complaint voiced by the customer is often different from the recorded interpretation in the service request. The service request created by support agents runs the risk of missing key information elements present in the customer voice records. In this paper, we build a framework that taps into voice calls and uses unsupervised and supervised learning methods to enrich the service requests with additional information. The enriched data is then used for automated problem resolution.



Paperid:2572
Authors:Leopold Müller, Patrick Hemmer, Moritz Queisner, Igor Sauer, Simeon Allmendinger, Johannes Jakubik, Michael Vössing, Niklas Kühl
Fraunhofer FIT University of Bayreuth Karlsruhe Institute of Technology, Karlsruhe Institute of Technology, Charité Universitätsmedizin Berlin, Charité Universitätsmedizin Berlin, Fraunhofer FIT University of Bayreuth Karlsruhe Institute of Technology, Karlsruhe Institute of Technology, Karlsruhe Institute of Technology, Fraunhofer FIT University of Bayreuth Karlsruhe Institute of Technology
Abstract:
A significant challenge in image-guided surgery is the accurate measurement of relevant structures such as vessel segments, resection margins, or bowel lengths. While this task is an essential component of many surgeries, it involves substantial human effort and is prone to inaccuracies. In this paper, we develop a novel human-AI-based method for laparoscopic measurements utilizing stereo vision that has been guided by practicing surgeons. Based on a holistic qualitative requirements analysis, this work proposes a comprehensive measurement method, which comprises state-of-the-art machine learning architectures, such as RAFT-Stereo and YOLOv8. The developed method is assessed in various realistic experimental evaluation environments. Our results outline the potential of our method to achieve high accuracy in distance measurements, with errors below 1 mm. Furthermore, on-surface measurements demonstrate robustness when applied in challenging environments with textureless regions. Overall, by addressing the inherent challenges of image-guided surgery, we lay the foundation for a more robust and accurate solution for intra- and postoperative measurements, enabling more precise, safe, and efficient surgical procedures.
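Stereo-based measurement ultimately reduces to converting disparities into metric 3D points and measuring between them. Below is a minimal sketch of that geometric step; the focal length, baseline, principal point, and picked pixels are illustrative assumptions, not the paper's calibration or its RAFT-Stereo output.

```python
# Sketch: disparity -> 3D back-projection -> metric distance (illustrative calibration).
import numpy as np

def pixel_to_3d(u, v, disparity_px, focal_px, baseline_m, cx, cy):
    """Back-project a pixel with known disparity into camera coordinates."""
    z = focal_px * baseline_m / disparity_px   # depth in metres
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return np.array([x, y, z])

# Two endpoints of a structure picked on the left image (hypothetical values).
p1 = pixel_to_3d(612, 340, disparity_px=48.0, focal_px=1050.0,
                 baseline_m=0.004, cx=640.0, cy=360.0)
p2 = pixel_to_3d(690, 355, disparity_px=45.5, focal_px=1050.0,
                 baseline_m=0.004, cx=640.0, cy=360.0)
distance_mm = np.linalg.norm(p1 - p2) * 1000.0   # straight-line distance in millimetres
```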



Paperid:2573
Authors:Anh N. Nhu, Abderahmen Zoghbi
University of Maryland, College Park, MD 20742, USA CRESST II, NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA HEASARC, Code 6601, NASA/GSFC, Greenbelt, MD 20771, USA, University of Maryland, College Park, MD 20742, USA CRESST II, NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA HEASARC, Code 6601, NASA/GSFC, Greenbelt, MD 20771, USA
Abstract:
The Neutron star Interior Composition Explorer (NICER) is an International Space Station (ISS)-based space telescope developed by NASA and devoted to the study of high-energy X-ray sources in the universe, including but not limited to neutron stars, pulsars, and black holes in stellar systems and active galactic nuclei (AGN). One prominent problem with NICER observations is the highly variable background spectra, which obscure the actual signals of astrophysical sources and negatively affect scientific analysis of the targets. Therefore, obtaining accurate estimates of the background spectra is crucial to filter the noise and facilitate better scientific discoveries of new astronomical objects. In this paper, we propose the first deep neural network architecture to model the NICER background spectra variation using information about the spacecraft and telescope associated with each observation. In particular, we develop a BERT-based architecture with tokenizers applied to different groups of features in our tabular dataset. We also introduce an adapted Tabular Deep Residual Network architecture as the predictor following the Transformer modules in our network. We show that our model outperforms the current state-of-the-art background model developed by the NICER team in most evaluation metrics. Finally, we discuss pathways and future work for the deployment of this model in NASA’s next versions of the HEASARC software packages.



Paperid:2574
Authors:Santosh Palaskar, Vijay Ekambaram, Arindam Jati, Neelamadhav Gantayat, Avirup Saha, Seema Nagar, Nam H. Nguyen, Pankaj Dayama, Renuka Sindhgatta, Prateeti Mohapatra, Harshit Kumar, Jayant Kalagnanam, Nandyala Hemachandra, Narayan Rangaraj
Indian Institute of Technology Bombay, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research, IBM, Research, IBM Research, IBM Research, IBM Research, IBM Research, Indian Institute of Technology Bombay, IIT Bombay
Abstract:
The efficiency of business processes relies on business key performance indicators (Biz-KPIs), which can be negatively impacted by IT failures. Business and IT Observability (BizITObs) data fuses both Biz-KPIs and IT event channels together as multivariate time series data. Forecasting Biz-KPIs in advance can enhance efficiency and revenue through proactive corrective measures. However, BizITObs data generally exhibit both useful and noisy inter-channel interactions between Biz-KPIs and IT events that need to be effectively decoupled. This leads to suboptimal forecasting performance when existing multivariate forecasting models are employed. To address this, we introduce AutoMixer, a time-series Foundation Model (FM) approach, grounded on the novel technique of channel-compressed pretrain and finetune workflows. AutoMixer leverages an AutoEncoder for channel-compressed pretraining and integrates it with the advanced TSMixer model for multivariate time series forecasting. This fusion greatly enhances the potency of TSMixer for accurate forecasts and also generalizes well across several downstream tasks. Through detailed experiments and dashboard analytics, we show AutoMixer's capability to consistently improve Biz-KPI forecasting accuracy (by 11-15%), which directly translates to actionable business insights.
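Channel-compressed pretraining is the key technique named here: an autoencoder squeezes many correlated channels into a few latent channels before a downstream forecaster (TSMixer in the paper) is trained on them. Below is a minimal sketch of that compression step only; the dimensions, a purely linear encoder, and the random data are illustrative assumptions, not the AutoMixer implementation.

```python
# Sketch: channel-compression autoencoder pretraining for multivariate series (illustrative).
import torch
import torch.nn as nn

class ChannelAutoEncoder(nn.Module):
    def __init__(self, n_channels=64, latent_channels=8):
        super().__init__()
        self.encoder = nn.Linear(n_channels, latent_channels)
        self.decoder = nn.Linear(latent_channels, n_channels)

    def forward(self, x):                  # x: (batch, time, channels)
        z = self.encoder(x)                # compressed channel view fed to the forecaster
        return self.decoder(z), z

model = ChannelAutoEncoder()
x = torch.randn(32, 96, 64)                # 32 windows, 96 time steps, 64 channels (assumed)
recon, latent = model(x)
loss = nn.functional.mse_loss(recon, x)    # reconstruction objective for pretraining
loss.backward()
```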



Paperid:2575
Authors:Krishu K Thapa, Bhupinderjeet Singh, Supriya Savalkar, Alan Fern, Kirti Rajagopalan, Ananth Kalyanaraman
Washington State University, Washington State University, Washington State University, Oregon State University, Washington State University, Washington State University
Abstract:
Snow Water Equivalent (SWE), the amount of water available if the snowpack is melted, is a key decision variable used by water management agencies to make irrigation, flood control, power generation, and drought management decisions. SWE values vary spatiotemporally, affected by weather, topography, and other environmental factors. While daily SWE can be measured by Snow Telemetry (SNOTEL) stations with requisite instrumentation, such stations are spatially sparse, requiring interpolation techniques to create spatiotemporally complete data. While recent efforts have explored machine learning (ML) for SWE prediction, a number of recent ML advances have yet to be considered. The main contribution of this paper is to explore one such ML advance, attention mechanisms, for SWE prediction. Our hypothesis is that attention has a unique ability to capture and exploit correlations that may exist across locations or the temporal spectrum (or both). We present a generic attention-based modeling framework for SWE prediction and adapt it to capture spatial attention and temporal attention. Our experimental results on 323 SNOTEL stations in the Western U.S. demonstrate that our attention-based models outperform other machine-learning approaches. We also provide key results highlighting the differences between spatial and temporal attention in this context and a roadmap toward deployment for generating spatially complete SWE maps.



Paperid:2576
Authors:Bhavan Vasu, Steven Lu, Emily Dunkel, Kiri L. Wagstaff, Kevin Grimes, Michael Mcauley
Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA Oregon State University, Corvallis, OR 97331, USA, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA Oregon State University, Corvallis, OR 97331, USA, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA
Abstract:
The NASA Planetary Data System (PDS) hosts millions of images of planets, moons, and other bodies collected throughout many missions. The ever-expanding nature of data and user engagement demands an interpretable content classification system to support scientific discovery and individual curiosity. In this paper, we leverage a prototype-based architecture to enable users to understand and validate the evidence used by a classifier trained on images from the Mars Science Laboratory (MSL) Curiosity rover mission. In addition to providing explanations, we investigate the diversity and correctness of evidence used by the content-based classifier. The work presented in this paper will be deployed on the PDS Image Atlas, replacing its non-interpretable counterpart.



Paperid:2577
Authors:Monika Wysoczanska, Moran Beladev, Karen Lastmann Assaraf, Fengjun Wang, Ofri Kleinfeld, Gil Amsalem, Hadas Harush Boke
Warsaw University of Technology, Booking.com, Booking.com, Booking.com, Booking.com, Booking.com, Booking.com
Abstract:
Image collection summarization techniques aim to present a compact representation of an image gallery through a carefully selected subset of images that captures its semantic content. When it comes to web content, however, the ideal selection can vary based on the user's specific intentions and preferences. This is particularly relevant at Booking.com, where presenting properties and their visual summaries that align with users' expectations is crucial. To address this challenge, in this work, we consider user intentions in the summarization of property visuals by analyzing property reviews and extracting the most significant aspects mentioned by users. By incorporating the insights from reviews in our visual summaries, we enhance the summaries by presenting the relevant content to a user. Moreover, we achieve it without the need for costly annotations. Our experiments, including human perceptual studies, demonstrate the superiority of our cross-modal approach, which we coin CrossSummarizer, over the no-personalization and image-based clustering baselines.



Paperid:2578
Authors:Xi Yang, Rohan R. Arora, Saurabh Jha, Chandra Narayanaswami, Cheuk Lam, Jerrold Leichter, Yu Deng, Daby M. Sow
IBM Research, IBM Research, IBM Research, IBM Research, IBM Software, IBM Software, IBM Research, IBM Research
Abstract:
The widespread adoption of public and hybrid clouds, along with elastic resources and various automation tools for dynamic deployment, has accelerated the rapid provisioning of compute resources as needed. Despite these advancements, numerous resources persist unnecessarily due to factors such as poor digital hygiene, risk aversion, or the absence of effective tools, resulting in substantial costs and energy consumption. Existing threshold-based techniques prove inadequate in effectively addressing this challenge. To address this issue, we propose an unsupervised machine learning framework to automatically identify resources that can be de-provisioned completely or summoned on a schedule. Application of this approach to enterprise data has yielded promising initial results, facilitating the segregation of productive workloads with recurring demands from non-productive ones.



Paperid:2579
Authors:Md Nasim, Xinghang Zhang, Anter El-Azab, Yexiang Xue
Purdue University, Purdue University, Purdue University, Purdue University
Abstract:
The availability of terabyte-scale experiment data calls for AI-driven approaches which automatically discover scientific models from data. Nonetheless, significant challenges are present in AI-driven scientific discovery: (i) the annotation of large-scale datasets requires fundamental re-thinking in developing scalable crowdsourcing tools; (ii) the learning of scientific models from data calls for innovations beyond black-box neural nets; and (iii) novel visualization and diagnosis tools are needed for the collaboration of experimental and theoretical physicists and computer scientists. We present the Phase-Field-Lab platform for end-to-end phase field model discovery, which automatically discovers phase field physics models from experiment data, integrating experimentation, crowdsourcing, simulation and learning. Phase-Field-Lab combines (i) a streamlined annotation tool which reduces the annotation time (by ~50-75%), while increasing annotation accuracy compared to baseline; (ii) an end-to-end neural model which automatically learns phase field models from data by embedding phase field simulation and existing domain knowledge into learning; and (iii) novel interfaces and visualizations to integrate our platform into the scientific discovery cycle of domain scientists. Our platform is deployed in the analysis of nano-structure evolution in materials under extreme conditions (high temperature and irradiation). Our approach reveals new properties of nano-void defects, which otherwise cannot be detected via manual analysis.



Paperid:2580
Authors:Neil Thompson, Martin Fleming, Benny J. Tang, Anna M. Pastwa, Nicholas Borge, Brian C. Goehring, Subhro Das
MIT, The Productivity Institute, Varicent, MIT, MIT University of Warsaw, IBM, IBM, IBM
Abstract:
Deep learning, the most important subfield of machine learning and artificial intelligence (AI) over the last decade, is considered one of the fundamental technologies underpinning the Fourth Industrial Revolution. But despite its record-breaking history, deep learning’s enormous appetite for compute and data means that sometimes it can be too costly to use in practice. In this paper, we connect technical insights from deep learning scaling laws and transfer learning with the economics of IT to propose a framework for estimating the cost of deep learning computer vision systems that achieve a desired level of accuracy. Our tool can be of practical use to AI practitioners in industry or academia to guide investment decisions.
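The framework connects scaling laws with cost. Below is a minimal sketch of the kind of back-of-the-envelope estimate such a framework supports: a power-law error curve is inverted to get the data scale needed for a target error, then priced with assumed unit costs. All constants and cost factors here are illustrative assumptions, not the paper's calibrated values.

```python
# Sketch: inverting a power-law error curve into a rough cost estimate (illustrative).
def data_needed(target_error, a, b):
    """Invert error(D) = a * D**(-b) for the required dataset size D."""
    return (a / target_error) ** (1.0 / b)

a, b = 2.5, 0.35                                  # assumed fit from pilot experiments
D = data_needed(target_error=0.05, a=a, b=b)

labeling_cost = D * 0.08                          # assumed $ per labelled image
gpu_hours = 0.002 * D                             # assumed training-time scaling
compute_cost = gpu_hours * 2.50                   # assumed $ per GPU-hour
print(f"~{D:,.0f} images, ~${labeling_cost + compute_cost:,.0f} total")
```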



Paperid:2581
Authors:Maira Alvi, Tim French, Philip Keymer, Rachel Cardell-Oliver
The University of Western Australia, The University of Western Australia, The University of Queensland, The University of Western Australia
Abstract:
Complex urban systems can be difficult to monitor, diagnose and manage because the complete states of such systems are only partially observable with sensors. State estimation techniques can be used to determine the underlying dynamic behavior of such complex systems with their highly nonlinear processes and external time-variant influences. States can be estimated by clustering observed sensor readings. However, clustering performance degrades as the number of sensors and readings (i.e., the feature dimension) increases. To address this problem, we propose a framework that learns a feature-centric lower-dimensional representation of data for clustering to support analysis of system dynamics. We propose Unsupervised Feature Attention with Compact Representation (UFACR) to rank features contributing to a cluster assignment. These weighted features are then used to learn a reduced-dimension temporal representation of the data with a deep-learning model. The resulting low-dimensional representation can be effectively clustered into states. UFACR is evaluated on real-world and synthetic wastewater treatment plant data sets, and feature ranking outcomes were validated by wastewater treatment domain experts. Our quantitative and qualitative experimental analyses demonstrate the effectiveness of UFACR for uncovering system dynamics in an automated and unsupervised manner to offer guidance to wastewater engineers to enhance industrial productivity and treatment efficiency.



Paperid:2582
Authors:Zakaria Mehrab, Logan Stundal, Srinivasan Venkatramanan, Samarth Swarup, Bryan Leroy Lewis, Henning S. Mortveit, Christopher L. Barrett, Abhishek Pandey, Chad R. Wells, Alison P. Galvani, Burton H. Singer, Seyed M. Moghadas, David Leblang, Rita R. Colwell, Madhav V. Marathe
University of Virginia, University of Virginia, University of Virginia, University of Virginia, University of Virginia, University of Virginia, University of Virginia, Yale School of Public Health, Yale School of Public Health, Yale School of Public Health, University of Florida, York University, University of Virginia, University of Maryland, University of Virginia
Abstract:
Large-scale population displacements arising from conflict-induced forced migration generate uncertainty and introduce several policy challenges. Addressing these concerns requires an interdisciplinary approach that integrates knowledge from both computational modeling and the social sciences. We propose a generalized computational agent-based modeling framework, grounded in the Theory of Planned Behavior, to model conflict-induced migration outflows within Ukraine during the start of that conflict in 2022. Existing migration modeling frameworks that attempt to address policy implications primarily focus on the destination, leaving absent a generalized computational framework grounded in social theory and focused on the conflict-induced region. We propose an agent-based framework utilizing a spatiotemporal gravity model and a Bi-threshold model over a Graph Dynamical System to update the migration status of agents in conflict-induced regions at fine temporal and spatial granularity. This approach significantly outperforms previous work when examining the case of the Russian invasion of Ukraine. Policy implications of the proposed framework are demonstrated by modeling the migration behavior of Ukrainian civilians attempting to flee from regions encircled by Russian forces. We also showcase the generalizability of the model by simulating a past conflict in Burundi, an alternative conflict setting. Results demonstrate the utility of the framework for assessing conflict-induced migration in varied settings as well as identifying vulnerable civilian populations.



Paperid:2583
Authors:Arihant Chadda, Sean McGregor, Jesse Hostetler, Andrea Brennen
IQT Labs, UL Digital Safety Research Institute, UL Digital Safety Research Institute, IQT Labs
Abstract:
Intelligent system audits are labor-intensive assurance activities that are typically performed once and discarded, along with the opportunity to programmatically test all similar products for the market. This study illustrates how several incidents (i.e., harms) involving Named Entity Recognition (NER) can be prevented by scaling up a previously performed audit of NER systems. The audit instrument's diagnostic capacity is maintained through a security model that protects the underlying data (i.e., addresses Goodhart's Law). An open-source evaluation infrastructure is released along with an example derived from a real-world audit that reports aggregated findings without exposing the underlying data.



Paperid:2584
Authors:Kathrin Grosse, Lukas Bieringer, Tarek R. Besold, Battista Biggio, Alexandre Alahi
EPFL, Switzerland, QuantPi, Germany, TU Eindhoven, The Netherlands, University of Cagliari, Italy, EPFL, Switzerland
Abstract:
In contrast to vast academic efforts to study AI security, few real-world reports of AI security incidents exist. The incidents that have been released prevent a thorough investigation of the attackers' motives, as crucial information about the company and AI application is missing. As a consequence, it often remains unknown how to avoid incidents. We tackle this gap and combine previous reports with freshly collected incidents into a small database of 32 AI security incidents. We analyze the attackers' targets and goals, influencing factors, causes, and mitigations. Many incidents stem from non-compliance with best practices in security and privacy-enhancing technologies. In the case of direct AI attacks, access control may provide some mitigation, but there is little scientific work on best practices. Our paper is thus a call for action to address these gaps.



Paperid:2585
Authors:Eli Sherman, Ian Eisenberg
Credo AI, Credo AI
Abstract:
As AI systems’ sophistication and proliferation have increased, awareness of the risks has grown proportionally. The AI industry is increasingly emphasizing the need for transparency, with proposals ranging from standardizing the use of technical disclosures, like model cards, to regulatory licensing regimes. Since the AI value chain is complicated, with actors bringing varied expertise, perspectives, and values, it is crucial that consumers of transparency disclosures be able to understand the risks of the AI system in question. In this paper we propose a risk profiling standard which can guide downstream decision-making, including triaging further risk assessment, informing procurement and deployment, and directing regulatory frameworks. The standard is built on our proposed taxonomy of AI risks, which distills the wide variety of risks proposed in the literature into a high-level categorization. We outline the myriad data sources needed to construct informative Risk Profiles and propose a template and methodology for collating risk information into a standard, yet flexible, structure. We apply this methodology to a number of prominent AI systems using publicly available information. To conclude, we discuss design decisions for the profiles and future work.



Paperid:2586
Authors:Christina P. Walker, Daniel S. Schiff, Kaylyn Jackson Schiff
Purdue University, Purdue University, Purdue University
Abstract:
This article presents the Political Deepfakes Incidents Database (PDID), a collection of politically salient deepfakes, encompassing synthetically created videos, images, and less sophisticated 'cheapfakes.' The project is driven by the rise of generative AI in politics, ongoing policy efforts to address harms, and the need to connect AI incidents and political communication research. The database contains political deepfake content, metadata, and researcher-coded descriptors drawn from political science, public policy, communication, and misinformation studies. It aims to help reveal the prevalence, trends, and impact of political deepfakes, such as those featuring major political figures or events. The PDID can benefit policymakers, researchers, journalists, fact-checkers, and the public by providing insights into deepfake usage, aiding in regulation, enabling in-depth analyses, supporting fact-checking and trust-building efforts, and raising awareness of political deepfakes. It is suitable for research and application on media effects, political discourse, AI ethics, technology governance, media literacy, and countermeasures.



Paperid:2587
Authors:Matteo Baldoni, Cristina Baroglio, Monica Bucciarelli, Sara Capecchi, Elena Gandolfi, Cristina Gena, Francesco Ianì, Elisa Marengo, Roberto Micalizio, Amon Rapp, Ivan Nabil Ras
Università di Torino, Dipartimento di Informatica, Università di Torino, Dipartimento di Informatica, Università di Torino, Dipartimento di Psicologia Center for Logic, Language and Cognition, Università di Torino, Dipartimento di Informatica Laboratorio Informatica e Scuola CINI, Università di Torino, Dipartimento di Psicologia, Università di Torino, Dipartimento di Informatica, Università di Torino, Dipartimento di Psicologia Center for Logic, Language and Cognition, Università di Torino, Dipartimento di Informatica, Università di Torino, Dipartimento di Informatica, Università di Torino, Dipartimento di Informatica, Università di Torino, Dipartimento di Psicologia
Abstract:
Artificial Intelligence is undoubtedly becoming pervasive in the everyday life of everyone. In this setting, developing a correct conception of AI from childhood is not only a need to be addressed in educational curricula, but also a children's right. Accordingly, several initiatives at national and international levels aim to promote AI and emerging technology literacy, supported also by a proliferation in the literature of learning courses covering a variety of topics, learning objectives, and targeted ages. Schools are therefore pushed to introduce innovative activities for children into their curricula. In this paper, we report the results of a case study where we tested the contribution of a block-based AI course to developing computational thinking and an understanding of human and AI minds in fifth- and sixth-grade children.



Paperid:2588
Authors:Nancye Blair Black, Stacy George, Amy Eguchi, J. Camille Dempsey, Elizabeth Langran, Lucretia Fraga, Stein Brunvand, Nicol Howard
Teachers College Columbia University, University of Hawai'i at Mānoa, University of California San Diego, Pennsylvania Western University, Marymount University, University of the Incarnate Word, University of Michigan-Dearborn, University of Redlands
Abstract:
In recent years, the rapid advancement of artificial intelligence (AI) has fostered an urgent need to better prepare current and future educators to be able to integrate AI technologies in their teaching and to teach AI literacy to PreK-12 students. While many organizations have developed professional learning opportunities for in-service educators, a gap remains for resources specifically designed for those facilitating and enrolled in Educator Preparation Programs (EPPs). In response to this gap, the International Society for Technology in Education (ISTE) launched its first AI Explorations for EPPs Faculty Fellowship. As a result of the Faculty Fellows’ collaboration, this paper articulates a framework of seven critical strategies with the potential to address the urgent need EPPs have in preparing preservice teachers to effectively integrate AI-powered instructional tools and to teach this new area of content knowledge in PreK-12 classrooms. In addition, we provide a review of the literature and an overview of the emerging needs for integrating AI education in EPPs. We demonstrate why support for preservice teachers’ critical examination and application of AI, including a focus on the issues of equity, ethics, and culturally responsive teaching, is essential to their later success in PreK-12 classrooms. Recommendations for further research and learning are also provided to promote community-wide initiatives for supporting the integration of AI in education through Educator Preparation Programs and beyond.



Paperid:2589
Authors:Eric Eaton, Susan L. Epstein
University of Pennsylvania, Hunter College The Graduate Center of The City University of New York
Abstract:
Roughly every decade, the ACM and IEEE professional organizations have produced recommendations for the education of undergraduate computer science students. These guidelines are used worldwide by research universities, liberal arts colleges, and community colleges. For the latest 2023 revision of the curriculum, AAAI has collaborated with ACM and IEEE to integrate artificial intelligence more broadly into this new curriculum and to address the issues it raises for students, instructors, practitioners, policy makers, and the general public. This paper describes the development process and rationale that underlie the artificial intelligence components of the CS2023 curriculum, discusses the challenges in curriculum design for such a rapidly advancing field, and examines lessons learned during this three-year process.



Paperid:2590
Authors:Sabina Elkins, Ekaterina Kochmar, Jackie C. K. Cheung, Iulian Serban
McGill University & MILA Korbit Technologies Inc., MBZUAI Korbit Technologies Inc., McGill University & MILA Canada CIFAR AI Chair, Korbit Technologies Inc.
Abstract:
Question generation (QG) is a natural language processing task with an abundance of potential benefits and use cases in the educational domain. In order for this potential to be realized, QG systems must be designed and validated with pedagogical needs in mind. However, little research has assessed or designed QG approaches with the input of real teachers or students. This paper applies a large language model-based QG approach where questions are generated with learning goals derived from Bloom's taxonomy. The automatically generated questions are used in multiple experiments designed to assess how teachers use them in practice. The results demonstrate that teachers prefer to write quizzes with automatically generated questions, and that such quizzes have no loss in quality compared to handwritten versions. Further, several metrics indicate that automatically generated questions can even improve the quality of the quizzes created, showing the promise of large-scale use of QG in the classroom setting.



Paperid:2591
Authors:Anisha Gupta, Seung Lee, Bradford Mott, Srijita Chakraburty, Krista Glazewski, Anne Ottenbreit-Leftwich, Adam Scribner, Cindy E. Hmelo-Silver, James Lester
North Carolina State University, North Carolina State University, North Carolina State University, Indiana University, North Carolina State University, Indiana University, Indiana University, Indiana University, North Carolina State University
Abstract:
Artificial intelligence (AI) is quickly finding broad application in every sector of society. This rapid expansion of AI has increased the need to cultivate an AI-literate workforce, and it calls for introducing AI education into K-12 classrooms to foster students’ awareness and interest in AI. With rich narratives and opportunities for situated problem solving, story-driven game-based learning offers a promising approach for creating engaging and effective K-12 AI learning experiences. In this paper, we present our ongoing work to iteratively design, develop, and evaluate a story-driven game-based learning environment focused on AI education for upper elementary students (ages 8 to 11). The game features a science inquiry problem centering on an endangered species and incorporates a Use-Modify-Create scaffolding framework to promote student learning. We present findings from an analysis of data collected from 16 students playing the game's quest focused on AI planning. Results suggest that the scaffolding framework provided students with the knowledge they needed to advance through the quest and that, overall, students experienced positive learning outcomes.



Paperid:2592
Authors:Suren Jayasuriya, Kimberlee Swisher, Joshua D. Rego, Sreenithy Chandran, John Mativo, Terri Kurz, Cerenity E. Collins, Dawn T. Robinson, Ramana Pidaparti
Arizona State University, Arizona State University, Arizona State University, Arizona State University, University of Georgia, Arizona State University, University of Georgia, University of Georgia, University of Georgia
Abstract:
Artificial intelligence (AI) and its teaching in the K-12 grades has been championed as a vital need for the United States due to the technology's future prominence in the 21st century. However, there remain several barriers to effective AI lessons at these age groups, including the broad range of interdisciplinary knowledge needed and the lack of formal training or preparation for teachers to implement these lessons. In this experience report, we present ImageSTEAM, a teacher professional development program for creating lessons surrounding computer vision, machine learning, and computational photography/cameras targeted at middle school (grades 6-8) classes. Teacher professional development workshops were conducted in the states of Arizona and Georgia from 2021-2023, where lessons were co-created with teachers to introduce various specific visual computing concepts while aligning to state and national standards. In addition, a variety of computer vision and image processing software, including custom-designed Python notebooks, was created as technology activities and demonstrations to be used in the classroom. Educational research showed that teachers improved their self-efficacy and outcomes for concepts in computer vision, machine learning, and artificial intelligence when participating in the program. Results from the professional development workshops highlight key opportunities and challenges in integrating this content into the standard curriculum, the benefits of a co-creation pedagogy, and the positive impact on teachers' and students' learning experiences. The open-source program curriculum is available at www.imagesteam.org.



Paperid:2593
Authors:Robert Kasumba, Marion Neumman
Washington University in St. Louis, Washington University in St. Louis
Abstract:
Sentiment analysis provides a promising tool to automatically assess the emotions voiced in written student feedback, such as periodically collected unit-of-study reflections. The commonly used dictionary-based approaches are limited to major languages and fail to capture contextual differences. Pretrained large language models have been shown to be biased, and online versions raise privacy concerns. Hence, we resort to traditional supervised machine learning (ML) approaches which are designed to overcome these issues by learning from domain-specific labeled data. However, these labels are hard to come by -- in our case, manually annotating student feedback is prone to bias and time-consuming, especially in high-enrollment courses. In this work, we investigate the use of student-crowdsourced labels for supervised sentiment analysis in education. Specifically, we compare crowdsourced and student self-reported labels with human expert annotations and use them in various ML approaches to evaluate the performance of predicting emotions in written student feedback collected from large computer science classes. We find that the random forest model trained with student-crowdsourced labels tremendously improves the identification of reflections with negative sentiment. In addition to our quantitative study, we describe our crowdsourcing experiment, which was intentionally designed to be an educational activity in an introduction to data science course.
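A random forest trained on crowdsourced sentiment labels is the best-performing setup the abstract reports. Below is a minimal sketch of that general pipeline with TF-IDF features; the toy reflections, labels, and hyperparameters are illustrative assumptions, not the paper's data or configuration.

```python
# Sketch: supervised sentiment classification from crowdsourced labels (illustrative data).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

reflections = [
    "The recursion unit finally clicked for me this week.",
    "I am completely lost on dynamic programming and feel behind.",
    "Office hours were helpful but the homework load is stressful.",
]
crowd_labels = ["positive", "negative", "negative"]   # aggregated student votes (assumed)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),              # word and bigram features
    RandomForestClassifier(n_estimators=300, random_state=0),
)
model.fit(reflections, crowd_labels)
print(model.predict(["I really enjoyed the project, though the deadline was tight."]))
```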



Paperid:2594
Authors:Long Pham, Barry O'Sullivan, Teresa Scantamburlo, Tai Mai
Insight SFI Research Centre for Data Analytics, School of Computer Science & IT, University College Cork, Ireland, Insight SFI Research Centre for Data Analytics, School of Computer Science & IT, University College Cork, Ireland, Department of Environmental Science, Informatics and Statistics, Università Ca' Foscari, Italy, ADAPT Research Centre, School of Computing, Dublin City University, Ireland
Abstract:
As Artificial Intelligence (AI) continues to permeate various aspects of societies, understanding the disparities in AI knowledge and skills across different living areas becomes imperative. Small living areas have emerged as significant contributors to Europe's economy, offering an alternative to the bustling environment of larger cities for those seeking an improved quality of life. Nonetheless, they often encounter challenges related to digital infrastructure, access to financial resources, and digital skills gaps, limiting their economic and social growth prospects. This study investigates the digital and AI skills gaps in the context of small and large European living areas, shedding light on the potential hindrances to unleashing the full economic and social potential of these regions in an AI-enabled economy. Drawing from a comprehensive dataset encompassing 4,006 respondents across eight EU countries, this research examines the current perceptions and understandings of AI and digital skills within two distinct population groups: residents of smaller living areas and their counterparts in larger communities. Through bivariate analysis, notable insights are revealed concerning trust in AI solutions and entities, self-assessed digital skills, AI awareness, AI attitudes, and demographic variables in both population groups. These insights point to the significance of addressing digital and AI skills gaps in fostering growth and preparedness for the AI-driven future. As AI becomes increasingly integral to various aspects of society, targeted interventions and policies are essential to bridge these gaps and enable individuals and communities to harness the transformative potential of AI-enabled economies.



Paperid:2595
Authors:Yuxiang Qiu, Karim Djemili, Denis Elezi, Aaneel Shalman Srazali, María Pérez-Ortiz, Emine Yilmaz, John Shawe-Taylor, Sahan Bulathwela
University College London, University College London, University College London, University College London, University College London, University College London, University College London, University College London
Abstract:
With the advancement and utility of Artificial Intelligence (AI), personalising education to a global population could be a cornerstone of new educational systems in the future. This work presents the PEEKC dataset and the TrueLearn Python library, which together provide a dataset and a series of online learner state models that are essential to facilitate research on learner engagement modelling. The TrueLearn family of models was designed following the "open learner" concept, using humanly intuitive user representations. This family of scalable, online models also helps end-users visualise the learner models, which may in the future facilitate user interaction with their models/recommenders. The extensive documentation and coding examples make the library highly accessible to both machine learning developers and educational data mining and learning analytics practitioners. The experiments show the utility of both the dataset and the library, with predictive performance significantly exceeding comparative baseline models. The dataset contains a large number of AI-related educational videos, which are of interest for building and validating AI-specific educational recommenders.



Paperid:2596
Authors:Zhonghao Shi, Amy O'Connell, Zongjian Li, Siqi Liu, Jennifer Ayissi, Guy Hoffman, Mohammad Soleymani, Maja J. Matarić
University of Southern California, University of Southern California, University of Southern California, University of Southern California, University of Southern California, Cornell University, University of Southern California, University of Southern California
Abstract:
As artificial intelligence (AI) is playing an increasingly important role in our society and global economy, AI education and literacy have become necessary components in college and K-12 education to prepare students for an AI-powered society. However, current AI curricula have not yet been made accessible and engaging enough for students and schools from all socio-economic backgrounds with different educational goals. In this work, we developed an open-source learning module for college and high school students, which allows students to build their own robot companion from the ground up. This open platform can be used to provide hands-on experience and introductory knowledge about various aspects of AI, including robotics, machine learning (ML), software engineering, and mechanical engineering. Because of the social and personal nature of a socially assistive robot companion, this module also puts a special emphasis on human-centered AI, enabling students to develop a better understanding of human-AI interaction and AI ethics through hands-on learning activities. With open-source documentation, assembly manuals, and affordable materials, students from different socio-economic backgrounds can personalize their learning experience based on their individual educational goals. To evaluate the student-perceived quality of our module, we conducted a usability testing workshop with 15 college students recruited from a minority-serving institution. Our results indicate that our AI module is effective, easy-to-follow, and engaging, and that it increases student interest in studying AI/ML and robotics in the future. We hope that this work will contribute toward accessible and engaging AI education in human-AI interaction for college and high school students.



Paperid:2597
Authors:Benjamin Xie, Parth Sarin, Jacob Wolf, Raycelle C. C. Garcia, Victoria Delaney, Isabel Sieh, Anika Fuloria, Deepak Varuvel Dennison, Christine Bywater, Victor R. Lee
Stanford University, Stanford University, Harvard University, Stanford University, Stanford University, Stanford University, Stanford University, Stanford University, Stanford University, Stanford University
Abstract:
High school teachers from many disciplines have growing interest in teaching about artificial intelligence (AI). This cross-disciplinary interest reflects the prevalence of AI tools across society, such as Generative AI tools built upon Large Language Models (LLMs). However, high school classes are unique and complex environments, led by teachers with limited time and resources whose priorities vary by class and the students they serve. Therefore, developing curricula about AI for classes that span many disciplines (e.g. history, art, math) must involve centering the expertise of cross-disciplinary teachers. In this study, we conducted five collaborative curricular co-design sessions with eight teachers who taught high school humanities and STEM classes. We sought to understand how teachers considered AI when it was taught in art, math, and social studies contexts, as well as opportunities and challenges they identified with incorporating AI tools into their instruction. We found that teachers considered technical skills and ethical debates around AI, opportunities for "dual exploration" between AI and disciplinary learning, and limitations of AI tools as supporting engagement and reflection but also potentially distracting. We interpreted our findings relative to co-designing adaptable AI curricula to support teaching about and with AI across high school disciplines.



Paperid:2598
Authors:Zhenyu Xu, Victor S. Sheng
Texas Tech University, Texas Tech University
Abstract:
Large language models like ChatGPT can generate humanlike code, posing challenges for programming education as students may be tempted to misuse them on assignments. However, there are currently no robust detectors designed specifically to identify AI-generated code. This is an issue that needs to be addressed to maintain academic integrity while allowing proper utilization of language models. Previous work has explored different approaches to detect AI-generated text, including watermarks, feature analysis, and fine-tuning language models. In this paper, we address the challenge of determining whether a student's code assignment was generated by a language model. First, our proposed method identifies AI-generated code by leveraging targeted masking perturbation paired with comprehensive scoring. Rather than applying a random mask, areas of the code with higher perplexity are more intensely masked. Second, we utilize a fine-tuned CodeBERT to fill in the masked portions, producing subtly modified samples. Then, we integrate the overall perplexity, the variation of code-line perplexity, and burstiness into a unified score. In this scoring scheme, a higher rank for the original code suggests it is more likely to be AI-generated. This approach stems from the observation that AI-generated code typically has lower perplexity, so perturbations often exert minimal influence on it. Conversely, sections of human-composed code that the model struggles to understand can see their perplexity reduced by such perturbations. Our method outperforms current open-source and commercial text detectors. Specifically, it improves detection of code submissions generated by OpenAI's text-davinci-003, raising the average AUC from 0.56 (GPTZero baseline) to 0.87 for our detector.
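The perplexity-and-perturbation scheme described above can be pictured with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: it assumes GPT-2 as a stand-in scoring model, a hypothetical unified_score that combines overall perplexity with line-level variation, and a caller-supplied list of CodeBERT-perturbed variants.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in scoring model; the paper's choice of scoring model may differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the stand-in scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return math.exp(loss.item())

def unified_score(code: str, alpha: float = 0.5) -> float:
    """Hypothetical combination of overall perplexity and line-level variation."""
    lines = [l for l in code.splitlines() if l.strip()]
    line_ppl = [perplexity(l) for l in lines]
    spread = float(torch.tensor(line_ppl).std()) if len(line_ppl) > 1 else 0.0
    return perplexity(code) + alpha * spread

def looks_ai_generated(original: str, perturbed_variants: list[str]) -> bool:
    """Rank the original against CodeBERT-perturbed variants supplied by the caller;
    scoring at or below the median variant is read here as AI-generated, since
    perturbations barely change low-perplexity (machine-like) code."""
    scores = sorted(unified_score(v) for v in perturbed_variants)
    return unified_score(original) <= scores[len(scores) // 2]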



Paperid:2599
Authors:Garima Agrawal, Kuntal Pal, Yuli Deng, Huan Liu, Ying-Chih Chen
Arizona State University, Arizona State University, Arizona State University, Arizona State University, Arizona State University
Abstract:
Building a skilled cybersecurity workforce is paramount to building a safer digital world. However, the diverse skill set, constantly emerging vulnerabilities, and deployment of new cyber threats make learning cybersecurity challenging. Traditional education methods struggle to cope with cybersecurity's rapidly evolving landscape and to keep students engaged and motivated. Different studies on students' behaviors show that an interactive mode of education, such as engaging through a question-answering system or dialogue, is one of the most effective learning methodologies. There is a strong need to create advanced AI-enabled education tools to promote interactive learning in cybersecurity. Unfortunately, there are no publicly available standard question-answer datasets to build such systems for students and novice learners to learn cybersecurity concepts, tools, and techniques. The education course material and online question banks are unstructured and need to be validated and updated by domain experts, which is tedious when done manually. In this paper, we propose CyberGen, a novel unification of large language models (LLMs) and knowledge graphs (KG) to generate questions and answers for cybersecurity automatically. Augmenting the structured knowledge from knowledge graphs in prompts improves factual reasoning and reduces hallucinations in LLMs. We used the knowledge triples from cybersecurity knowledge graphs (AISecKG) to design prompts for ChatGPT and generate questions and answers using different prompting techniques. Our question-answer dataset, CyberQ, contains around 4k pairs of questions and answers. A domain expert manually evaluated random samples for consistency and correctness. We train a generative model using the CyberQ dataset for the question-answering task.
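As a rough illustration of the prompting step, the sketch below formats a few illustrative AISecKG-style triples into a prompt and sends it to an LLM. The triples, the instruction wording, and the use of the OpenAI chat API with the gpt-3.5-turbo model are assumptions for the sketch, not the paper's exact setup.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative AISecKG-style triples (not taken from the actual graph).
triples = [
    ("nmap", "is_a", "network scanning tool"),
    ("nmap", "used_for", "port scanning"),
]

def build_prompt(triples):
    facts = "\n".join(f"- {h} {r.replace('_', ' ')} {t}" for h, r, t in triples)
    return ("You are a cybersecurity instructor. Using only the facts below, "
            "write one question and its answer for a novice learner.\n"
            f"Facts:\n{facts}\nFormat: Q: ... A: ...")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # stand-in; the paper reports using ChatGPT
    messages=[{"role": "user", "content": build_prompt(triples)}],
)
print(response.choices[0].message.content)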



Paperid:2600
Authors:Li-Hsin Chang, Filip Ginter
University of Turku, University of Turku
Abstract:
Automatic short answer grading (ASAG) seeks to mitigate the burden on teachers by leveraging computational methods to evaluate student-constructed text responses. Large language models (LLMs) have recently gained prominence across diverse applications, with educational contexts being no exception. The sudden rise of ChatGPT has raised expectations that LLMs can handle numerous tasks, including ASAG. This paper aims to shed some light on this expectation by evaluating two LLM-based chatbots, namely ChatGPT built on GPT-3.5 and GPT-4, on scoring short-answer questions under zero-shot and one-shot settings. Our data consist of 2000 student answers in Finnish from ten undergraduate courses. Multiple perspectives are taken into account during this assessment, encompassing those of grading system developers, teachers, and students. On our dataset, GPT-4 achieves a good QWK score (0.6+) in 44% of one-shot settings, clearly outperforming GPT-3.5 at 21%. We observe a negative association between student answer length and model performance, as well as a correlation between a smaller standard deviation among a set of predictions and lower performance. We conclude that while GPT-4 exhibits signs of being a capable grader, additional research is essential before considering its deployment as a reliable autograder.
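For readers unfamiliar with the QWK metric cited above, the following minimal sketch computes quadratic weighted kappa with scikit-learn on made-up grades; the numbers are illustrative only and have no connection to the paper's data.

from sklearn.metrics import cohen_kappa_score

teacher_scores = [0, 1, 2, 2, 3, 1, 0, 3]  # made-up human grades
model_scores   = [0, 1, 2, 3, 3, 1, 1, 2]  # made-up LLM-predicted grades

# Quadratic weighted kappa (QWK): agreement that penalises large disagreements more.
qwk = cohen_kappa_score(teacher_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.2f}")  # values of 0.6+ are treated as "good" agreement above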



Paperid:2601
Authors:Clayton Cohn, Nicole Hutchins, Tuan Le, Gautam Biswas
Vanderbilt University, Vanderbilt University, DePauw University, Vanderbilt University
Abstract:
This paper explores the use of large language models (LLMs) to score and explain short-answer assessments in K-12 science. While existing methods can score more structured math and computer science assessments, they often do not provide explanations for the scores. Our study focuses on employing GPT-4 for automated assessment in middle school Earth Science, combining few-shot and active learning with chain-of-thought reasoning. Using a human-in-the-loop approach, we successfully score and provide meaningful explanations for formative assessment responses. A systematic analysis of our method's pros and cons sheds light on the potential for human-in-the-loop techniques to enhance automated grading for open-ended science assessments.



Paperid:2602
Authors:Fahmid Morshed Fahid, Jonathan Rowe, Yeojin Kim, Shashank Srivastava, James Lester
North Carolina State University, North Carolina State University, North Carolina State University, University of North Carolina Chapel Hill, North Carolina State University
Abstract:
Pedagogical planners can provide adaptive support to students in narrative-centered learning environments by dynamically scaffolding student learning and tailoring problem scenarios. Reinforcement learning (RL) is frequently used for pedagogical planning in narrative-centered learning environments. However, RL-based pedagogical planning raises significant challenges due to the scarcity of data for training RL policies. Most prior work has relied on limited-size datasets and offline RL techniques for policy learning. Unfortunately, offline RL techniques do not support on-demand exploration and evaluation, which can adversely impact the quality of induced policies. To address the limitation of data scarcity and offline RL, we propose INSIGHT, an online RL framework for training data-driven pedagogical policies that optimize student learning in narrative-centered learning environments. The INSIGHT framework consists of three components: a narrative-centered learning environment simulator, a simulated student agent, and an RL-based pedagogical planner agent, which uses a reward metric that is associated with effective student learning processes. The framework enables the generation of synthetic data for on-demand exploration and evaluation of RL-based pedagogical planning. We have implemented INSIGHT with OpenAI Gym for a narrative-centered learning environment testbed with rule-based simulated student agents and a deep Q-learning-based pedagogical planner. Our results show that online deep RL algorithms can induce near-optimal pedagogical policies in the INSIGHT framework, while offline deep RL algorithms only find suboptimal policies even with large amounts of data.
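The simulator component can be pictured with a toy Gym-style environment. The sketch below is a minimal illustration under assumed placeholder state, action, and reward definitions; it uses the gymnasium package as a stand-in for OpenAI Gym and is in no way the INSIGHT testbed. An off-the-shelf deep Q-learning agent could then be trained against it in the usual observation-action-reward loop.

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ToyNarrativeEnv(gym.Env):
    """Toy simulator: planner actions are 0 = give hint, 1 = harder scenario, 2 = wait."""
    def __init__(self):
        self.action_space = spaces.Discrete(3)
        # Observation: simulated student's (knowledge, frustration), both in [0, 1].
        self.observation_space = spaces.Box(0.0, 1.0, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.array([0.1, 0.2], dtype=np.float32)
        self.t = 0
        return self.state, {}

    def step(self, action):
        knowledge, frustration = self.state
        if action == 0:    # hint: small learning gain, lowers frustration
            knowledge, frustration = knowledge + 0.05, frustration - 0.05
        elif action == 1:  # harder scenario: larger gain, raises frustration
            knowledge, frustration = knowledge + 0.10, frustration + 0.10
        self.state = np.clip([knowledge, frustration], 0.0, 1.0).astype(np.float32)
        self.t += 1
        reward = float(self.state[0] - 0.5 * self.state[1])  # proxy for learning
        return self.state, reward, self.t >= 20, False, {}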



Paperid:2603
Authors:Shreyansh Gupta, Abhishek Unnam, Kuldeep Yadav, Varun Aggarwal
SHL Labs, Gurugram, India, SHL Labs, Gurugram, India, SHL Labs, Gurugram, India, SHL Labs, Gurugram, India
Abstract:
Automatic speech scoring is crucial in language learning, providing targeted feedback to language learners by assessing pronunciation, fluency, and other speech qualities. However, the scarcity of human-labeled data for languages beyond English poses a significant challenge in developing such systems. In this work, we propose a Language-Independent scoring approach to evaluate speech without relying on labeled data in the target language. We introduce a multilingual speech scoring system that leverages representations from the wav2vec 2.0 XLSR model and a force-alignment technique based on CTC-Segmentation to construct speech features. These features are used to train a machine learning model to predict pronunciation and fluency scores. We demonstrate the potential of our method by predicting expert ratings on a speech dataset spanning five languages (English, French, Spanish, German, and Portuguese) and comparing its performance against Language-Specific models trained individually on each language, as well as a jointly-trained model on all languages. Results indicate that our approach shows promise as an initial step towards universal, language-independent speech scoring.
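To make the feature-construction step concrete, the sketch below mean-pools wav2vec 2.0 XLSR hidden states into utterance embeddings and fits a simple regressor on expert scores. The checkpoint name, the Ridge regressor, and the toy data are assumptions for illustration; the paper's system additionally uses CTC-Segmentation-based alignment features.

import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

checkpoint = "facebook/wav2vec2-large-xlsr-53"  # stand-in XLSR checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
encoder = Wav2Vec2Model.from_pretrained(checkpoint).eval()

def utterance_embedding(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()      # mean-pool over time

# Placeholder training data: random audio and invented expert fluency ratings.
X = np.stack([utterance_embedding(np.random.randn(16000).astype(np.float32))
              for _ in range(8)])
y = np.array([3.0, 4.0, 2.5, 3.5, 4.5, 2.0, 3.0, 4.0])
scorer = Ridge().fit(X, y)
print(scorer.predict(X[:1]))  # predicted fluency score for the first utterance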



Paperid:2604
Authors:Jay Mahajan, Samuel Hum, Jack Henhapl, Diya Yunus, Matthew Gadbury, Emi Brown, Jeff Ginger, H. Chad Lane
University of Illinois - Urbana Champaign, University of Illinois - Urbana Champaign, University of Illinois - Urbana Champaign, University of Illinois - Urbana Champaign, University of Illinois - Urbana Champaign, University of Illinois - Urbana Champaign, University of Illinois - Urbana Champaign, University of Illinois - Urbana Champaign
Abstract:
MineObserver 2.0 is an AI framework that uses Computer Vision and Natural Language Processing for assessing the accuracy of learner-generated descriptions of Minecraft images that include some scientifically relevant content. The system automatically assesses the accuracy of participant observations, written in natural language, made during science learning activities that take place in Minecraft. We demonstrate our system working in real-time and describe a teacher dashboard to showcase observations, both of which advance our previous work. We present the results of a study showing that MineObserver 2.0 improves over its predecessor both in perceived accuracy of the system's generated descriptions as well as in usefulness of the system's feedback. In future work, we intend to improve system-generated descriptions, give teachers more control, and shift the system toward continuous learning so that it responds more rapidly to novel observations made by learners.



Paperid:2605
Authors:Chancharik Mitra, Mihran Miroyan, Rishi Jain, Vedant Kumud, Gireeja Ranade, Narges Norouzi
University of California, Berkeley, University of California, Berkeley, University of California, Berkeley, University of California, Berkeley, University of California, Berkeley, University of California, Berkeley
Abstract:
This paper focuses on using Large Language Models to support teaching assistants in answering questions on large student forums such as Piazza and EdSTEM. Since student questions on these forums are often closely tied to specific aspects of the institution, instructor, and course delivery, general-purpose LLMs do not directly do well on this task. We introduce RetLLM-E, a method that combines text-retrieval and prompting approaches to enable LLMs to provide precise and high-quality answers to student questions. When presented with a student question, our system initiates a two-step process. First, it retrieves relevant context from (i) a dataset of student questions addressed by course instructors (Q&A Retrieval) and (ii) relevant segments of course materials (Document Retrieval). RetLLM-E then prompts the LLM using the retrieved text and an engineered prompt structure to yield an answer optimized for the student question. We present a set of quantitative and human evaluation experiments, comparing our method to ground-truth answers to questions in a test set of actual student questions. Our results demonstrate that our approach provides higher-quality responses to course-related questions than an LLM operating without context or relying solely on retrieval-based context. RetLLM-E can easily be adopted in different courses, providing instructors and students with context-aware automatic responses.
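The two-step retrieve-then-prompt idea can be sketched as follows. This is a minimal illustration, not the authors' code: the embedding model (all-MiniLM-L6-v2), the tiny Q&A and document stores, and the prompt wording are all assumptions.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

qa_store = ["Q: When is HW2 due? A: Friday at 5pm.",
            "Q: Can we use numpy? A: Yes, numpy is allowed."]
doc_store = ["Syllabus: late work loses 10% per day.",
             "Lecture 3 covers dynamic programming."]

def top1(query: str, corpus: list[str]) -> str:
    """Return the corpus entry most similar to the query (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)
    c = embedder.encode(corpus, normalize_embeddings=True)
    return corpus[int(np.argmax(c @ q.T))]

question = "Is there a penalty if I submit HW2 a day late?"
prompt = ("Answer the student using only the context below.\n"
          f"Past Q&A: {top1(question, qa_store)}\n"
          f"Course material: {top1(question, doc_store)}\n"
          f"Student question: {question}\nAnswer:")
print(prompt)  # this prompt would then be sent to the LLM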



Paperid:2606
Authors:Hadar Mulian, Segev Shlomov, Lior Limonad, Alessia Noccaro, Silvia Buscaglione
IBM Research - Israel, IBM Research - Israel, IBM Research - Israel, Neurorobotics Lab, School of Engineering, Newcastle University, Newcastle Upon Tyne, United Kingdom, NEXT Lab, Università Campus Bio-Medico di Roma, Rome, Italy
Abstract:
Motor skills, especially fine motor skills like handwriting, play an essential role in academic pursuits and everyday life. Traditional methods to teach these skills, although effective, can be time-consuming and inconsistent. With the rise of advanced technologies like robotics and artificial intelligence, there is increasing interest in automating such teaching processes. In this study, we examine the potential of a virtual AI teacher in emulating the techniques of human educators for motor skill acquisition. We introduce an AI teacher model that captures the distinct characteristics of human instructors. Using a reinforcement learning environment tailored to mimic teacher-learner interactions, we tested our AI model against four guiding hypotheses, emphasizing improved learner performance, enhanced rate of skill acquisition, and reduced variability in learning outcomes. Our findings, validated on synthetic learners, revealed significant improvements across all tested hypotheses. Notably, our model showcased robustness across different learners and settings and demonstrated adaptability to handwriting. This research underscores the potential of integrating Imitation and Reinforcement Learning models with robotics in revolutionizing the teaching of critical motor skills.



Paperid:2607
Authors:Lin Ni, Sijie Wang, Zeyu Zhang, Xiaoxuan Li, Xianda Zheng, Paul Denny, Jiamou Liu
Huazhong Agricultural University; The University of Auckland, The University of Auckland, Huazhong Agricultural University, The University of Auckland, The University of Auckland, The University of Auckland, The University of Auckland
Abstract:
Learnersourcing offers great potential for scalable education through student content creation. However, predicting student performance on learnersourced questions, which is essential for personalizing the learning experience, is challenging due to the inherent noise in student-generated data. Moreover, while conventional graph-based methods can capture the complex network of student and question interactions, they often fall short under cold-start conditions where limited student engagement with questions yields sparse data. To address both challenges, we introduce an innovative strategy that synergizes the potential of integrating Signed Graph Neural Networks (SGNNs) and Large Language Model (LLM) embeddings. Our methodology employs a signed bipartite graph to comprehensively model student answers, complemented by a contrastive learning framework that enhances noise resilience. Furthermore, the LLM's contribution lies in generating foundational question embeddings, proving especially advantageous in addressing cold-start scenarios characterized by limited graph data. Validation across five real-world datasets sourced from the PeerWise platform underscores our approach's effectiveness. Our method outperforms baselines, showcasing enhanced predictive accuracy and robustness.



Paperid:2608
Authors:Zefang Yu, Mingye Xie, Jingsheng Gao, Ting Liu, Yuzhuo Fu
Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Understanding student behavior in educational settings is critical in improving both the quality of pedagogy and the level of student engagement. While various AI-based models exist for classroom analysis, they tend to specialize in limited tasks and lack generalizability across diverse educational environments. Additionally, these models often fall short in ensuring student privacy and in providing actionable insights accessible to educators. To bridge this gap, we introduce a unified, end-to-end framework by leveraging temporal action detection techniques and advanced large language models for a more nuanced student behavior analysis. Our proposed framework provides an end-to-end pipeline that starts with raw classroom video footage and culminates in the autonomous generation of pedagogical reports. It offers a comprehensive and scalable solution for student behavior analysis. Experimental validation confirms the capability of our framework to accurately identify student behaviors and to produce pedagogically meaningful insights, thereby setting the stage for future AI-assisted educational assessments.



Paperid:2609
Authors:Zhengdong Zhang, Zihan Dong, Yang Shi, Thomas Price, Noboru Matsuda, Dongkuan Xu
Georgia Institute of Technology, North Carolina State University, North Carolina State University, North Carolina State University, North Carolina State University, North Carolina State University
Abstract:
The rapid evolution of artificial intelligence (AI), specifically large language models (LLMs), has opened opportunities for various educational applications. This paper explored the feasibility of utilizing ChatGPT, one of the most popular LLMs, for automating feedback for Java programming assignments in an introductory computer science (CS1) class. Specifically, this study focused on three questions: 1) To what extent do students view LLM-generated feedback as formative? 2) How do students see the comparative affordances of feedback prompts that include their code, vs. those that exclude it? 3) What enhancements do students suggest for improving LLM-generated feedback? To address these questions, we generated automated feedback using the ChatGPT API for four lab assignments in a CS1 class. The survey results revealed that students perceived the feedback as aligning well with the formative feedback guidelines established by Shute. Additionally, students showed a clear preference for feedback generated by including the students' code as part of the LLM prompt, and our thematic study indicated that the preference was mainly attributed to the specificity, clarity, and corrective nature of the feedback. Moreover, this study found that students generally expected specific and corrective feedback with sufficient code examples, but had divergent opinions on the tone of the feedback. This study demonstrated that ChatGPT could generate Java programming assignment feedback that students perceived as formative. It also offered insights into the specific improvements that would make the ChatGPT-generated feedback useful for students.



Paperid:2610
Authors:Safinah Ali, Prerna Ravi, Katherine Moore, Hal Abelson, Cynthia Breazeal
Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology
Abstract:
Text-to-image generation (TTIG) technologies are Artificial Intelligence (AI) algorithms that combine natural language algorithms with visual generative algorithms. TTIG tools have gained popularity in recent months, garnering interest from non-AI experts, including educators and K-12 students. While they have exciting creative potential when used by K-12 learners and educators for creative learning, they are also accompanied by serious ethical implications, such as data privacy, spreading misinformation, and algorithmic bias. Given the potential learning applications, social implications, and ethical concerns, we designed 6-hour learning materials to teach K-12 teachers from diverse subject expertise about the technical implementation, classroom applications, and ethical implications of TTIG algorithms. We piloted the learning materials, titled “Demystify text-to-image generative tools for K-12 educators", with 30 teachers across two workshops with the goal of preparing them to teach about and use TTIG tools in their classrooms. We found that teachers demonstrated a technical, applied, and ethical understanding of TTIG algorithms and successfully designed prototypes of teaching materials for their classrooms.



Paperid:2611
Authors:Safinah Ali, Prerna Ravi, Randi Williams, Daniella DiPaola, Cynthia Breazeal
Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology
Abstract:
Generative AI tools introduce new and accessible forms of media creation for youth. They also raise ethical concerns about the generation of fake media, data protection, privacy, and ownership of AI-generated art. Since generative AI is already being used in products used by youth, it is critical that they understand how these tools work and how they can be used or misused. In this work, we facilitated students’ generative AI learning through expression of their imagined future identities. We designed a learning workshop - Dreaming with AI - where students learned about the inner workings of generative AI tools, used text-to-image generation algorithms to create their imagined future dreams, reflected on the potential benefits and harms of generative AI tools, and voiced their opinions about policies for the use of these tools in classrooms. In this paper, we present the learning activities and experiences of 34 high school students who engaged in our workshops. Students reached creative learning objectives by using prompt engineering to create their future dreams, gained technical knowledge by learning the abilities, limitations, text-visual mappings, and applications of generative AI, and identified most of the potential societal benefits and harms of generative AI.



Paperid:2612
Authors:Elham Buxton, Elahe Javadi, Matthew Hagaman
University of Illinois at Springfield, Illinois State University, Illinois State University
Abstract:
A few states (e.g., Maryland, Georgia, and Florida) have initiated efforts to incorporate artificial intelligence outcomes in K-12 education, but others are still relying on informal spaces for learning and literacy in this area. In this manuscript, we share the curriculum and content of an informal effort focused on students in grades 7-10. We combined artificial intelligence competencies with Internet of Things skills to enable meaningful learning covering all Five Big Ideas in AI. In our one-week summer camp, students experimented with perception by working with vision, infrared, and ultrasonic sensors. They learned about representation through work with neural network playgrounds. Students engaged in supervised learning of an image processing model and used the model to control the actions of a robot car. Natural interactions and societal impacts were assessed as students observed the robot car's behavior. Results demonstrate that our curriculum was successful in achieving its objectives. Excluding the robot car kit, the curriculum was created using free platforms and tools. This program could be replicated in informal settings by any educator or collaborator with a computer science background. This paper describes our summer camp curriculum, its components and their implementation, the lessons learned, and potential future enhancements.



Paperid:2613
Authors:Hansol Lim, Wookhee Min, Jessica Vandenberg, Veronica Cateté, Bradford Mott
North Carolina State University, North Carolina State University, North Carolina State University, North Carolina State University, North Carolina State University
Abstract:
With the growing prevalence of AI, the need for K-12 AI education is becoming more crucial, which is prompting active research in developing engaging and age-appropriate AI learning activities. Efforts are underway, such as those by the AI4K12 initiative, to establish guidelines for organizing K-12 AI education; however, effective instructional resources are needed by educators. In this paper, we describe our work to design, develop, and implement an unplugged activity centered on facial recognition technology for middle school students. Facial recognition is integrated into a wide range of applications throughout daily life, which makes it a familiar and engaging tool for students and an effective medium for conveying AI concepts. Our unplugged activity, “Guess Whose Face,” is designed as a board game that focuses on Representation and Reasoning from AI4K12’s 5 Big Ideas in AI. The game is crafted to enable students to develop AI competencies naturally through physical interaction. In the game, one student uses tracing paper to extract facial features from a familiar face shown on a card, such as a cartoon character or celebrity, and then other students try to guess the identity of the hidden face. We discuss details of the game, its iterative refinement, and initial findings from piloting the activity during a summer camp for rural middle school students.



Paperid:2614
Authors:Elizabeth Radday, Matt Mervis
EdAdvance, EdAdvance
Abstract:
Generative artificial intelligence (AI) is swiftly cementing its role as an indispensable tool for students transitioning from K-12 to higher education and professional spheres. Yet, harnessing its full potential requires more than mere familiarity. Students must be equipped with the skills to engage with AI both productively and ethically. Left unchecked, AI usage can pose risks, especially if students lack proper guidance or understanding of their actions. Moreover, effective interaction with AI necessitates skills in prompt engineering to yield desired outcomes. Sidekick Academy is a digital online platform where students can safely experiment with and learn about AI. This article delves into the genesis of Sidekick Academy, offering a glimpse into its lessons on how to use AI and the complex debate on ethical use. It also sheds light on the academy's "sandbox" - a secure space for students to explore AI without jeopardizing their safety or privacy.



Paperid:2615
Authors:Deepak Varuvel Dennison, Raycelle C. C. Garcia, Parth Sarin, Jacob Wolf, Christine Bywater, Benjamin Xie, Victor R. Lee
Stanford University, Stanford University, Stanford University, Harvard University, Stanford University, Stanford University, Stanford University
Abstract:
In an age where Large Language Models (LLMs) expedite the generation of text, the skills for critically evaluating and creating meaningful text using these models are often lacking. To help classroom teachers address this, we introduce Prompty, a specialized teaching tool co-designed to facilitate both critical and effective use of LLMs. Prompty serves multiple learning goals: it allows students to critically evaluate text generated by LLMs, aids in their writing practice, and provides a deeper understanding of how LLMs function—all within a student-friendly environment secured by essential guardrails. Prompty was co-designed in collaboration with high school teachers as part of CRAFT, an initiative by Stanford University to promote AI literacy. It was pilot-tested in a high school English class to serve as an AI writing assistant, focusing on the critical evaluation of machine-generated text. This trial yielded preliminary evidence that attests to the tool's effectiveness in fulfilling its educational goals. The findings from the pilot study indicate that easy-to-use tools like Prompty have great potential. These tools can be adapted to fit the goals of individual teachers. They can help in achieving subject-specific learning goals while serving as an effective way to teach AI concepts in high school.



Paperid:2616
Authors:Randi Williams, Sharifa Alghowinem, Cynthia Breazeal
Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology
Abstract:
Artificial Intelligence (AI) is permeating almost every area of society, reshaping how many people, including youth, navigate the world. Despite the increased presence of AI, most people lack a baseline knowledge of how AI works. Moreover, social barriers often hinder equal access to AI courses, perpetuating disparities in participation in the field. To address this, it is crucial to design AI curricula that are effective, inclusive, and relevant, especially to learners from backgrounds that are historically excluded from working in tech. In this paper, we present AI for Wellbeing, a curriculum where students explore conversational AI and the ethical considerations around using it to promote wellbeing. We specifically designed content, educator materials, and educational technologies to meet the interests and needs of students and educators from diverse backgrounds. We piloted AI for Wellbeing in a 5-day virtual workshop with middle school teachers and students. Then, using a mixed-methods approach, we analyzed students' work and teachers' feedback. Our results suggest that the curriculum content and design effectively engaged students, enabling them to implement meaningful AI projects for wellbeing. We hope that the design of this curriculum and insights from our evaluation will inspire future efforts to create culturally relevant K-12 AI curricula.



Paperid:2617
Authors:Helen Zhang, Irene Lee, Katherine Moore
Boston College, Massachusetts Institute of Technology, Massachusetts Institute of Technology
Abstract:
Artificial intelligence (AI) has rapidly pervaded and reshaped almost all walks of life, but efforts to promote AI literacy in K-12 schools remain limited. There is a knowledge gap in how to prepare teachers to teach AI literacy in inclusive classrooms and how teacher-led classroom implementations can impact students. This paper reports a comparison study to investigate the effectiveness of an AI literacy curriculum when taught by classroom teachers. The experimental group included 89 middle school students who learned an AI literacy curriculum during regular school hours. The comparison group consisted of 69 students who did not learn the curriculum. Both groups completed the same pre- and post-test. The results show that students in the experimental group developed a deeper understanding of AI concepts and more positive attitudes toward AI and its impact on future careers after the curriculum than those in the comparison group. This shows that the teacher-led classroom implementation successfully equipped students with a conceptual understanding of AI. Students achieved significant gains in recognizing how AI is relevant to their lives and felt empowered to thrive in the age of AI. Overall, this study confirms the potential of preparing K-12 classroom teachers to offer AI education in classrooms in order to reach learners of diverse backgrounds and broaden participation in AI literacy education among young learners.



Paperid:2618
Authors:Grisha Bandodkar, Shyam Agarwal, Athul Krishna Sughosh, Sahilbir Singh, Taeyeong Choi
University of California, Davis, University of California, Davis, University of California, Davis, University of California, Davis, Kennesaw State University
Abstract:
The proliferation of Automatic Speech Recognition (ASR) systems has revolutionized translation and transcription. However, challenges persist in ensuring inclusive communication for non-native English speakers. This study quantifies the gap between accented and native English speech using Wav2Vec 2.0, a state-of-the-art transformer model. Notably, we found that accented speech exhibits significantly higher word error rates of 30-50%, in contrast to native speakers’ 2-8% (Baevski et al. 2020). Our exploration extends to leveraging accessible online datasets to highlight the potential of enhancing speech recognition by fine-tuning the Wav2Vec 2.0 model. Through experimentation and analysis, we highlight the challenges with training models on accented speech. By refining models and addressing data quality issues, our work presents a pipeline for future investigations aimed at developing an integrated system capable of effectively engaging with a broader range of individuals with diverse backgrounds. Accurate recognition of accented speech is a pivotal step toward democratizing AI-driven communication products.
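For reference, word error rates like those quoted above can be computed with the jiwer package; the reference and hypothesis transcripts below are made up purely for illustration and are not from the study's data.

import jiwer

reference    = "please turn on the kitchen lights"
native_hyp   = "please turn on the kitchen lights"
accented_hyp = "please turn of the kitchen light"  # illustrative ASR errors

print("native WER  :", jiwer.wer(reference, native_hyp))    # 0.0 for a perfect transcript
print("accented WER:", jiwer.wer(reference, accented_hyp))  # higher: 2 substitutions / 6 words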



Paperid:2619
Authors:Satyam Goyal, Kavya Sasikumar, Rohan Sheth, Akash Seelam, Taeyeong Choi, Xin Liu
University of Michigan, Ann Arbor, University of California, Davis, University of California, Davis, University of California, Davis, Kennesaw State University, University of California, Davis
Abstract:
Individuals with color vision deficiencies (CVDs) often face significant challenges in accessing vital information for decision-making. In response, we introduce EnColor—a deep Encoder-decoder Color corrector for images, enabling individuals with CVDs to perceive content in its originally intended colorization. Our network architecture is designed to effectively capture essential visual features for reconstructing standard images into color-corrected versions. In particular, our training pipeline is integrated with a CVD simulator so as to ensure the fidelity of the output through the lens of individuals with impaired color vision. For evaluation, we focus primarily on tomato images, considering the profound impact of color vision deficiencies on practical domains like agri-food systems. Our quantitative results demonstrate that the EnColor model achieves over 16.8% improvement over previously introduced algorithms in terms of color retention, supporting our design choices. Furthermore, a survey with 43 participants provides subjective assessments, with our method receiving the highest scores. Additionally, specific visual examples are presented to highlight accurately restored colors. We also publicly share all our code for EnColor as well as the baseline methods to ensure reproducibility and facilitate more studies in CVD correction.
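The role of the CVD simulator in training can be sketched in PyTorch as below. This is a minimal illustration under stated assumptions: a tiny stand-in convolutional corrector and a placeholder 3x3 per-pixel simulation matrix, not the EnColor architecture or its actual simulator.

import torch
import torch.nn as nn

corrector = nn.Sequential(              # stand-in for the encoder-decoder corrector
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
)

# Placeholder 3x3 matrix applied per pixel to mimic a dichromat's view.
cvd_matrix = torch.tensor([[0.567, 0.433, 0.000],
                           [0.558, 0.442, 0.000],
                           [0.000, 0.242, 0.758]])

def simulate_cvd(img: torch.Tensor) -> torch.Tensor:
    # img: (B, 3, H, W); mix channels with the simulation matrix.
    return torch.einsum("ij,bjhw->bihw", cvd_matrix, img)

optimizer = torch.optim.Adam(corrector.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(5):                       # toy loop on random stand-in images
    original = torch.rand(4, 3, 64, 64)
    corrected = corrector(original)
    perceived = simulate_cvd(corrected)  # what a CVD viewer would see
    loss = loss_fn(perceived, original)  # push the perceived image toward the original
    optimizer.zero_grad(); loss.backward(); optimizer.step()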



Paperid:2620
Authors:Furong Jia, Kevin Wang, Yixiang Zheng, Defu Cao, Yan Liu
University of Southern California, University of Southern California, University of Southern California, University of Southern California, University of Southern California
Abstract:
Time series forecasting is an essential area of machine learning with a wide range of real-world applications. Most of the previous forecasting models aim to capture dynamic characteristics from uni-modal numerical historical data. Although extra knowledge can boost time series forecasting performance, it is hard to collect such information. In addition, how to fuse the multimodal information is non-trivial. In this paper, we first propose a general principle for collecting the corresponding textual information from different data sources with the help of modern large language models (LLMs). Then, we propose a prompt-based LLM framework, named GPT4MTS, to utilize both the numerical data and the textual information simultaneously. In practice, we propose a GDELT-based multimodal time series dataset for news impact forecasting, which provides a concise and well-structured time series dataset with textual information for further research. Through extensive experiments, we demonstrate the effectiveness of our proposed method on forecasting tasks with extra textual information.



Paperid:2621
Authors:Adilson Medronha, Luís Lima, Janaína Claudio, Lucas Kupssinskü, Rodrigo C. Barros
Pontifical Catholic University of Rio Grande do Sul, Pontifical Catholic University of Rio Grande do Sul, Pontifical Catholic University of Rio Grande do Sul, Pontifical Catholic University of Rio Grande do Sul, Pontifical Catholic University of Rio Grande do Sul
Abstract:
Sign language is a visual and gestural communication system used by deaf and hearing-impaired people. Despite numerous deep learning methods proposed for automatic interpretation, a gap persists in developing applications that effectively utilize these models for assisting sign language studies and inclusion. We introduce LERMO (https://lermo.app/), a web game merging machine learning and gamification to enhance sign language fingerspelling. Inspired by Wordle™, LERMO offers an interactive word-guessing game where users can play using a video camera. We create a new dataset of labeled fingerspelling landmarks and design our model to ensure optimal speed and efficiency to run in a web browser. We surveyed approximately 40 users, who found LERMO user-friendly and innovative. Of those, 95% believe LERMO could be used to enhance fingerspelling skills.



Paperid:2622
Authors:Hoang Nhat Khang Vo, Duc Dong Le, Tran Minh Dat Phan, Tan Sang Nguyen, Quoc Nguyen Pham, Ngoc Oanh Tran, Quang Duc Nguyen, Tran Minh Hieu Vo, Tho Quan
Ho Chi Minh University of Technology, Ho Chi Minh University of Technology, Ho Chi Minh University of Technology, Ho Chi Minh University of Technology, Ho Chi Minh University of Technology, Ho Chi Minh University of Technology, Ho Chi Minh University of Technology, Ho Chi Minh University of Technology, Ho Chi Minh University of Technology
Abstract:
The Bahnar, a minority ethnic group in Vietnam with ancient roots, hold a language of deep cultural and historical significance. The government is prioritizing the preservation and dissemination of the Bahnar language through online availability and cross-generational communication. Recent AI advances, including Neural Machine Translation (NMT), have transformed translation with improved accuracy and fluency, fostering language revitalization through learning, communication, and documentation. In particular, NMT enhances accessibility for Bahnar language speakers, making information and content more available. However, translating Vietnamese to Bahnar faces practical hurdles due to resource limitations, particularly because Bahnar is an extremely low-resource language. These challenges encompass data scarcity, vocabulary constraints, and a lack of fine-tuning data. To address these, we propose transfer learning from selected pre-trained models to optimize translation quality and computational efficiency, capitalizing on linguistic similarities between Vietnamese and Bahnar. Concurrently, we apply tailored augmentation strategies to adapt machine translation to the Vietnamese-Bahnar context. Our approach is validated through superior results on bilingual Vietnamese-Bahnar datasets when compared to baseline models. By tackling these translation challenges, we help revitalize the Bahnar language, ensuring information flows freely and the language thrives.



Paperid:2623
Authors:Todd W. Neller, Pia Bideau, David Bierbach, Wolfgang Hönig, Nir Lipovetzky, Christian Muise, Lino Coria, Claire Wong, Stephanie Rosenthal, Yu Lu, Ming Gao, Jingjing Zhang
Gettysburg College, Technical University of Berlin, Humboldt University of Berlin, Technical University of Berlin, The University of Melbourne, Queen’s University, Northeastern University, Carnegie Mellon University, Carnegie Mellon University, Beijing Normal University, Shanghai Normal University, Beijing Normal University
Abstract:
The Model AI Assignments session seeks to gather and disseminate the best assignment designs of the Artificial Intelligence (AI) Education community. Recognizing that assignments form the core of the student learning experience, we here present abstracts of five AI assignments from the 2024 session that are easily adoptable, playfully engaging, and flexible for a variety of instructor needs. Assignment specifications and supporting resources may be found at http://modelai.gettysburg.edu.



Paperid:2624
Authors:Shishir Adhikari
University of Illinois Chicago
Abstract:
Causal inference in relational data should account for the non-IID nature of the data and the interference phenomenon, which occurs when a unit's outcome is influenced by the treatments or outcomes of others. Existing solutions to causal inference under interference consider either homogeneous influence from peers or specific heterogeneous influence contexts (e.g., local neighborhood structure). This thesis investigates causal reasoning in relational data and the automated discovery of heterogeneous causal effects under arbitrary heterogeneous peer influence contexts and effect modification.



Paperid:2625
Authors:Pedram Agand
Simon Fraser University
Abstract:
In the domain of end-to-end autonomous driving, conventional sensor fusion techniques exhibit inadequacies, particularly when facing challenging scenarios with numerous dynamic agents. Imitation learning is bounded by the performance of the expert and encounters out-of-distribution issues. To overcome these limitations, we propose a transformer-based algorithm designed to fuse diverse representations from RGB-D cameras through knowledge distillation. This approach leverages insights from multi-task teachers to enhance the learning capabilities of single-task students, particularly in a Reinforcement Learning (RL) setting. Our model consists of two primary modules: the perception module, responsible for encoding observation data acquired from RGB-D cameras and performing tasks such as semantic segmentation, semantic depth cloud mapping (SDC), ego vehicle speed estimation, and traffic light state recognition; and the control module, which decodes these features, incorporating additional data, including a rough simulator for static and dynamic environments, to anticipate waypoints within a latent feature space. Vehicular controls (e.g., steering, throttle, and brake) are obtained directly from measurement features and environmental states using the RL agent and are further refined by a PID algorithm that dynamically follows waypoints. The model undergoes rigorous evaluation and comparative analysis on the CARLA simulator across various scenarios, encompassing normal to adversarial conditions. Our code is available at https://github.com/pagand/e2etransfuser/ to facilitate future studies.
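The PID refinement step mentioned above can be illustrated with a generic controller sketch; the gains and the heading-error signal below are illustrative placeholders, not the thesis' tuned values or its actual control interface.

class PID:
    """Generic PID controller: output = kp*e + ki*integral(e) + kd*d(e)/dt."""
    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float) -> float:
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

steering_pid = PID(kp=0.8, ki=0.01, kd=0.1, dt=0.05)  # illustrative gains
heading_error = 0.2                                    # radians toward the next waypoint
print(f"steering command: {steering_pid.step(heading_error):.3f}")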



Paperid:2626
Authors:Nathan Arnold
University of Kentucky
Abstract:
Expanding on a 2017 paper by Siler that introduced tiered coalition formation games, I have introduced a variant game and examined the stabilizability of both the original game and its variant. My thesis will contain further theoretical stability findings and the results and interpretation of a simulation based upon real data from video game matchups.



Paperid:2627
Authors:Saugat Aryal
University College Dublin
Abstract:
Most of the recent work on post-hoc example-based eXplainable AI (XAI) methods revolves around employing counterfactual explanations to provide justification of the predictions made by AI systems. Counterfactuals show what changes to the input features change the output decision. However, a lesser-known special case of the counterfactual is the semi-factual, which provides explanations about what changes to the input features do not change the output decision. Semi-factuals are potentially as useful as counterfactuals but have received little attention in the XAI literature. My doctoral research aims to establish a comprehensive framework for the use of semi-factuals in XAI by developing novel methods for their computation, supported by user tests.



Paperid:2628
Authors:Salena Torres Ashton
School of Information, University of Arizona
Abstract:
This research combines multi-agent planning, the psycholinguistics of question asking, procedural grounded theory, and hierarchical task networks to represent domains for automated planning.



Paperid:2629
Authors:Amine Barrak
University of Quebec at Chicoutimi, Québec, Canada
Abstract:
My thesis focuses on the integration of serverless computing with Peer-to-Peer (P2P) architectures in distributed Machine Learning (ML). This research aims to harness the decentralized, resilient nature of P2P systems, combined with the scalability and automation of serverless platforms. We explore using databases not just for communication but also for in-database model updates and gradient averaging, addressing the challenges of statelessness in serverless environments.



Paperid:2630
Authors:Joachim Baumann
University of Zurich
Abstract:
Today's machine learning (ML) applications predominantly adhere to a standard paradigm: the decision maker designs the algorithm by optimizing a model for some objective function. While this has proven to be a powerful approach in many domains, it comes with inherent side effects: the power over the algorithmic outcomes lies solely in the hands of the algorithm designer, and alternative objectives, such as fairness, are often disregarded. This is particularly problematic if the algorithm is used to make consequential decisions that affect people's lives. My research focuses on developing principled methods to characterize and address the mismatch between these different objectives.



Paperid:2631
Authors:Raffaele Galliera
The University of West Florida; The Institute for Human and Machine Cognition
Abstract:
This research explores optimizing communication tasks with (Multi-Agent) Reinforcement Learning (RL/MARL) in Point-to-Point and Group Communication (GC) networks. The study initially applied RL for Congestion Control in networks with dynamic link properties, yielding competitive results. Then, it focused on the challenge of effective message dissemination in GC networks by framing a novel game-theoretic formulation and designing methods to solve the task based on MARL and Graph Convolution. Future research will deepen the exploration of MARL in GC. This will contribute to both academic knowledge and practical advancements in the next generation of communication protocols.



Paperid:2632
Authors:Balint Gyevnar
University of Edinburgh
Abstract:
Autonomous systems fulfil an increasingly important role in our societies; however, AI-powered systems have seen less success over the years, as they are expected to tackle a range of social, legal, and technological challenges, and modern neural network-based AI systems cannot yet provide guarantees for many of these challenges. Particularly important is that these systems are black-box decision makers, eroding human oversight, contestation, and agency. To address this particular concern, my thesis focuses on integrating social explainable AI with cognitive methods and natural language processing to shed light on the internal processes of autonomous systems in a way accessible to lay users. I propose a causal explanation generation model for decision-making, called CEMA, based on counterfactual simulations in multi-agent systems. I also plan to integrate CEMA with a broader natural language processing pipeline to support targeted and personalised explanations that address people's cognitive biases. I hope that my research will have a positive impact on the public acceptance of autonomous agents by building towards more trustworthy AI.



Paperid:2633
Authors:Md. Khairul Islam
University of Virginia
Abstract:
The widespread use of Artificial Intelligence (AI) has highlighted the importance of understanding AI model behavior. This understanding is crucial for practical decision-making, assessing model reliability, and ensuring trustworthiness. Interpreting time series forecasting models faces unique challenges compared to image and text data. These challenges arise from the temporal dependencies between time steps and the evolving importance of input features over time. My thesis focuses on addressing these challenges by aiming for more precise explanations of feature interactions, uncovering spatiotemporal patterns, and demonstrating the practical applicability of these interpretability techniques using real-world datasets and state-of-the-art deep learning models.



Paperid:2634
Authors:Changhoon Kim
Arizona State University
Abstract:
My doctoral research delves into the realm of generative model fingerprinting, aiming to assign responsibility for the generated images. I introduce frameworks that modify generative models to incorporate each user's distinct digital fingerprint. This ensures that every piece of generated content carries a traceable identifier linked to its originator. The primary objective of my research is to achieve optimal attribution accuracy while ensuring minimal compromise on the model's performance. Additionally, I present strategies designed to enhance robustness against common adversarial manipulations, which malicious users might employ to obscure or remove these fingerprints.



Paperid:2635
Authors:Andrii Krutsylo
Institute of Computer Science Polish Academy of Sciences
Abstract:
In a Continual Learning setting, models are trained on data with occasional distribution shifts, resulting in forgetting the information learned before each shift. Experience Replay (ER) addresses this challenge by retaining part of the old training samples and replaying them alongside current data, improving the model's understanding of the overall distribution in training batches. A crucial factor in ER performance is the diversity of samples within batches. I investigate the impact of sample diversity across a sequence of batches, introducing a new metric and an associated approach to assess and leverage this diversity. This exploration opens up significant potential for future work, as various strategies can be devised to ensure inter-batch diversity. Achieving optimal results may involve striking a balance between this novel metric and other inherent properties of a batch or sequence.



Paperid:2636
Authors:Michael S. Lee
Carnegie Mellon University
Abstract:
Demonstrations are a powerful way of increasing the transparency of AI policies to humans. Though we can approximately model human learning from demonstrations as inverse reinforcement learning, we note that human learning can differ from algorithmic learning in key ways, e.g., humans are computationally limited and may sometimes struggle to understand all of the nuances of a demonstration. Unlike related work that provides demonstrations to humans that simply maximize information gain, I leverage concepts from the human education literature, such as the zone of proximal development and scaffolding, to show demonstrations that balance informativeness and difficulty of understanding to maximize human learning.



Paperid:2637
Authors:Sukanya Mandal
Dublin City University
Abstract:
A Smart City is one that makes better use of city data to make our communities better places to live. Typically, this has three components: sensing (data collection), analysis, and actuation. Privacy, particularly as it relates to citizens' data, is a cross-cutting theme. A Digital Twin (DT) is a virtual replica of a real-world physical entity. Cognitive Digital Twins (CDT) are DTs enhanced with cognitive AI capabilities. Both DTs and CDTs have seen adoption in the manufacturing and industrial sectors; however, cities are slow to adopt these because of privacy concerns. This work attempts to address these concerns by proposing a Privacy Preserving Federated Learning (PPFL) based Cognitive Digital Twin framework for Smart Cities.



Paperid:2638
Authors:Deepa Muralidhar
Georgia State University
Abstract:
Artificial intelligence system architects can increase user trust by designing systems that are inherently transparent. We propose the idea of representing an AI system as an amalgamation of the AI Model (algorithms), data (input and output, including outcomes), and the user interface with visual interpretations (e.g., graphs, Venn diagrams). By designing human controls and feedback mechanisms for AI systems that allow users to exert control over them, we can integrate transparency into existing user interfaces. Our plan is to design prototypes of transparent user interfaces for AI systems using well-known usability principles. By conducting surveys, we will study their impact to see if these principles help the user to work with the AI system with confidence and if the user perceives the system to be adequately transparent.



Paperid:2639
Authors:Rashmeet Kaur Nayyar
Arizona State University
Abstract:
Reinforcement Learning (RL) in complex environments presents many challenges: agents require learning concise representations of both environments and behaviors for efficient reasoning and generalizing experiences to new, unseen situations. However, RL approaches can be sample-inefficient and difficult to scale, especially in long-horizon sparse reward settings. To address these issues, the goal of my doctoral research is to develop methods that automatically construct semantically meaningful state and temporal abstractions for efficient transfer and generalization. In my work, I develop hierarchical approaches for learning transferable, generalizable knowledge in the form of symbolically represented options, as well as for integrating search techniques with RL to solve new problems by efficiently composing the learned options. Empirical results show that the resulting approaches effectively learn and transfer knowledge, achieving superior sample efficiency compared to SOTA methods while also enhancing interpretability.



Paperid:2640
Authors:Mbithe Nzomo
University of Cape Town
Abstract:
Non-communicable diseases are on the rise globally, resulting in accelerated efforts to develop personal health monitoring systems for early detection, prediction, and prevention of diseases. This is part of the vision of precision health, an emerging paradigm that focuses on preventing disease before it strikes by encouraging people to actively monitor and work towards improving their health. A key facilitator of this is the use of wearable sensors that can collect and measure physiological data. Although many sensor-based health monitoring systems have been proposed, interoperability of health data and processes, prediction of future health states, and uncertainty management remain open challenges. This research aims to alleviate these challenges through the development of a reusable framework integrating both data-driven and knowledge-driven AI within a hybrid AI architecture.



Paperid:2641
Authors:Elizabeth Akinyi Ondula
University of Southern California
Abstract:
My research integrates stochastic epidemic models with reinforcement learning to develop effective strategies or policies to inform operational decisions. The objective is to refine policies that are attuned to diverse outbreak dynamics and to offer a tool for informed planning in real-world settings.



Paperid:2642
Authors:Md Maklachur Rahman
Texas A&M University, College Station, TX, USA
Abstract:
Template learning transformer trackers have achieved significant performance improvement recently due to the long-dependency learning using the self-attention (SA) mechanism. However, the typical SA mechanisms in transformers adopt a less discriminative design approach which is inadequate for focusing on the most important target information during tracking. Therefore, existing trackers are easily distracted by background information and have constraints in handling tracking challenges. The focus of our research is to develop a target-focused discriminative shallow transformer tracking framework that can learn to distinguish the target from the background and enable accurate tracking with fast speed. Extensive experiments will be performed on several popular benchmarks, including OTB100, UAV123, GOT10k, LaSOT, and TrackingNet, to demonstrate the effectiveness of the proposed framework.



Paperid:2643
Authors:Célian Ringwald
Université Côte d’Azur, Inria, CNRS, I3S
Abstract:
Seq-to-seq transformer models have recently been successfully used for relation extraction, showing their flexibility, effectiveness, and scalability on that task. In this context, knowledge graphs aligned with Wikipedia such as DBpedia and Wikidata give us the opportunity to leverage existing texts and corresponding RDF graphs in order to extract, from these texts, the knowledge that is missing in the corresponding graphs and meanwhile improve their coverage. The goal of my thesis is to learn efficient extractors targeting specific RDF patterns and to do so by leveraging the latest language models and the dual base formed by Wikipedia on the one hand, and DBpedia and Wikidata on the other hand.



Paperid:2644
Authors:Deepayan Sanyal
Vanderbilt University
Abstract:
Infants see a selective view of the world: they see some objects with high frequency and from a wide range of viewpoints (e.g., their toys during playing) while a much larger set of objects are seen much more rarely and from limited viewpoints (e.g., objects they see outdoors). Extensive, repeated visual experiences with a small number of objects during infancy play a big role in the development of human visual skills. Internet-style datasets that are commonly used in computer vision research do not contain the regularities that result from such repeated, structured experiences with a few objects. This has led to a dearth of models that learn by exploiting these regularities. In my PhD dissertation, I use deep learning models to investigate how regularities in an infant's visual experience can be leveraged for visual representation learning.



Paperid:2645
Authors:Sangwon Seo
Rice University
Abstract:
Effective teamwork translates to fewer preventable errors and higher task performance in collaborative tasks. However, in time-critical tasks, successful teamwork becomes highly challenging to attain. In such settings, often, team members have partial observability of their surroundings, incur high cost of communication, and have trouble estimating the state and intent of their teammates. To assist a team in improving teamwork at task time, my doctoral research proposes an automated task-time team intervention system. Grounded in the notion of shared mental models, the system first detects whether the team is on the same page or not. It then generates effective interventions to improve teamwork. Additionally, by leveraging past demonstrations to learn a model of team behavior, this system minimizes the need for domain experts to specify teamwork models and rules.



Paperid:2646
Authors:Naman Shah
Arizona State University
Abstract:
Although state-of-the-art hierarchical robot planning algorithms allow robots to efficiently compute long-horizon motion plans for achieving user desired tasks, these methods typically rely upon environment-dependent state and action abstractions that need to be hand-designed by experts. On the other hand, non-hierarchical robot planning approaches fail to compute solutions for complex tasks that require reasoning over a long horizon. My research addresses these problems by proposing an approach for learning abstractions and developing hierarchical planners that efficiently use learned abstractions to boost robot planning performance and provide strong guarantees of reliability.



Paperid:2647
Authors:Ke Shen
USC Information Sciences Institute
Abstract:
The advent of powerful transformer-based discriminative language models and, more recently, generative GPT-family models has led to notable advancements in natural language processing (NLP). One such task is commonsense reasoning, where performance is usually evaluated through multiple-choice question-answering benchmarks. To date, many such benchmarks have been proposed, and `leaderboards' tracking state-of-the-art performance on those benchmarks suggest that transformer-based models are approaching human-like performance. However, due to documented problems such as hallucination and bias, the research focus is shifting from merely quantifying accuracy on the task to an in-depth, context-sensitive probing of LLMs' generalization and robustness. To gain deeper insight into diagnosing these models' performance in commonsense reasoning scenarios, this thesis addresses three main studies: the generalization ability of transformer-based language models on commonsense reasoning, the trend in confidence distribution of these language models when confronted with ambiguous inference tasks, and a proposed risk-centric evaluation framework for both discriminative and generative language models.



Paperid:2648
Authors:Nisha Simon
Queen's University
Abstract:
Humans have been using stories to entertain, educate, and persuade audiences for centuries. The advent of modern AI tools in the form of Large Language Models (LLMs) such as ChatGPT continues to fulfill this purpose. However, while recent work has shown that LLMs can successfully be used for narrative generation, they lack coherence and can be prone to repetition and stilted language. Automated Planning can therefore be combined with Natural Language text generation to create narratives (stories) that are logical, coherent, and believable. A planning model provides scaffolding to an LLM so that the LLM's language generation is context-dependent, in order to allow users to create more coherent, logical, and believable stories in a variety of domains.



Paperid:2649
Authors:Aaquib Tabrez
University of Colorado Boulder
Abstract:
Policy explanation, a process for describing the behavior of an autonomous system, plays a crucial role in effectively conveying an agent's decision-making rationale to human collaborators and is essential for safe real-world deployments. It becomes even more critical in effective human-robot teaming, where good communication allows teams to adapt and improvise successfully during uncertain situations by enabling value alignment within the teams. This thesis proposal focuses on improving human-machine teaming by developing novel human-centered explainable AI (xAI) techniques that empower autonomous agents to communicate their capabilities and limitations via multiple modalities, teach and influence human teammates' behavior as decision-support systems, and effectively build and manage trust in HRI systems.



Paperid:2650
Authors:Fiona Anting Tan
Institute of Data Science, National University of Singapore
Abstract:
Causality expresses the relation between two arguments, one of which represents the cause and the other the effect (or consequence). Causal text mining refers to the extraction and usage of causal information from text. Given an input sequence, we are interested in knowing if and where causal information occurs. My research is focused on the end-to-end challenges of causal text mining. This involves extracting, representing, and applying causal knowledge from unstructured text. The corresponding research questions are: (1) How to extract causal information from unstructured text effectively? (2) How to represent extracted causal relationships in a graph that is interpretable and useful for some application? (3) How can we capitalize on extracted causal knowledge for downstream tasks? What tasks or fields will benefit from such knowledge? In this paper, I outline past and ongoing work, and highlight future research challenges.



Paperid:2651
Authors:Pulkit Verma
Arizona State University
Abstract:
The vast diversity of internal designs of taskable black-box AI systems and their nuanced zones of safe functionality make it difficult for a layperson to use them without unintended side effects. My dissertation focuses on developing paradigms that enable a user to assess and understand the limits of an AI system's safe operability. We develop a personalized AI assessment module that lets an AI system execute instruction sequences in simulators and answer queries about these executions. Our results show that such a primitive query-response interface is sufficient to efficiently derive a user-interpretable model of a system's capabilities.



Paperid:2652
Authors:Luisa Werner
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG F-38000 Grenoble, France
Abstract:
The goal of this thesis is to address knowledge graph completion tasks using neurosymbolic methods. Neuro-symbolic methods allow the joint utilization of symbolic information defined as meta-rules in ontologies and knowledge graph embedding methods that represent entities and relations of the graph in a low-dimensional vector space. This approach has the potential to improve the resolution of knowledge graph completion tasks in terms of reliability, interpretability, data-efficiency and robustness.



Paperid:2653
Authors:Yuan Yang
Vanderbilt University
Abstract:
Despite current AI’s human-like behavior, super efficiency, and unbelievable ability to handle complex games, we still complain that it shows no sign of creativity, originality, or novelty outside its training set, and that it fails to develop new insights into old experience or establish understanding of new experience. In short, it generates content from its training set, but does not invent content. A fundamental reason for this is that current AI is incapable of abstraction and reasoning in an abstract, generalizable, and systematic way. Think, for instance, of what AI systems we can build if we have a base system that can answer this simple question: when two things are the same. Instead of studying these high-level questions, I put my thesis in the context of visual abstract reasoning (VAR), a task widely used in human intelligence tests. A classical example of this task is Raven’s Progressive Matrices (RPM, see Figure 1), a family of intelligence tests that was designed to measure eductive ability, i.e., the ability to make meaning out of confusion and generate high-level, usually nonverbal, schemata which make it easy to handle complexity. A similar concept to eductive ability is fluid intelligence, or the ability to discriminate and perceive complex relationships when no recourse to answers is stored in memory. Whether eductive ability or fluid intelligence, RPM points to the qualities that have been lacking in AI. To explore these qualities in AI, I propose the following research questions.



Paperid:2654
Authors:Adin Aberbach, Mayank Kejriwal, Ke Shen
Information Sciences Institute, University of Southern California, Information Sciences Institute, University of Southern California, Information Sciences Institute, University of Southern California
Abstract:
Entity Resolution (ER) is the problem of algorithmically matching records, mentions, or entries that refer to the same underlying real-world entity. Traditionally, the problem assumes (at most) two datasets, between which records need to be matched. There is considerably less research in ER when k > 2 datasets are involved. The evaluation of such multipartite ER (M-ER) is especially complex, since the usual ER metrics assume (whether implicitly or explicitly) k < 3. This paper takes the first step towards motivating a k-tuple approach for evaluating M-ER. Using standard algorithms and k-tuple versions of metrics like precision and recall, our preliminary results suggest a significant difference compared to aggregated pairwise evaluation, which would first decompose the M-ER problem into independent bipartite problems and then aggregate their metrics. Hence, M-ER may be more challenging and warrant more novel approaches than current decomposition-based pairwise approaches would suggest.



Paperid:2655
Authors:Alaleh Ahmadianshalchi, Syrine Belakaria, Janardhan Rao Doppa
Washington State University, Stanford university, Washington State University
Abstract:
We consider the problem of constrained multi-objective optimization over black-box objectives with user-defined preferences, where the input space is largely infeasible. Our goal is to approximate the optimal Pareto set from the small fraction of feasible inputs. The main challenges include a huge design space, multiple objectives, numerous constraints, and rare feasible inputs identified only through expensive experiments. We present PAC-MOO, a novel preference-aware multi-objective Bayesian optimization algorithm to solve this problem. It leverages surrogate models for objectives and constraints to intelligently select the sequence of inputs for evaluation to achieve the target goal.



Paperid:2656
Authors:Amine Barrak
University of Quebec at Chicoutimi, Québec, Canada
Abstract:
Distributed ML addresses challenges from increasing data and model complexities. Peer-to-peer (P2P) networks in distributed ML offer scalability and fault tolerance. However, they also encounter challenges related to resource consumption and communication overhead as the number of participating peers grows. This research introduces a novel architecture that combines serverless computing with P2P networks for distributed training. Serverless computing enhances this model with parallel processing and cost-effective scalability, suitable for resource-intensive tasks. Preliminary results show that peers can offload expensive computational tasks to serverless platforms. However, their inherent statelessness necessitates strong communication methods, suggesting a pivotal role for databases. To this end, we have enhanced an in-memory database to support ML training tasks.



Paperid:2657
Authors:Anthony Bazhenov, Pahan Dewasurendra, Giri Krishnan, Jean Erik Delanois
Del Norte High School, San Diego, CA, Del Norte High School, San Diego, CA, Department of Medicine, University of California San Diego, La Jolla, CA, Department of Computer Science & Engineering, University of California San Diego, La Jolla, CA Department of Medicine, University of California San Diego, La Jolla, CA
Abstract:
The performance of artificial neural networks (ANNs) degrades when training data are limited or imbalanced. In contrast, the human brain can learn quickly from just a few examples. Here, we investigated the role of sleep in improving the performance of ANNs trained with limited data on the MNIST and Fashion MNIST datasets. Sleep was implemented as an unsupervised phase with local Hebbian-type learning rules. We found a significant boost in accuracy after the sleep phase for models trained with limited data in the range of 0.5-10% of the total MNIST or Fashion MNIST datasets. When more than 10% of the total data was used, sleep alone had a slight negative impact on performance, but this was remedied by fine-tuning on the original data. This study sheds light on a potential synaptic weight dynamics strategy employed by the brain during sleep to enhance memory performance when training data are limited or imbalanced.
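As a rough illustration of the kind of local, unsupervised update such a sleep phase could apply, the sketch below strengthens weights between co-active units in a single layer; the actual rule, thresholding, and replay scheme used in the study are not specified here, and the function name is hypothetical.

```python
# Minimal sketch (not the authors' code) of an unsupervised, Hebbian-style "sleep"
# phase for one weight matrix; inputs are assumed to be scaled to [0, 1].
import numpy as np

def sleep_phase(W, inputs, lr=0.01, n_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        x = inputs[rng.integers(len(inputs))]   # spontaneous replay of a stored input
        h = (W @ x > 0).astype(float)           # thresholded post-synaptic activity
        W += lr * np.outer(h, x)                # Hebbian: co-active pre/post strengthens
        W -= lr * np.outer(h, 1.0 - x)          # simple depression for inactive inputs
    return W
```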



Paperid:2658
Authors:Tuhin Kumar Biswas, Avisek Gupta, Narayan Changder, Redha Taguelmimt, Samir Aknine, Samiran Chattopadhyay, Animesh Dutta
National Institute of Technology Durgapur, West Bengal, India, TCG CREST, Kolkata, West Bengal, India, TCG CREST, Kolkata, West Bengal, India, LIRIS, Lyon 1 University, Lyon, France, LIRIS, Lyon 1 University, Lyon, France, Techno India University, West Bengal, India, National Institute of Technology Durgapur, West Bengal, India
Abstract:
Simultaneous Coalition Structure Generation and Assignment (SCSGA) is an important research problem in multi-agent systems. Given n agents and m tasks, the aim of SCSGA is to form m disjoint coalitions of n agents such that between the coalitions and tasks there is a one-to-one mapping, which ensures each coalition is capable of accomplishing the assigned task. SCSGA with Multi-dimensional Features (SCSGA-MF) extends the problem by introducing a d-dimensional vector for each agent and task. We propose a heuristic algorithm called the Multiple Distance Metric (MDM) approach to solve SCSGA-MF. Experimental results confirm that MDM produces near-optimal solutions, while being feasible for large-scale inputs within a reasonable time frame.



Paperid:2659
Authors:Rickard Brännvall
Computer Science Department, RISE Research Institutes of Sweden Machine Learning Group, Luleå University of Technology, Sweden
Abstract:
To enhance the computational efficiency of quantized Transformers, we replace the dot-product and Softmax-based attention with an alternative mechanism involving addition and ReLU activation only. This side-steps the expansion to double precision often required by matrix multiplication and avoids costly Softmax evaluations but maintains much of the core functionality of conventional dot-product attention. It can enable more efficient execution and support larger quantized Transformer models on resource-constrained hardware or alternative arithmetic systems like homomorphic encryption. Training experiments on four common benchmark tasks show test set prediction scores comparable to those of conventional Transformers with dot-product attention. Our scaling experiments also suggest significant computational savings, both in plaintext and under encryption. In particular, we believe that the ReLU and addition-based attention mechanism introduced in this paper may enable privacy-preserving AI applications operating under homomorphic encryption by avoiding the costly multiplication of encrypted variables.
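For intuition, a minimal sketch of an addition-and-ReLU attention surrogate is given below; it is not the authors' exact formulation (the paper's normalization and value aggregation may differ), and the function name is ours.

```python
# Sketch of attention without dot-products or Softmax: scores come from broadcast
# addition of query and key features passed through ReLU, then sum-normalized.
# The final value aggregation is still shown as a weighted sum for illustration.
import torch

def relu_add_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq, dim)
    scores = torch.relu(q.unsqueeze(2) + k.unsqueeze(1)).sum(dim=-1)   # (batch, seq_q, seq_k)
    weights = scores / (scores.sum(dim=-1, keepdim=True) + eps)        # plain-sum normalization
    return weights @ v
```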



Paperid:2660
Authors:Yifu Cai, Arvind Srinivasan, Mononito Goswami, Arjun Choudhry, Artur Dubrawski
Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University
Abstract:
Time-series and text data are prevalent in healthcare and frequently co-exist, yet they are typically modeled in isolation. Even studies that jointly model time-series and text do so by converting time-series to images or graphs. We hypothesize that explicitly modeling time-series jointly with text can improve tasks such as summarization and question answering for time-series data, which have received little attention so far. To address this gap, we introduce JoLT to jointly learn desired representations from pre-trained time-series and text models. JoLT utilizes a Querying Transformer (Q-Former) to align the time-series and text representations. Our experiments on a large real-world electrocardiography dataset for medical time-series summarization show that JoLT outperforms state-of-the-art image captioning approaches.



Paperid:2661
Authors:Angela Chen, Nicholas Gisolfi, Artur Dubrawski
Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University
Abstract:
Ensuring a machine learning model’s trustworthiness is crucial to prevent potential harm. One way to foster trust is through the formal verification of the model’s adherence to essential design requirements. However, this approach relies on well-defined, application-domain-centric criteria with which to test the model, and such specifications may be cumbersome to collect in practice. We propose a data-driven approach for creating specifications to evaluate a trained model effectively. Implementing this framework allows us to prove that the model will exhibit safe behavior while minimizing the false-positive prediction rate. This strategy enhances predictive accuracy and safety, providing deeper insight into the model’s strengths and weaknesses, and promotes trust through a systematic approach.



Paperid:2662
Authors:Bin Chen, Kai Yang, Wenxin Tai, Zhangtao Cheng, Leyuan Liu, Ting Zhong, Fan Zhou
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry
Abstract:
Temporal knowledge graph reasoning is an essential task that holds immense value in diverse real-world applications. Existing studies mainly focus on leveraging structural and sequential dependencies, excelling in tasks like entity and link prediction. However, they confront a notable interpretability gap in their predictions, a pivotal facet for comprehending model behavior. In this study, we propose an innovative method, LSGAT, which not only exhibits remarkable precision in entity predictions but also enhances interpretability by identifying pivotal historical events influencing event predictions. LSGAT enables concise explanations for prediction outcomes, offering valuable insights into the otherwise enigmatic "black box" reasoning process. Through an exploration of the implications of the most influential events, it facilitates a deeper understanding of the underlying mechanisms governing predictions.



Paperid:2663
Authors:Tianyi Chen, Feiqi Cao, Yihao Ding, Caren Han
University of Sydney, University of Sydney, University of Sydney, University of Sydney University of Western Australia University of Melbourne
Abstract:
With the introduction of large language models, chatbots are becoming more conversational to communicate effectively and capable of handling increasingly complex tasks. To make a chatbot more relatable and engaging, we propose a new language model approach that captures human-like personality. In this paper, we propose a systematic Personality-Enhanced Language Model (PELM) approach by using a joint learning mechanism of personality classification and language generation tasks. The proposed PELM leverages a dataset of a defined personality typology, the Myers-Briggs Type Indicator, and produces a Personality-Enhanced Language Model by using a joint learning and cross-teaching structure, consisting of classification and language modelling components, to incorporate personalities via both distinctive types and textual information. The results show that PELM can generate better personality-based outputs than baseline models.



Paperid:2664
Authors:Xiaojian Chen, Chuyue Liao, Yanhui Gu, Yafei Li, Jinlan Wang, Yi Chen, Masaru Kitsuregawa
Department of Biomedical Engineering, Johns Hopkins University School of Computer and Electronic Information Science, Nanjing Normal University, School of Computer and Electronic Information Science, Nanjing Normal University, School of Computer and Electronic Information Science, Nanjing Normal University, School of Chemistry and Materials Science, Nanjing Normal University, School of Physics, Southeast University, School of Computer and Electronic Information Science, Nanjing Normal University, Institute of Industrial Science, The University of Tokyo
Abstract:
Matching molecular analogues is a research issue in computational chemistry and bioinformatics, used to identify molecules that are structurally or functionally similar to a target molecule. Recent studies on matching analogous molecules have predominantly concentrated on enhancing effectiveness, often sidelining computational efficiency, particularly in contexts of low computational resources. This oversight poses challenges in many real applications (e.g., drug discovery, catalyst generation, and so forth). To tackle this issue, we propose a general strategy named MapLE, aiming to promptly match analogous molecules with low computational resources by multi-metric evaluation. Experimental evaluation conducted on a public biomolecular dataset validates the excellent and efficient performance of the proposed strategy.



Paperid:2665
Authors:Zhuo Chen, Haimei Zhao, Chaoyue Wang, Bo Yuan, Xiu Li
Shenzhen International Graduate School, Tsinghua University, University of Sydney, University of Sydney, University of Queensland, Shenzhen International Graduate School, Tsinghua University
Abstract:
3D-aware GANs successfully solve the problem of 3D-consistent generation and furthermore provide a 3D shape of the generated object. However, the application of the volume renderer disturbs the disentanglement of the latent space, which makes it difficult to manipulate 3D-aware GANs and lowers the image quality of style-based generators. In this work, we devise a dual-mapping framework to make the generated images of a pretrained 2D StyleGAN consistent in 3D space. We utilize a tri-plane representation to estimate the 3D shape of the generated object and two mapping networks to bridge the latent space of StyleGAN and the 3D tri-plane space. Our method does not alter the parameters of the pretrained generator, which means the interpretability of latent space is preserved for various image manipulations. Experiments show that our method lifts the 3D awareness of pretrained 2D StyleGAN to 3D-aware GANs and outperforms the 3D-aware GANs in controllability and image quality.



Paperid:2666
Authors:Zhuo Chen, Haimei Zhao, Bo Yuan, Xiu Li
Shenzhen International Graduate School, Tsinghua University, University of Sydney, University of Queensland, Shenzhen International Graduate School, Tsinghua University
Abstract:
Multi-camera depth estimation has recently garnered significant attention due to its substantial practical implications in the realm of autonomous driving. In this paper, we delve into the task of self-supervised multi-camera depth estimation and propose an innovative framework, STViT, featuring several noteworthy enhancements: 1) we propose a Spatial-Temporal Transformer to comprehensively exploit both local connectivity and the global context of image features, meanwhile learning enriched spatial-temporal cross-view correlations to recover 3D geometry. 2) to alleviate the severe effect of adverse conditions, e.g., rainy weather and nighttime driving, we introduce a GAN-based Adversarial Geometry Regularization Module (AGR) to further constrain the depth estimation with unpaired normal-condition depth maps and prevent the model from being incorrectly trained. Experiments on challenging autonomous driving datasets Nuscenes and DDAD show that our method achieves state-of-the-art performance.



Paperid:2667
Authors:Taoyong Cui, Yuhan Dong
Tsinghua University, Tsinghua University
Abstract:
Graph neural networks (GNNs) have attracted significant interest recently since they can effectively process and analyze graph-structured data commonly found in real-world applications. However, the predicament that GNNs are difficult to train becomes worse as the layers increase. The essence of this problem is that stacking layers will reduce the stability of forward propagation and gradient back-propagation. Moreover, as the scale of models (measured by the number of parameters) increases, how to efficiently and effectively adapt them to particular downstream tasks becomes an intriguing research issue. In this work, motivated by the effect of orthogonality constraints, we propose a simple orthogonal training framework to impose orthogonality constraints on GNNs, which can help models find a solution vector in a specific low-dimensional subspace and stabilize the signaling processes in both the forward and backward directions. Specifically, we propose a novel polar decomposition-based orthogonal initialization (PDOI-R) algorithm, which can identify the low intrinsic dimension within the Stiefel Manifold and stabilize the training process. Extensive experiments demonstrate the effectiveness of the proposed method in multiple downstream tasks, showcasing its generality. The simple method can help existing state-of-the-art models achieve better performance.
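As a sketch of the basic idea behind a polar decomposition-based orthogonal initialization (the full PDOI-R algorithm, including its identification of a low intrinsic dimension, is not reproduced here), one can initialize a layer with the orthogonal factor of a random matrix:

```python
# Illustrative only: take the orthogonal polar factor U of a random Gaussian matrix
# A = U P (computed via SVD), so the initial weights lie on the Stiefel manifold.
import torch

def polar_orthogonal_init(out_dim, in_dim):
    a = torch.randn(out_dim, in_dim)
    u, _, vh = torch.linalg.svd(a, full_matrices=False)
    return u @ vh
```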



Paperid:2668
Authors:Taoyong Cui, Yuhan Dong
Tsinghua University, Tsinghua University
Abstract:
Image/video denoising in low-light scenes is an extremely challenging problem due to limited photon count and high noise. In this paper, we propose a novel approach with contrastive learning to address this issue. Inspired by the success of contrastive learning in some high-level computer vision tasks, we bring this idea to the low-level denoising task. In order to achieve this goal, we introduce a new denoising contrastive regularization (DCR) to exploit the information of noisy images and clean images. In the feature space, DCR makes the denoised image closer to the clean image and far away from the noisy image. In addition, we build a new feature embedding network called Wnet, which is more effective at extracting high-frequency information. We conduct experiments on a real low-light dataset that captures still images taken on a moonless clear night in 0.6 millilux and videos under starlight (no moon present). The results show that our method can achieve a higher PSNR and better visual quality compared with existing methods.
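A minimal sketch of what such a contrastive regularization term can look like is shown below; the actual DCR formulation and the Wnet feature extractor in the paper may differ, and the distance choice here is only an assumption.

```python
# Sketch of a contrastive regularizer for denoising: in a shared feature space,
# pull the denoised output toward the clean image and push it away from the noisy input.
import torch
import torch.nn.functional as F

def dcr_loss(feat_extractor, denoised, clean, noisy, eps=1e-6):
    f_d, f_c, f_n = feat_extractor(denoised), feat_extractor(clean), feat_extractor(noisy)
    pos = F.l1_loss(f_d, f_c)     # distance to the positive (clean) anchor
    neg = F.l1_loss(f_d, f_n)     # distance to the negative (noisy) anchor
    return pos / (neg + eps)      # small when close to clean and far from noisy
```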



Paperid:2669
Authors:Luca D'Amico-Wong, Gary Qiurui Ma, David Parkes
Harvard College, Harvard University, Harvard University
Abstract:
We consider a platform in a two-sided market with unit-supply sellers and unit-demand buyers. Each buyer can transact with a subset of sellers it knows off platform and another seller that the platform recommends. Given the choice of sellers, transactions and prices form a competitive equilibrium. The platform selects one seller for each buyer, and charges a fixed percentage of prices to all transactions that it recommends. The platform seeks to maximize total revenue. We show that the platform's problem is NP-hard, even when each buyer knows at most two sellers off platform. Finally, when each buyer values all sellers equally and knows only one seller off platform, we provide a polynomial time algorithm that optimally solves the problem.



Paperid:2670
Authors:Narjes Delpisheh, Yllias Chali
University of Lethbridge, University of Lethbridge
Abstract:
Abstractive text summarization uses the summarizer’s own words to capture the main information of a source document in a summary. While it is more challenging to automate than extractive text summarization, recent advancements in deep learning approaches and pretrained language models have improved its performance. However, abstractive text summarization still has issues such as unfaithfulness. To address this problem, we propose a new approach that utilizes important Elementary Discourse Units (EDUs) to guide BART-based text summarization. Our approach showed improvements in faithfulness and source document coverage in comparison to some previous studies.



Paperid:2671
Authors:Junzhe Ding, Yufei Que, Jin Zhang, Cheng Wu
School of Rail Transportation, Soochow University, School of Rail Transportation, Soochow University, School of Rail Transportation, Soochow University, School of Rail Transportation, Soochow University
Abstract:
Exploring an effective point cloud completion mechanism is of great significance for real-world tasks such as autonomous driving, robotics applications, and multi-target tracking. In this paper, we propose a point cloud completion method using a self-supervised transformer model based on the contextual constraints of scene flow. Our method uses the multi-frame point cloud context relationship as a guide to generate a series of token proposals; this prior condition ensures the stability of the point cloud completion. The experimental results show that the method proposed in this paper achieves high accuracy and good stability.



Paperid:2672
Authors:Shane Donnelly, Ayan Dutta
University of North Florida, University of North Florida
Abstract:
An exoplanet is a planet that is not part of our solar system. Whether life exists in one or more of these exoplanets has fascinated humans for centuries. NASA’s Kepler Space Telescope has discovered more than 70% of known exoplanets in our universe. However, manually determining whether a Kepler light curve indicates an exoplanet or not becomes infeasible with the large volume of data. Due to this, we propose a deep learning-based strategy to automatically classify a Kepler light curve. More specifically, we first convert the light curve time series into its corresponding Markov Transition Field (MTF) image and then classify it. Results show that the accuracy of the proposed technique is 99.39%, which is higher than all current state-of-the-art approaches.
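The Markov Transition Field step can be sketched as follows; this is a from-scratch illustration, and the binning strategy and downstream CNN classifier used in the paper are not specified here.

```python
# Convert a 1-D light curve into a Markov Transition Field image: quantile-bin the
# series, estimate first-order bin-to-bin transition probabilities, then set
# M[i, j] to the transition probability from the bin of x_i to the bin of x_j.
import numpy as np

def markov_transition_field(x, n_bins=8):
    cuts = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    q = np.digitize(x, cuts)                        # bin index of every time step
    trans = np.zeros((n_bins, n_bins))
    for a, b in zip(q[:-1], q[1:]):                 # count first-order transitions
        trans[a, b] += 1
    trans /= np.maximum(trans.sum(axis=1, keepdims=True), 1)
    return trans[np.ix_(q, q)]                      # image of shape (len(x), len(x))
```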



Paperid:2673
Authors:Danilo Dordevic, Vukasin Bozic, Joseph Thommes, Daniele Coppola, Sidak Pal Singh
ETH Zurich, ETH Zurich, ETH Zurich, ETH Zurich, ETH Zürich
Abstract:
This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture. Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.
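One simple instance of the replacement-by-distillation idea is sketched below: a shallow MLP is trained to reproduce the outputs of a frozen self-attention block. The dimensions, depth, and teacher block are illustrative assumptions, not necessarily the configuration used in the paper.

```python
# Distill a frozen self-attention block into a shallow feed-forward "student" that
# maps the flattened token sequence to the attention output for a fixed sequence length.
import torch
import torch.nn as nn

seq_len, d_model = 32, 64
student = nn.Sequential(
    nn.Flatten(start_dim=1),                   # (batch, seq_len * d_model)
    nn.Linear(seq_len * d_model, 1024),
    nn.ReLU(),
    nn.Linear(1024, seq_len * d_model),
    nn.Unflatten(1, (seq_len, d_model)),
)
teacher = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True).eval()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):                           # distillation on random probe inputs
    x = torch.randn(16, seq_len, d_model)
    with torch.no_grad():
        target, _ = teacher(x, x, x)           # teacher's self-attention output
    loss = nn.functional.mse_loss(student(x), target)
    opt.zero_grad(); loss.backward(); opt.step()
```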



Paperid:2674
Authors:Conor Duggan, Zhengyu Li, Curtis Bright, Vijay Ganesh
University of Waterloo, Georgia Institute of Technology, University of Windsor, Georgia Institute of Technology
Abstract:
The Ramsey problem R(3,8) asks for the smallest n such that every red/blue coloring of the complete graph on n vertices must contain either a blue triangle or a red 8-clique. We provide the first certifiable proof that R(3,8) = 28, automatically generated by a combination of a Boolean satisfiability (SAT) solver and a computer algebra system (CAS). This SAT+CAS combination is significantly faster than a SAT-only approach. While the R(3,8) problem was first computationally solved by McKay and Min in 1992, it was not a verifiable proof. The SAT+CAS method that we use for our proof is very general and can be applied to a wide variety of combinatorial problems.



Paperid:2675
Authors:Eric Enouen, Sebastian Caldas, Mononito Goswami, Artur Dubrawski
The Ohio State University Auton Lab, School of Computer Science, Carnegie Mellon University, Princeton University Auton Lab, School of Computer Science, Carnegie Mellon University, Auton Lab, School of Computer Science, Carnegie Mellon University, Auton Lab, School of Computer Science, Carnegie Mellon University
Abstract:
Federated Learning is an effective approach for learning from data distributed across multiple institutions. While most existing studies are aimed at improving predictive accuracy of models, little work has been done to explain knowledge differences between institutions and the benefits of collaboration. Understanding these differences is critical in cross-silo federated learning domains, e.g., in healthcare or banking, where each institution or silo has a different underlying distribution and stakeholders want to understand how their institution compares to their partners. We introduce Prototype-Informed Cross-Silo Router (PICSR) which utilizes a mixture of experts approach to combine local models derived from multiple silos. Furthermore, by computing data similarity to prototypical samples from each silo, we are able to ground the router’s predictions in the underlying dataset distributions. Experiments on a real-world heart disease prediction dataset show that PICSR retains high performance while enabling further explanations on the differences among institutions compared to a single black-box model.



Paperid:2676
Authors:Yimeng Fan, Pedram Agand, Mo Chen, Edward J. Park, Allison Kennedy, Chanwoo Bae
Simon Fraser University, Simon Fraser University, Simon Fraser University, Simon Fraser University, National Research Council Canada, BC Ferry Inc.
Abstract:
The maritime industry's continuous commitment to sustainability has led to a dedicated exploration of methods to reduce vessel fuel consumption. This paper undertakes this challenge through a machine learning approach, leveraging a real-world dataset spanning two years of a passenger vessel in west coast Canada. Our focus centers on the creation of a time series forecasting model given the dynamic and static states, actions, and disturbances. This model is designed to predict dynamic states based on the actions provided, subsequently serving as an evaluative tool to assess the proficiency of the vessel's operation under the captain's guidance. Additionally, it lays the foundation for future optimization algorithms, providing valuable feedback on decision-making processes. To facilitate future studies, our code is available at https://github.com/pagand/model_optimze_vessel/tree/AAAI.



Paperid:2677
Authors:Wancheng Feng, Yingchao Liu, Jiaming Pei, Wenxuan Liu, Chunpeng Tian, Lukun Wang
Shandong University of Science and Technology, Shandong University of Science and Technology, University of Sydney, Shandong University of Science and Technology, Shandong University of Science and Technology, Shandong University of Science and Technology
Abstract:
Face video stylization aims to convert real face videos into specified reference styles. While one-shot methods perform well in single-image stylization, ensuring continuity between frames and retaining the original facial expressions present challenges in video stylization. To address these issues, our approach employs a personalized diffusion model with pixel-level control. We propose a Local Consistency Guidance (LCG) strategy, composed of local-cross attention and local style transfer, to ensure temporal consistency. This framework enables the synthesis of high-quality stylized face videos with excellent temporal continuity.



Paperid:2678
Authors:Grant C. Forbes, David L. Roberts
North Carolina State University, North Carolina State University
Abstract:
Recently, there has been a proliferation of intrinsic motivation (IM) reward shaping methods to learn in complex and sparse-reward environments. These methods can often inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior. Previous work on mitigating the risks of reward shaping, particularly through potential-based reward shaping (PBRS), has not been applicable to many IM methods, as they are often complex, trainable functions themselves, and therefore dependent on a wider set of variables than the traditional reward functions that PBRS was developed for. We present an extension to PBRS that we show preserves the set of optimal policies under a more general set of functions than has been previously demonstrated. We also present Potential-Based Intrinsic Motivation (PBIM), a method for converting IM rewards into a potential-based form that is usable without altering the set of optimal policies. Testing in the MiniGrid DoorKey environment, we demonstrate that PBIM successfully prevents the agent from converging to a suboptimal policy and can speed up training.
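For reference, classic potential-based reward shaping adds F(s, s') = gamma * Phi(s') - Phi(s) to the environment reward, which is what guarantees the set of optimal policies stays unchanged; the sketch below shows only this base mechanism, not PBIM's conversion of trainable intrinsic-motivation rewards into potential form.

```python
# Ng et al. (1999)-style potential-based shaping: the shaping term is a discounted
# difference of potentials, so it cannot change which policies are optimal.
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    return r + gamma * phi(s_next) - phi(s)
```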



Paperid:2679
Authors:Hongzhu Fu, Fan Zhou, Qing Guo, Qiang Gao
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kash Institute of Electronics and Information Industry, School of Computer Science and Technology, Beijing Institute of Technology, Southwestern University of Finance and Economics Kash Institute of Electronics and Information Industry
Abstract:
Crime prediction stands as a pivotal concern within the realm of urban management due to its potential threats to public safety. While prior research has predominantly focused on unraveling the intricate dependencies among urban regions and temporal dynamics, the challenges posed by the scarcity and uncertainty of historical crime data have not been thoroughly investigated. This study introduces an innovative spatial-temporal augmented learning framework for crime prediction, namely STAug. In STAug, we devise CrimeMix to improve generalization. Furthermore, we harness a spatial-temporal aggregation to capture and incorporate multiple correlations covering the temporal, spatial, and crime-type aspects. Experiments on two real-world datasets underscore the superiority of STAug over several baselines.



Paperid:2680
Authors:Shaz Furniturewala, Surgan Jandial, Abhinav Java, Simra Shahid, Pragyan Banerjee, Balaji Krishnamurthy, Sumit Bhatia, Kokil Jaidka
Birla Institute of Technology and Science Pilani, Pilani, MDSR Labs Adobe, MDSR Labs, Adobe, MDSR Labs Adobe, Birla Institute of Technology and Science Pilani, Pilani, MDSR Labs Adobe, MDSR Labs Adobe, National University of Singapore
Abstract:
Achieving fairness in Large Language Models (LLMs) continues to pose a persistent challenge, as these models are prone to inheriting biases from their training data, which can subsequently impact their performance in various applications. There is a need to systematically explore whether structured prompting techniques can offer opportunities for debiased text generation by LLMs. In this work, we designed an evaluative framework to test the efficacy of different prompting techniques for debiasing text along different dimensions. We aim to devise a general structured prompting approach to achieve fairness that generalizes well to different texts and LLMs.



Paperid:2681
Authors:Alessio Galatolo, Katie Winkle
Uppsala University, Uppsala University
Abstract:
The Transformer architecture has seen a lot of attention in recent years, also thanks to its ability to scale well and allow massive parallelism during training. This has made possible the development of Language Models (LMs) of increasing size and the discovery of latent abilities that completely outclass traditional methods, e.g., rule-based systems. However, they also introduced new issues, like their inability to retain the history of previous interactions due to their stateless nature or the difficulty in controlling their generation. Different attempts have been made to address these issues, e.g., a `brute force' approach to solving the memory issue is to include the full conversation history in the context window, a solution that is limited by the quadratic scalability of Transformers. In this work, we explore computationally practical solutions to the memory problem. We propose to augment the decoder-only architecture of (most) Large LMs with a (relatively small) memory encoder. Its output is prepended to the decoder's input in a similar fashion to recent works in Adapters and the original Transformer architecture. Initial experiments show promising results; however, future work is needed to compare with state-of-the-art methods.
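A rough sketch of this augmentation, under our own assumptions about dimensions and how the memory vectors are formed (the paper's encoder design may differ, and the class and attribute names are hypothetical), compresses the conversation history into a few vectors that are prepended to the decoder's input embeddings:

```python
# Sketch: a small memory encoder summarizes the history into n_mem vectors that are
# concatenated in front of the decoder's input embeddings, acting like extra tokens.
import torch
import torch.nn as nn

class MemoryPrefix(nn.Module):
    def __init__(self, d_model=768, n_mem=16, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.slots = nn.Parameter(torch.randn(n_mem, d_model))   # learned memory slots

    def forward(self, history_emb, input_emb):
        # history_emb, input_emb: (batch, seq, d_model) token embeddings
        summary = self.encoder(history_emb).mean(dim=1, keepdim=True)   # pooled history
        mem = self.slots.unsqueeze(0) + summary                         # (batch, n_mem, d_model)
        return torch.cat([mem, input_emb], dim=1)                       # prepended memory
```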



Paperid:2682
Authors:Agasthya Gangavarapu
Eastlake High School
Abstract:
The COVID-19 pandemic has exacerbated the challenges faced by healthcare delivery in developing nations, placing additional strain on already fragile infrastructure and healthcare systems. This has prompted an increased reliance on lay healthcare workers (LHWs) to meet the surging demand for services. Due to limited formal training, many LHWs have resorted to using unreliable sources, such as internet searches, to access medical information. Large language models (LLMs) offer a promising opportunity to support LHWs by providing accurate, context-sensitive information for improving healthcare delivery, provided they are appropriately fine-tuned on domain-specific multilingual data. This paper delves into critical issues and presents potential solutions for developing LLM-powered virtual assistants tailored to LHWs serving Telugu and Hindi-speaking populations. Key focal points include the customization of language and content to suit local contexts, the integration of feedback mechanisms to continuously enhance assistance quality, and the delicate balance between automation and human oversight.



Paperid:2683
Authors:Liyuan Gao, Matthew Zhang, Victor S. Sheng
Texas Tech University, Baylor University, Texas Tech University
Abstract:
Transcription factors (TFs) play a fundamental role in gene regulation by selectively binding to specific DNA sequences. Understanding the nature and behavior of these TFs is essential for insights into gene regulation dynamics. In this study, we introduce a robust multi-task learning framework specifically tailored to harness both TF-specific annotations and TF-related domain annotations, thereby enhancing the accuracy of TF predictions. Notably, we incorporate cutting-edge language models that have recently garnered attention for their outstanding performance across various fields, particularly in biological computations like protein sequence modeling. Comparative experimental analysis with existing models, DeepTFactor and TFpredict, reveals that our multi-task learning framework achieves an accuracy exceeding 92% across four evaluation metrics on the TF prediction task, surpassing both competitors. Our work marks a significant leap in the domain of TF prediction, enriching our comprehension of gene regulatory mechanisms and paving the way for the discovery of novel regulatory motifs.



Paperid:2684
Authors:Jaidev Gill, Vala Vakilian, Christos Thrampoulidis
University of British Columbia, University of British Columbia, University of British Columbia
Abstract:
Supervised contrastive loss (SCL) is an alternative to cross-entropy (CE) for classification tasks that makes use of similarities in the embedding space to allow for richer representations. Previous works have used trainable prototypes to help improve the test accuracy of SCL when training under imbalance. In this work, we propose the use of fixed prototypes to help engineer the feature geometry when training with SCL. We gain further insights by considering a limiting scenario where the number of prototypes far outnumbers the original batch size. Through this, we establish a connection to CE loss with a fixed classifier and normalized embeddings. We validate our findings by conducting a series of experiments with deep neural networks on benchmark vision datasets.
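A minimal sketch of appending fixed class prototypes to the batch inside a supervised contrastive loss is given below; it only shows the basic construction, not the limiting regime (prototypes far outnumbering the batch) analyzed in the work, and the temperature and masking choices are assumptions.

```python
# Supervised contrastive loss over a batch augmented with fixed (non-trainable)
# class prototypes of known label, which act as extra positives/negatives that
# shape the embedding geometry.
import torch
import torch.nn.functional as F

def scl_with_fixed_prototypes(feats, labels, prototypes, proto_labels, tau=0.1):
    z = F.normalize(torch.cat([feats, prototypes]), dim=1)
    y = torch.cat([labels, proto_labels])
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = (z @ z.t() / tau).masked_fill(eye, -1e9)             # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)).float().masked_fill(eye, 0.0)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```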



Paperid:2685
Authors:Zihan Guan, Mengxuan Hu, Zhongliang Zhou, Jielu Zhang, Sheng Li, Ninghao Liu
The University of Virginia, The University of Virginia, The University of Georgia, The University of Georgia, The University of Virginia, The University of Georgia
Abstract:
Image segmentation is foundational to computer vision applications, and the Segment Anything Model (SAM) has become a leading base model for these tasks. However, SAM falters in specialized downstream challenges, leading to various customized SAM models. We introduce BadSAM, a backdoor attack tailored for SAM, revealing that customized models can harbor malicious behaviors. Using the CAMO dataset, we confirm BadSAM's efficacy and identify SAM vulnerabilities. This study paves the way for the development of more secure and customizable vision foundation models.



Paperid:2686
Authors:Julien Guité-Vinet, Alexandre Blondin Massé, Fatiha Sadat
Université du Québec à Montréal, Université du Québec à Montréal, Université du Québec à Montréal
Abstract:
In recent years, several variants of transformers have emerged. In this paper, we compare different transformer-based models for solving the reverse dictionary task and explore their use in the context of a serious game called The Dictionary Game.



Paperid:2687
Authors:Aarushi Gupta, Yuexing Hao, Yuting Yang, Tiancheng Yuan, Matthias Wieland, Parminder S. Basran, Ken Birman
Cornell University, Cornell University, Cornell University, Cornell University, Cornell University, Cornell University, Cornell University
Abstract:
Dairy owners invest heavily to keep their animals healthy. There is good reason to hope that technologies such as computer vision and artificial intelligence (AI) could reduce costs, yet obstacles arise when adapting these advanced tools to farming environments. In this work, we applied AI tools to dairy cow teat localization and teat shape classification, obtaining a model that achieves a mean average precision of 0.783. This digital twin-driven approach is intended as a first step towards automating and accelerating the detection and treatment of hyperkeratosis, mastitis, and other medical conditions that significantly burden the dairy industry.



Paperid:2688
Authors:Ruiqi He, Carlos G. Correa, Tom L. Griffiths, Mark K. Ho
Max Planck Institute for Intelligent Systems, Tübingen, Princeton Neuroscience Institute, Princeton University, Princeton, New Jersey, Department of Psychology, Princeton University, Princeton, New Jersey Department of Computer Science, Princeton University, Princeton, New Jersey, Department of Computer Science, Stevens Institute of Technology, Hoboken, New Jersey
Abstract:
How are people able to plan so efficiently despite limited cognitive resources? We aimed to answer this question by extending an existing model of human task decomposition that can explain a wide range of simple planning problems by adding structure information to the task to facilitate planning in more complex tasks. The extended model was then applied to a more complex planning domain of spatial navigation. Our results suggest that our framework can correctly predict the navigation strategies of the majority of the participants in an online experiment.



Paperid:2689
Authors:Jinyu Hong, Ping Kuang, Qiang Gao, Fan Zhou
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, Southwestern University of Finance and Economics Kash Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kash Institute of Electronics and Information Industry
Abstract:
In recent intelligent transportation applications, metro flow forecasting has received much attention from researchers. Most prior arts endeavor to explore spatial or temporal dependencies while ignoring the key characteristic patterns underlying historical flows, e.g., trend and periodicity. Although multi-granularity distillation or spatial dependency correlation can improve flow estimation, the potential noise and spatial dynamics remain underexplored. To this end, we propose a novel Disentanglement-Guided Spatial-Temporal Graph Neural Network, or DGST, to address the above concerns. It contains a Disentanglement Pre-training procedure for characteristic pattern disentanglement learning, a Characteristic Pattern Prediction for different future characteristic explorations, and a Spatial-Temporal Correlation for spatial-temporal dynamic learning. Experiments on a real-world dataset demonstrate the superiority of our DGST.



Paperid:2690
Authors:Madhav Hota, Adel Khorramrouz, Ashiqur R. KhudaBukhsh
Illinois Mathematics and Science Academy, Rochester Institute of Technology, Rochester Institute of Technology
Abstract:
On 15 January 2022, noted tennis player Novak Djokovic was deported from Australia because he was unvaccinated against COVID-19. This paper presents a stance classifier and evaluates public reaction to this episode and its impact on social media discourse on YouTube. We observed a significant spike of individuals who supported and opposed his behavior at the time of the episode. Supporters outnumbered those who opposed this behavior by over 4x. Our study reports a disturbing trend that following every major Djokovic win, even now, vaccine skeptics often tout his tennis success as a fitting reply to vaccine mandates.



Paperid:2691
Authors:Yanlong Huang, Yue Lei, Wenxin Tai, Zhangtao Cheng, Ting Zhong, Kunpeng Zhang
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kash Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kash Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kash Institute of Electronics and Information Industry, University of Maryland, College Park
Abstract:
Earnings call transcripts hold valuable insights that are vital for investors and analysts when making informed decisions. However, extracting these insights from lengthy and complex transcripts can be a challenging task. The traditional manual examination is not only time-consuming but also prone to errors and biases. Deep learning-based representation learning methods have emerged as promising and automated approaches to tackle this problem. Nevertheless, they may encounter significant challenges, such as the unreliability of the representation encoding process and certain domain-specific requirements in the context of finance. To address these issues, we propose a novel transcript representation learning model. Our model leverages the structural information of transcripts to effectively extract key insights, while endowing the model with explainability via a variational information bottleneck. Extensive experiments on two downstream financial tasks demonstrate the effectiveness of our approach.



Paperid:2692
Authors:Piyush Jha, Joseph Scott, Jaya Sriram Ganeshna, Mudit Singh, Vijay Ganesh
Georgia Institute of Technology, University of Waterloo, University of Waterloo, University of Waterloo, Georgia Institute of Technology
Abstract:
We present a novel tool, BertRLFuzzer, a BERT and Reinforcement Learning (RL) based fuzzer aimed at finding security vulnerabilities in Web applications. BertRLFuzzer works as follows: given a set of seed inputs, the fuzzer performs grammar-adhering and attack-provoking mutation operations on them to generate candidate attack vectors. The key insight of BertRLFuzzer is the use of RL with a BERT model as an agent to guide the fuzzer to efficiently learn grammar-adhering and attack-provoking mutation operators. In order to establish the efficacy of BertRLFuzzer, we compare it against a total of 13 black-box and white-box fuzzers over a benchmark of 9 victim websites with over 16K LOC. We observed a significant improvement relative to the nearest competing tool in terms of time to first attack (54% less), new vulnerabilities found (17 new vulnerabilities), and attack rate (4.4% more attack vectors generated).



Paperid:2693
Authors:Yixuan Jin, Yutao Wei, Zhangtao Cheng, Wenxin Tai, Chunjing Xiao, Ting Zhong
University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China Kashi Institute of Electronics and Information Industry, Kashgar 844000, China, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China Kashi Institute of Electronics and Information Industry, Kashgar 844000, China, Henan University, Kaifeng, Henan 475000, China, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China Kashi Institute of Electronics and Information Industry, Kashgar 844000, China
Abstract:
The success of graph neural networks (GNNs) has spurred numerous new works leveraging GNNs for modeling multivariate time series anomaly detection. Despite their achieved performance improvements, most of them only consider a static graph to describe the spatial-temporal dependencies between time series. Moreover, existing works neglect the time- and scale-changing structures of time series. In this work, we propose MDGAD, a novel multi-scale dynamic graph structure learning approach for time series anomaly detection. We design a multi-scale graph structure learning module that captures the complex correlations among time series, constructing an evolving graph at each scale. Meanwhile, an anomaly detector is used to combine bilateral prediction errors to detect abnormal data. Experiments conducted on two time series datasets demonstrate the effectiveness of MDGAD.



Paperid:2694
Authors:Amelia Jobe, Richard Ky, Sandra Luo, Akshay Dhamsania, Sumit Purohit, Edoardo Serra
Boise State University, San Jose State University, University of Texas at Dallas, Texas A&M University, Pacific Northwest National Laboratory, Boise State University
Abstract:
Cyberattacks on power grids pose significant risks to national security. Power grid attacks typically lead to abnormal readings in power output, frequency, current, and voltage. Due to the interconnected structure of power grids, abnormalities can spread throughout the system and cause widespread power outages if not detected and dealt with promptly. Our research proposes a novel anomaly detection system for power grids that prevents overfitting. We created a network graph to represent the structure of the power grid, where nodes represent power grid components like generators and edges represent connections between nodes such as overhead power lines. We combine the capabilities of Long Short-Term Memory (LSTM) models with a Graph Isomorphism Network (GIN) in a hybrid model to pinpoint anomalies in the grid. We train our model on each category of nodes that serves a similar structural purpose to prevent overfitting of the model. We then assign each node in the graph a unique signature using a GIN. Our model achieved a 99.92% accuracy rate, which is significantly higher than a version of our model without structural encoding, which had an accuracy level of 97.30%. Our model allows us to capture structural and temporal components of power grids and develop an attack detection system with high accuracy without overfitting.
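The abstract only names the LSTM and GIN building blocks, so the following minimal PyTorch sketch shows one plausible way such a hybrid per-node anomaly scorer could be wired together; the layer sizes, the single GIN-style message-passing step, and the sigmoid scoring head are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class GINLSTMDetector(nn.Module):
    """Toy hybrid: a GIN-style message-passing step gives each node a structural
    signature, an LSTM models its readings over time, and a linear head scores
    each node as normal vs. anomalous."""

    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.gin_mlp = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (T, N, F) sensor readings over T timesteps for N nodes; adj: (N, N) adjacency.
        h = self.gin_mlp((1 + self.eps) * x + torch.einsum("ij,tjf->tif", adj, x))
        h = h.permute(1, 0, 2)                    # (N, T, hidden): one sequence per node
        _, (last, _) = self.lstm(h)               # final hidden state per node
        return torch.sigmoid(self.head(last[-1])).squeeze(-1)   # (N,) anomaly scores

# Example: scores = GINLSTMDetector(n_features=4)(torch.randn(24, 10, 4), torch.eye(10))
```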



Paperid:2695
Authors:Jinsun Jung, Hyeoneui Kim
College of Nursing, Seoul National University Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project, College of Nursing, Seoul National University Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project The Research Institute of Nursing Science
Abstract:
Explainable Artificial Intelligence (XAI), a promising future technology in the field of healthcare, has attracted significant interest. Despite ongoing efforts in the development of XAI approaches, there has been inadequate evaluation of explanation effectiveness and no standardized framework for the evaluation has been established. This study aims to examine the relationship between subjective interpretability and perceived plausibility for various XAI explanations and to determine the factors affecting users' acceptance of the XAI explanation.



Paperid:2696
Authors:Eul Ka, Seungeun Go, Minjin Kwak, JeongHun Kim, Aziz Nasridinov
Chungbuk National University, Chungbuk National University, Chungbuk National University, Bigdata Research Institute, Chungbuk National University, Chungbuk National University
Abstract:
Solar power generation has recently been in the spotlight as global warming continues to worsen. However, two significant problems may hinder solar power generation, considering that solar panels are installed outdoors. The first is soiling, which accumulates on solar panels, and the second is a decrease in sunlight owing to bad weather. In this paper, we demonstrate that solar power generation forecasting can improve when soiling and sunlight information are considered. We first introduce a dataset containing images of clean and soiled solar panels, sky images, and weather information. For accurate solar power generation forecasting, we propose a new multimodal model that aggregates various features related to weather, soiling, and sunlight. The experimental results demonstrated the high accuracy of our proposed multimodal model.



Paperid:2697
Authors:Inwon Kang, Parikshit Ram, Yi Zhou, Horst Samulowitz, Oshani Seneviratne
Rensselaer Polytechnic Institute, IBM Research, IBM Research, IBM Research, Rensselaer Polytechnic Institute
Abstract:
Data distillation is a technique for reducing a large dataset into a smaller dataset. The smaller dataset can then be used to train a model that performs comparably to a model trained on the full dataset. Past works have examined this approach for image datasets, focusing on neural networks as target models. However, tabular datasets pose new challenges not seen in images. A sample in a tabular dataset is a one-dimensional vector, unlike the two- (or three-) dimensional pixel grid of an image, and non-NN models such as XGBoost can often outperform neural network (NN) based models. Our contribution in this work is two-fold: 1) We show that data distillation methods from images do not translate directly to tabular data; 2) We propose a new distillation method that consistently outperforms the baseline for multiple different models, including non-NN models such as XGBoost.



Paperid:2698
Authors:Seung Woo Kang, Ohyun Jo
Chungbuk National University, Department of Computer Science, 28644, Cheongju, Republic of Korea, Chungbuk National University, Department of Computer Science, 28644, Cheongju, Republic of Korea
Abstract:
We present an imagification approach for multivariate time-series data tailored to constrained NN-based forecasting model training environments. Our imagification process consists of two key steps: Re-stacking and time embedding. In the Re-stacking stage, time-series data are arranged based on high correlation, forming the first image channel using a sliding window technique. The time embedding stage adds two additional image channels by incorporating real-time information. We evaluate our method by comparing it with three benchmark imagification techniques using a simple CNN-based model. Additionally, we conduct a comparison with LSTM, a conventional time-series forecasting model. Experimental results demonstrate that our proposed approach terminates model training three times faster while maintaining forecasting accuracy.
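As a rough illustration of the two steps described above, the sketch below builds a three-channel "image" from a multivariate window: a correlation-ordered re-stacking channel plus two time-embedding channels. The greedy correlation ordering and the sine/cosine time-of-day encoding are assumptions for illustration only, not the paper's exact procedure.

```python
import numpy as np

def imagify(window: np.ndarray, timestamps: np.ndarray) -> np.ndarray:
    """Illustrative 3-channel imagification of a multivariate time-series window.

    window:     (T, D) array of T timesteps and D variables.
    timestamps: (T,) array of epoch seconds for each timestep.
    Returns an image of shape (3, T, D).
    """
    # Channel 1 (re-stacking): order variables so highly correlated ones are adjacent,
    # via a greedy pass over the absolute correlation matrix.
    corr = np.abs(np.corrcoef(window.T))              # (D, D)
    order = [0]
    remaining = set(range(1, window.shape[1]))
    while remaining:
        nxt = max(remaining, key=lambda j: corr[order[-1], j])
        order.append(nxt)
        remaining.remove(nxt)
    ch1 = window[:, order]

    # Channels 2 and 3 (time embedding): encode time-of-day as sine/cosine maps
    # broadcast across the variable axis.
    seconds_in_day = 86400.0
    phase = 2 * np.pi * (timestamps % seconds_in_day) / seconds_in_day
    ch2 = np.tile(np.sin(phase)[:, None], (1, window.shape[1]))
    ch3 = np.tile(np.cos(phase)[:, None], (1, window.shape[1]))
    return np.stack([ch1, ch2, ch3], axis=0)
```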



Paperid:2699
Authors:Yash Kankariya, Suguman Bansal
Georgia Institute of Technology, Georgia Institute of Technology
Abstract:
Prior compositional methods for LTLf-to-DFA conversion have focused on improving the composition phase. In this work, we examine improvements to the decomposition phase that result in overall improvements in LTLf-to-DFA translation. Our work is based on reducing the structure of the underlying abstract syntax tree (AST) of a formula such that the new AST results in fewer composition operations.



Paperid:2700
Authors:Aditya Kasliwal, Aryan Kamani, Ishaan Gakhar, Pratinav Seth, Sriya Rallabandi
Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
Abstract:
Recent advances in image-to-image translation involve the integration of non-visual imagery in deep models. Non-visual sensors, although more costly, often produce low-resolution images. To combat this, methods using RGB images to enhance the resolution of these modalities have been introduced. Fusing these modalities to achieve high-resolution results demands models with millions of parameters and extended inference times. We present LaMAR, a lightweight model that employs Laplacian image pyramids combined with a low-resolution thermal image for guided thermal super resolution. By decomposing the RGB image into a Laplacian pyramid, LaMAR preserves image details and avoids high-resolution feature map computations, ensuring efficiency. With faster inference times and fewer parameters, our model demonstrates state-of-the-art results.
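To make the Laplacian-pyramid idea concrete, here is a minimal OpenCV sketch of naive guided upsampling, where RGB high-frequency detail from a Laplacian pyramid is injected into an upsampled low-resolution thermal image. The function name, the grayscale guide, and the fixed 0.1 detail weight are illustrative assumptions; LaMAR itself is a learned model rather than this hand-crafted transfer.

```python
import cv2
import numpy as np

def guided_thermal_upsample(rgb: np.ndarray, thermal_lr: np.ndarray, levels: int = 3) -> np.ndarray:
    """Toy guided upsampling: add RGB Laplacian detail onto an upsampled LR thermal image."""
    # Gaussian pyramid of the (grayscale) RGB guide.
    guide = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gauss = [guide]
    for _ in range(levels):
        gauss.append(cv2.pyrDown(gauss[-1]))

    # Start from the LR thermal image resized to the coarsest guide level, then
    # upsample level by level while injecting the guide's high-frequency detail.
    out = cv2.resize(thermal_lr.astype(np.float32),
                     (gauss[-1].shape[1], gauss[-1].shape[0]))
    for lvl in range(levels, 0, -1):
        size = (gauss[lvl - 1].shape[1], gauss[lvl - 1].shape[0])
        up = cv2.pyrUp(out, dstsize=size)
        lap = gauss[lvl - 1] - cv2.pyrUp(gauss[lvl], dstsize=size)   # Laplacian band of the guide
        out = up + 0.1 * lap   # small, fixed detail weight purely for illustration
    return out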



Paperid:2701
Authors:Jongseok Kim, Ohyun Jo
Chungbuk National University, Cheongju, Republic of Korea, Chungbuk National University, Cheongju, Republic of Korea
Abstract:
This work proposes and analyzes IncepSeqNet, a new model combining the Inception Module with an innovative Multi-Shape Augmentation technique. IncepSeqNet excels in feature extraction from sequence signal data consisting of complex numbers, achieving superior classification accuracy across various SNR (Signal-to-Noise Ratio) environments. Experimental results demonstrate that IncepSeqNet outperforms existing models, particularly at low SNR levels. Furthermore, we have confirmed its applicability to practical 5G systems by using real-world signal data.



Paperid:2702
Authors:Taeyoung Kim, Dongsoo Har
Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology
Abstract:
In multi-goal reinforcement learning with a sparse binary reward, training agents is particularly challenging due to a lack of successful experiences. To solve this problem, hindsight experience replay (HER) generates successful experiences even from unsuccessful ones. However, generating successful experiences from uniformly sampled ones is not an efficient process. In this paper, the impact of exploiting the property of achieved goals in generating successful experiences is investigated, and a novel cluster-based sampling strategy is proposed. The proposed sampling strategy groups episodes with different achieved goals by using a cluster model and samples experiences in the manner of HER to create the training batch. The proposed method is validated by experiments with three robotic control tasks of the OpenAI Gym. The results of the experiments demonstrate that the proposed method is substantially more sample-efficient and achieves better performance than baseline approaches.
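The sketch below illustrates the cluster-based sampling idea in its simplest form: cluster episodes by their achieved goals, then draw the replay batch roughly evenly across clusters before applying HER-style goal relabelling. The use of K-means, the cluster count, and the equal per-cluster quota are assumptions for illustration, not the paper's exact strategy.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_balanced_episode_indices(achieved_goals, batch_size, n_clusters=8, rng=None):
    """Sample episode indices so the training batch covers different achieved-goal clusters.

    achieved_goals: (N, goal_dim) array with the final achieved goal of each stored episode.
    Returns up to `batch_size` episode indices drawn roughly evenly across clusters;
    HER-style goal relabelling would then be applied to the selected transitions.
    """
    rng = rng or np.random.default_rng()
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(achieved_goals)
    per_cluster = max(1, batch_size // n_clusters)
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size:                      # skip empty clusters
            chosen.extend(rng.choice(members, size=per_cluster, replace=True))
    return np.asarray(chosen[:batch_size])
```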



Paperid:2703
Authors:Rui Kong, Chenyang Wu, Zongzhang Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China
Abstract:
Current policy gradient techniques excel in refining policies over sampled states but falter when generalizing to unseen states. To address this, we introduce Reinforcement Sampling (RS), a novel method leveraging a generalizable action value function to sample improved decisions. RS is able to improve the decision quality whenever the action value estimation is accurate. It works by improving the agent's decision on the fly on the states the agent is visiting. Compared with the historically experienced states in which conventional policy gradient methods improve the policy, the currently visited states are more relevant to the agent. Our method sufficiently exploits the generalizability of the value function on unseen states and sheds new light on the future development of generalizable reinforcement learning.



Paperid:2704
Authors:Ryan Koo, Yekyung Kim, Dongyeop Kang, Jaehyung Kim
University of Minnesota, Hyundai Motor Group, University of Minnesota, Korea Advanced Institute of Science & Technology
Abstract:
Detecting out-of-distribution (OOD) samples is crucial for robust NLP models. Recent works observe two OOD types: background shifts (style change) and semantic shifts (content change), but existing detection methods vary in effectiveness for each type. To this end, we propose Meta-Crafting, a unified OOD detection method that constructs a new discriminative feature space from 7 empirically chosen model-driven metadata features and detects both types of shifts well. Our experimental results demonstrate state-of-the-art robustness to both shifts and significantly improved detection on stress datasets.



Paperid:2705
Authors:Daya Kumar, Abhijith Sharma, Apurva Narayan
Western University, London, ON, Canada, University of British Columbia, Kelowna, BC, Canada, Western University, London, ON, Canada University of British Columbia, Kelowna, BC, Canada
Abstract:
Convolutional neural networks (CNNs) are being increasingly adopted in medical imaging. However, in the race to develop accurate models, their robustness is often overlooked. This raises significant concern given the safety-critical nature of the healthcare system. Here, we highlight the vulnerability of CNNs to a sporadic and naturalistic adversarial patch attack (SNAP). We train SNAP to mislead the ResNet50 model predicting metastasis in histopathological scans of lymph node sections, lowering the accuracy by 27%. This work emphasizes the need for defense strategies before deploying CNNs in critical healthcare settings.



Paperid:2706
Authors:Shivanand Kundargi, Tejas Anvekar, Ramesh Tabib, Uma Mudenagudi
Center of Excellence in Visual Intelligence (CEVI), KLE Technological University, Vidyanagar, Hubballi, Karnataka, India, Center of Excellence in Visual Intelligence (CEVI), KLE Technological University, Vidyanagar, Hubballi, Karnataka, India, Center of Excellence in Visual Intelligence (CEVI), KLE Technological University, Vidyanagar, Hubballi, Karnataka, India, Center of Excellence in Visual Intelligence (CEVI), KLE Technological University, Vidyanagar, Hubballi, Karnataka, India
Abstract:
Neural Radiance Fields (NeRF) have been extensively explored as a leading approach for modeling and representing 3D data across various domains. Their ability to capture arbitrary-scale point clouds and generate novel views makes them particularly valuable for digitizing cultural heritage sites. However, despite their impressive rendering capabilities, prior methods have often overlooked a significant real-world challenge: handling open-world scenarios characterized by unstructured data containing multiple classes in a single set of unlabeled images. To address this challenge, we propose a novel method, NCD-NeRF, that leverages Novel-Class Discovery to effectively tackle the complexities inherent in real-world data with unlabeled classes while excelling in producing high-quality NeRF representations. To validate our approach, we conducted a benchmarking analysis using a custom-collected dataset featuring UNESCO World Heritage sites in India. We observe that our proposed NCD-NeRF can discover novel classes in parallel while rendering high-quality 3D volumes.



Paperid:2707
Authors:Mu-Tien Kuo, Chih-Chung Hsueh, Richard Tzong-Han Tsai
Chingshin Academy Research Center for Humanities and Social Sciences, Academia Sinica, Chingshin Academy Research Center for Humanities and Social Sciences, Academia Sinica, Dept. of Computer Science and Engineering, National Central University, Taiwan Research Center for Humanities and Social Sciences, Academia Sinica
Abstract:
As Large Language Models (LLMs) become more prevalent in various fields, it is crucial to rigorously assess the quality of their explanations. Our research introduces a task-agnostic framework for evaluating free-text rationales, drawing on insights from both linguistics and machine learning. We evaluate two dimensions of explainability: fidelity and interpretability. For fidelity, we propose methods suitable for proprietary LLMs where direct introspection of internal features is unattainable. For interpretability, we use language models instead of human evaluators, addressing concerns about subjectivity and scalability in evaluations. We apply our framework to evaluate GPT-3.5 and the impact of prompts on the quality of its explanations. In conclusion, our framework streamlines the evaluation of explanations from LLMs, promoting the development of safer models.



Paperid:2708
Authors:Yue Lei, Bin Chen, Wenxin Tai, Ting Zhong, Fan Zhou
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kash Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kash Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kash Institute of Electronics and Information Industry
Abstract:
Recently, the field of Speech Enhancement has witnessed the success of diffusion-based generative models. However, these diffusion-based methods typically take multiple iterations to generate high-quality samples, leading to high computational costs and inefficiency. In this paper, we propose SDFEN (Shallow Diffusion for Fast spEech eNhancement), a novel approach for addressing the inefficiency problem while enhancing the quality of generated samples by reducing the iterative steps in the reverse process of the diffusion method. Specifically, we introduce a shallow diffusion strategy that initiates the reverse process with an adaptive time step to accelerate inference. In addition, a dedicated noisy predictor is further proposed to guide the adaptive selection of the time step. Experiment results demonstrate the superiority of the proposed SDFEN in effectiveness and efficiency.



Paperid:2709
Authors:Zhengyu Li, Curtis Bright, Vijay Ganesh
Georgia Institute of Technology, University of Windsor, Georgia Institute of Technology
Abstract:
The problem of finding the minimum three-dimensional Kochen–Specker (KS) vector system, an important problem in quantum foundations, has remained open for over 55 years. We present a new method to address this problem based on a combination of a Boolean satisfiability (SAT) solver and a computer algebra system (CAS). Our approach improved the lower bound on the size of a KS system from 22 to 24. More importantly, we provide the first computer-verifiable proof certificate of a lower bound to the KS problem with a proof size of 41.6 TiB for order 23. The efficiency is due to the powerful combination of SAT solvers and CAS-based orderly generation.



Paperid:2710
Authors:Jiaxin Liang, Junping Zhou, Minghao Yin
School of Information Science and Technology, Northeast Normal University, School of Information Science and Technology, Northeast Normal University, School of Information Science and Technology, Northeast Normal University Key Laboratory of Applied Statistics of MOE, Northeast Normal University, Changchun, China
Abstract:
The Diversified Top-k MaxSAT (DTKMS) problem is an extension of MaxSAT. The objective of DTKMS is to find k feasible assignments of a given formula, such that each assignment satisfies all hard clauses and the k assignments together satisfy the maximum number of soft clauses. This paper presents a local search algorithm, DTKMS-DIA, which incorporates a new approach to generating initial assignments. Experimental results indicate that DTKMS-DIA can achieve attractive performance on 826 instances compared with state-of-the-art solvers.



Paperid:2711
Authors:Hanyue Liu, Haonan Cheng, Long Ye
School of Information and Communication Engineering, Communication University of China, Beijing, China, State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China, State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China School of Data Science and Media Intelligence, Communication University of China, Beijing, China
Abstract:
Existing image-to-image translation methods perform less satisfactorily in the "day-night" domain due to insufficient study of scene features. To address this problem, we propose DNIT, which performs fine-grained handling of features through a nighttime image preprocessing (NIP) module and an edge fusion detection (EFD) module. The NIP module enhances brightness while minimizing noise, facilitating the extraction of content and style features. Meanwhile, the EFD module utilizes two types of edge images as additional constraints to optimize the generator. Experimental results show that we can generate more realistic and higher-quality images compared to other methods, proving the effectiveness of our DNIT.



Paperid:2712
Authors:Shixuan Liu, Chao Wang, Shuangyong Song
Australian National University, China Telecom Corporation Ltd. Data&AI Technology Company, China Telecom Corporation Ltd. Data&AI Technology Company
Abstract:
Pre-trained language models have shown high performance in text processing for intelligent customer service platforms. However, these models do not leverage domain-specific information. In this paper, we propose icsPLMs optimized for intelligent customer service at both the word and sentence levels. Our experimental results show that using targeted strategies can further improve the performance of pre-trained language models in this field.



Paperid:2713
Authors:Alexandru Lopotenco, Ian Tong Pan, Jack Zhang, Guan Xiong Qiao
University of Pennsylvania, University of Pennsylvania, University of Pennsylvania, University of Pennsylvania
Abstract:
Unsupervised learning methods such as principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and autoencoding are regularly used for dimensionality reduction in statistical learning. However, despite a pivot toward fairness and explainability in machine learning over the past few years, there have been few rigorous attempts toward a generalized framework of fair and explainable representation learning. Our paper explores the possibility of such a framework that leverages maximum mean discrepancy to remove information derived from a protected class from generated representations. For the optimization, we introduce a binary search component to optimize the Lagrangian coefficients. We present rigorous mathematical analysis and experimental results of our framework applied to t-SNE.
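For readers unfamiliar with maximum mean discrepancy (MMD), the sketch below shows the standard RBF-kernel estimator that could serve as the penalty term described above: it is minimized when the protected group's embeddings are indistinguishable from the rest. The kernel choice and bandwidth are illustrative assumptions, not necessarily those used in the paper.

```python
import torch

def rbf_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared maximum mean discrepancy between two embedding sets under an RBF kernel.

    x: (n, d) embeddings of samples from the protected group.
    y: (m, d) embeddings of samples from the complementary group.
    Adding this term to a representation-learning objective penalizes embeddings
    that let the two groups be told apart.
    """
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```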



Paperid:2714
Authors:Fangyuan Luo, Jun Wu
Beijing Jiaotong University, Beijing Jiaotong University
Abstract:
Hashing-based recommendation (HR) methods, whose core idea is mapping users and items into Hamming space, are common practice for improving item retrieval efficiency. However, existing HR methods fail to align the optimization objective (i.e., Bayesian Personalized Ranking) with the evaluation metric (i.e., Recall), leading to suboptimal performance. In this paper, we propose a smooth recall loss (termed SRLoss), which targets Recall as the optimization objective. Due to the existence of discrete constraints, the optimization problem is NP-hard. To this end, we propose an approximation-adjustable gradient estimator to solve our problem. Experimental results demonstrate the effectiveness of our proposed method.
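To illustrate what a differentiable Recall surrogate can look like, the sketch below relaxes the hard top-k membership indicator with sigmoids; this is a common relaxation written for illustration only and is not the paper's SRLoss or its approximation-adjustable gradient estimator.

```python
import torch

def smooth_recall_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor,
                       k: int, tau: float = 1.0) -> torch.Tensor:
    """Sigmoid-relaxed Recall@K surrogate for one user (illustrative only).

    pos_scores: (P,) scores of relevant items.
    neg_scores: (N,) scores of irrelevant items.
    """
    # Soft rank of each positive item = 1 + expected number of negatives scored above it.
    diffs = neg_scores.unsqueeze(0) - pos_scores.unsqueeze(1)   # (P, N)
    soft_rank = 1.0 + torch.sigmoid(diffs / tau).sum(dim=1)     # (P,)
    # Soft indicator that the positive lands inside the top k.
    in_top_k = torch.sigmoid((k - soft_rank) / tau)             # (P,)
    return 1.0 - in_top_k.mean()                                # minimize 1 - Recall@K
```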



Paperid:2715
Authors:Yuefeng Ma, Zhongchao He, Shumei Wang
Qufu Normal University, Qufu Normal University, Qufu Normal University
Abstract:
The proliferation of social media exacerbates information fragmentation, posing challenges to understanding public events. We address the problem of event reconstruction with a novel Multi-view Contrast Event Reconstruction (MCER) model. MCER maximizes feature similarity between different views of the same event using contrastive learning, while minimizing mutual information between distinct events. This aggregates fragmented views to reconstruct comprehensive event representations. MCER employs momentum and weight-sharing encoders in a three-tower architecture with a supervised contrastive loss for multi-view representation learning. Due to the scarcity of multi-view public datasets, we construct a new Mul-view-data benchmark. Experiments demonstrate MCER's superior performance on public data and our Mul-view-data, significantly outperforming self-supervised methods by incorporating supervised contrastive techniques. MCER advances multi-view representation learning to counter information fragmentation and enable robust event understanding.
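Since the abstract cites a supervised contrastive loss, the sketch below shows the standard form of that loss over a batch of view embeddings labeled by event id; it is a generic reference implementation, not MCER's three-tower training objective.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """Standard supervised contrastive loss over L2-normalized view embeddings.

    z:      (B, d) embeddings of the views in a batch.
    labels: (B,) event id of each view; views sharing an event id are positives.
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                                      # (B, B) scaled similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))            # never contrast with yourself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Average the log-probability over each anchor's positives; skip anchors without any.
    per_anchor = torch.where(pos, log_prob, torch.zeros_like(log_prob)).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor[pos.any(1)].mean()
```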



Paperid:2716
Authors:Wei Mao, Guihong Wan, Haim Schweitzer
The University of Texas at Dallas, Harvard University, The University of Texas at Dallas
Abstract:
Spectral clustering is a powerful clustering technique. It leverages the spectral properties of graphs to partition data points into meaningful clusters. The most common criterion for evaluating multiway spectral clustering is NCut. Column subset selection is an important optimization technique in feature selection and dimension reduction, which aims to identify a subset of columns of a given data matrix that can be used to approximate the entire matrix. We show that column subset selection can be used to compute spectral clustering and use this to obtain new graph clustering algorithms.



Paperid:2717
Authors:Michał Maras, Michał Kępa, Jakub Kowalski, Marek Szykuła
University of Wrocław, University of Wrocław, University of Wrocław, University of Wroclaw
Abstract:
We develop a method of adapting the AlphaZero model to General Game Playing (GGP) that focuses on faster model generation and requires less knowledge to be extracted from the game rules. The dataset generation uses MCTS playing instead of self-play; only the value network is used, and attention layers replace the convolutional ones. This allows us to abandon any assumptions about the action space and board topology. We implement the method within the Regular Boardgames GGP system and show that we can efficiently build models outperforming the UCT baseline for most games.



Paperid:2718
Authors:Josué Martínez-Martínez, Olivia Brown, Rajmonda Caceres
University of Connecticut MIT Lincoln Laboratory, MIT Lincoln Laboratory, MIT Lincoln Laboratory
Abstract:
This research focuses on improving the robustness of machine learning systems to natural variations and distribution shifts. A design trade space is presented, and various methods are compared, including adversarial training, data augmentation techniques, and novel approaches inspired by model-based robust optimization formulations.



Paperid:2719
Authors:Vladimir Mashurov, Natalia Semenova
Sberbank PJSC ITMO University, AIRI Sberbank PJSC
Abstract:
Topological and node noise filtration are typically considered separately. Graph Neural Networks (GNN) are commonly used for node noise filtration, as they offer high efficiency and low exploitation costs. This paper explores the solution of joint node and topological noise filtration through the use of graph neural networks. Since treating a 3D mesh as a graph is challenging, an indicator function grid representation is employed as input for GNNs to perform the joint filtering. The resulting machine learning model is inspired by point cloud to mesh reconstruction algorithms and demonstrates low computational requirements during inference, producing successful results for smooth, watertight 3D models.



Paperid:2720
Authors:Woohyeon Moon, Sarvar Nengroo, Taeyoung Kim, Jihui Lee, Seungah Son, Dongsoo Har
Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology
Abstract:
Optical character recognition (OCR) is the technology to identify text characters embedded within images. Conventional OCR models exhibit performance degradation on noisy images. To solve this problem, we propose a novel model that combines computer vision using an optical sensor with natural language processing via bidirectional encoder representations from transformers (BERT) and cosine similarity scoring. The proposed model uses a confidence rate to determine whether to utilize the optical sensor alone or BERT/cosine similarity scoring combined with the optical sensor. Experimental results show that the proposed model performs approximately 4.34 times better than conventional OCR.



Paperid:2721
Authors:Nikita Morozov, Artur Ignatiev, Yuriy Dementiev
Constructor University, HSE University, HSE University
Abstract:
Fair division is a topic with significant social and industrial value. In this work, we study allocations that simultaneously satisfy definitions of fairness and efficiency: EFx and PO. First, we prove that the problem of finding such allocations is NP-hard for two agents. Then, we propose a concept for an ILP-based solving algorithm whose running time depends on the number of EFx allocations. We generate input data and analyze the algorithm's running time based on the results obtained.



Paperid:2722
Authors:Christine Mwase, Albert Njoroge Kahira, Zhuo Zou
Fudan University, Julich Supercomputing Center, Fudan University
Abstract:
Deep learning (DL), despite its success in various fields, remains expensive and inaccessible to many due to its need for powerful supercomputing and high-end GPUs. This study explores alternative computing infrastructure and methods for distributed DL on low-energy, low-cost devices. We experiment on Raspberry Pi 4 devices with ARM Cortex-A72 processors and train a ResNet-18 model on the CIFAR-10 dataset. Our findings reveal limitations and opportunities for future optimizations, paving the way for a DL toolset for low-energy edge devices.



Paperid:2723
Authors:Atharva Naik, Yash Parag Butala, Navaneethan Vaikunthan, Raghav Kapoor
Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University, Carnegie Mellon University
Abstract:
When humans are posed with a difficult problem, they often approach it by identifying key skills, honing them, and finally effectively combining them. We propose a novel method and apply it to the VizWiz VQA task to predict the visual skills needed to answer a question, and leverage expert modules to produce intermediary outputs and fuse them in a skill-aware manner. Unlike prior works in visual question-answering (VQA) that use intermediate outputs such as detected objects and Optical Character Recognition (OCR), our approach explicitly guides the model with a skill embedding on what to focus on. While our results show that skill-aware fusion outperforms skill-unaware models for only a subset of questions, we believe our results provide interesting directions for future work. We also release our code, model, and illustrative demonstrations for future research purposes.



Paperid:2724
Authors:Dexter Neo
National University of Singapore
Abstract:
We present a new, simple, and effective loss function for calibrating graph neural networks (GNNs). Miscalibration is the problem whereby a model's probabilities do not reflect its correctness, making the model difficult and possibly dangerous to deploy in the real world. We compare our method against other baselines on a novel in-distribution (ID) and out-of-distribution (OOD) graph form of the Celeb-A faces dataset. Our findings show that our method improves calibration for GNNs, which are not immune to miscalibration either ID or OOD. Our code is available for review at https://github.com/dexterdley/CS6208/tree/main/Project.



Paperid:2725
Authors:Tzeh Yuan Neoh, Nicholas Teh
University of Oxford, University of Oxford
Abstract:
We study the computational problems associated with maximizing various welfare objectives (namely utilitarian welfare, egalitarian welfare, and Nash welfare) in perpetual voting, a sequential collective decision-making framework. Prior work looks into notions of fairness over time and studies extensions of single-round voting rules to the multi-round setting. We show that while a utilitarian-welfare-maximizing outcome can be computed efficiently, an outcome that maximizes egalitarian or Nash welfare is computationally intractable, even in the case of two candidates. We complement this by showing that maximizing egalitarian welfare is fixed-parameter tractable in the number of agents, and maximizing egalitarian or Nash welfare is W[2]-hard and slicewise polynomial in the number of timesteps. We also provide an approximation algorithm for maximizing egalitarian welfare and study strategyproofness with respect to these welfare objectives. Finally, we show that a simple greedy algorithm can achieve approximate proportionality in this setting.
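The tractability of the utilitarian objective is easy to see: utilitarian welfare decomposes across rounds, so picking the most-approved candidate in each round is optimal. The minimal sketch below illustrates that per-round greedy choice (the tie-breaking rule and the approval-ballot format are assumptions for illustration); egalitarian and Nash welfare do not decompose this way, which is why they are harder.

```python
from collections import Counter

def utilitarian_outcome(profile):
    """Pick, in each round, the candidate approved by the most agents.

    profile: list of rounds; each round is a list of approval sets (one per agent).
    """
    outcome = []
    for approvals in profile:
        counts = Counter(c for ballot in approvals for c in ballot)
        outcome.append(counts.most_common(1)[0][0])   # arbitrary tie-breaking
    return outcome

# Two rounds, three agents with approval ballots:
print(utilitarian_outcome([[{"a"}, {"a"}, {"b"}],
                           [{"b"}, {"b"}, {"a"}]]))   # -> ['a', 'b']
```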



Paperid:2726
Authors:Simin Niu, Xun Liang, Sensen Zhang, Shichao Song, Xuan Zhang, Xiaoping Zhou
Renmin University of China, Renmin University of China, Renmin University of China, Renmin University of China, Renmin University of China Peking University, Beijing University of Civil Engineering and Architecture
Abstract:
Cross-domain Graph Meta-learning (CGML) has shown its promise, where meta-knowledge is extracted from few-shot graph data in multiple relevant but distinct domains. However, several recent efforts assume that target data are available, which commonly does not hold in practice. In this paper, we devise a novel Cross-domain Data Augmentation for Graph Meta-Learning (CDA-GML) method, which combines the strengths of CGML and data augmentation and simultaneously addresses the intractable shortcomings of label sparsity, domain shift, and the absence of target data. Specifically, our method simulates instance-level and task-level domain shift to alleviate the cross-domain generalization issue in conventional graph meta-learning. Experiments show that our method outperforms the existing state-of-the-art methods.



Paperid:2727
Authors:Aleksander Obuchowski, Barbara Klaudel, Piotr Frąckowski, Sebastian Krajna, Wasyl Badyra, Michał Czubenko, Zdzisław Kowalczuk
Polish-Japanese Academy of Information Technology, Gdańsk University of Technology, Gdańsk University of Technology, Gdańsk University of Technology, Gdańsk University of Technology, Gdańsk University of Technology, Gdańsk University of Technology
Abstract:
The population characteristics of datasets related to the same task may vary significantly, and merging them may harm performance. In this paper, we propose a novel method of domain adaptation called "cross-adaptation". It allows for implicit adaptation to the target domain without the need for any labeled examples from this domain. We test our approach on 9 datasets for SARS-CoV-2 detection from complete blood counts from different hospitals around the world. Results show that our solution is universal with respect to various classification algorithms and allows for up to a 10 pp increase in F1 score on average.



Paperid:2728
Authors:James Oswald, Kavitha Srinivas, Harsha Kokel, Junkyu Lee, Michael Katz, Shirin Sohrabi
Rensselaer Polytechnic Institute, IBM Research, IBM Research, IBM Research, IBM Research, IBM Research
Abstract:
The creation of planning models, and in particular domain models, is among the last bastions of tasks that require extensive manual labor in AI planning; it is desirable to simplify this process for the sake of making planning more accessible. To this end, we investigate whether large language models (LLMs) can be used to generate planning domain models from textual descriptions. We propose a novel task for this as well as a means of automated evaluation for generated domains by comparing the sets of plans for domain instances. Finally, we perform an empirical analysis of 7 large language models, including coding and chat models, across 9 different planning domains. Our results show that LLMs, particularly larger ones, exhibit some level of proficiency in generating correct planning domains from natural language descriptions.



Paperid:2729
Authors:Dylan Pallickara, Sarath Sreedharan
Poudre High School, Fort Collins, CO, Colorado State University, Fort Collins, CO
Abstract:
We describe our methodology for classifying ASL (American Sign Language) gestures. Rather than operate directly on raw images of hand gestures, we extract coordinates and render wireframes from individual images to construct a curated training dataset. This dataset is then used in a classifier that is memory efficient and provides effective performance (94% accuracy). Because we construct wireframes that contain information about several angles in the joints that comprise hands, our methodology is amenable to training those interested in learning ASL by identifying targeted errors in their hand gestures.



Paperid:2730
Authors:Suraj Pandey
Indian Institute of Technology Guwahati
Abstract:
Evolutionary Algorithms (EA) have been leveraged to tackle challenges faced while using GANs, such as mode collapse, vanishing gradient, and latent space search. However, the existing techniques of using EA with GANs operate backpropagation and EA in isolation from each other, leaving ample room for further exploration. This paper creates a collaborative bridge between EA and GANs by exploring a neuroevolution method for utilising both EA and backpropagation-based optimisation simultaneously for a multi-generator GAN architecture. Experiments conducted using a standard dataset with variants of the proposed method highlight the substantial impact of each of the components involved in the proposed method.



Paperid:2731
Authors:Shikang Pang, Chunjing Xiao, Wenxin Tai, Zhangtao Cheng, Fan Zhou
Henan University, Henan University, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Graph anomaly detection has gained significant research interest across various domains. Due to the lack of labeled data, contrastive learning has been applied to detecting anomalies, and contrastive strategies at various scales have been proposed. However, these methods might force two instances (e.g., node-level and subgraph-level representations) with different category labels to be consistent during model training, which can adversely impact model robustness. To tackle this problem, we present DEGAD, a novel contrastive learning framework with a Diffusion model-based graph Enhancement module for Graph Anomaly Detection. In this framework, we design a diffusion model-based graph enhancement module that manipulates neighbors to generate enhanced graphs, which can efficiently alleviate the inconsistency problem. Further, based on the enhanced graphs, we present a multi-scale contrastive module to discriminate anomalies. Experimental results demonstrate the superiority of our model.



Paperid:2732
Authors:Bumgeun Park, Taeyoung Kim, Quoc-Vinh Lai-Dang, Dongsoo Har
Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology, Korea Advanced Institute of Science and Technology
Abstract:
Efficient exploration is challenging for an agent in reinforcement learning (RL). In this paper, a novel actor-critic framework, namely virtual action actor-critic (VAAC), is proposed to address the challenge of efficient exploration in RL. This work is inspired by humans' ability to imagine the potential outcomes of their actions without actually taking them. In order to emulate this ability, VAAC introduces a new actor called the virtual actor (VA), alongside the conventional actor-critic framework. Unlike the conventional actor, the VA takes the virtual action to anticipate the next state without interacting with the environment. With the virtual policy following a Gaussian distribution, the VA is trained to maximize the anticipated novelty of the subsequent state resulting from a virtual action. If no next state resulting from the available actions exhibits high anticipated novelty, training the VA leads to an increase in the virtual policy entropy. Hence, high virtual policy entropy indicates that there is no room for exploration. The proposed VAAC aims to maximize a modified Q function, which combines cumulative rewards and the negative sum of virtual policy entropy. Experimental results show that VAAC improves exploration performance compared to existing algorithms.



Paperid:2733
Authors:June-Young Park, Jae-Ryung Hong, Min-Hye Kim, Tae-Joon Kim
Department of Convergence Healthcare Medicine, Graduate School of Ajou University, Department of Computer Science, Ewha Womans University, Department of Neurology, Ajou University School of Medicine, Department of Neurology, Ajou University School of Medicine
Abstract:
Anomaly detection is a critical task across various domains. Fundamentally, anomaly detection models offer methods to identify unusual patterns that do not align with expected behaviors. Notably, in the medical field, detecting anomalies in medical imagery or biometrics can facilitate early diagnosis of diseases. Consequently, we propose the SkipGANomaly++ model, an enhanced and more efficient version of the conventional anomaly detection models. The proposed model's performance was evaluated through comparative experiments. Experimental results demonstrated superior performance across most classes compared to the previous models.



Paperid:2734
Authors:Morgan Payette, Charlotte Curtis
Mount Royal University, Mount Royal University
Abstract:
Rendering of complex scenes from software such as Blender is time-consuming, but corresponding auxiliary data such as depth or object segmentation maps are relatively fast to generate. The auxiliary data also provides a wealth of information for tasks such as optical flow prediction. In this paper we present the QuickRender dataset, a collection of procedurally generated scenes rendered into over 5,000 sequential image triplets along with accompanying auxiliary data. The goal of this dataset is to provide a diversity of scenes and motion while maintaining realistic behaviours. A sample application using this dataset to perform single-image super resolution is also presented. The dataset and related source code can be found at https://github.com/MPmtroyal/MetaSRGAN.



Paperid:2735
Authors:Jiaming Pei, Wei Li, Lukun Wang
University of Sydney, University of Sydney, Shandong University of Science and Technology
Abstract:
Communication overhead remains a significant challenge in federated learning due to frequent global model updates. Essentially, the update of the global model can be viewed as knowledge transfer. We aim to transfer more knowledge through a compact model while reducing communication overhead. In our study, we introduce a federated learning framework where clients pre-train large models locally and the server initializes a compact model for communication. This compact model should be light in size but still carry enough knowledge to refine the global model effectively. We facilitate the knowledge transfer from local to global models based on the pre-training outcomes. Our experiments show that our approach significantly reduces communication overhead without sacrificing accuracy.



Paperid:2736
Authors:Magdalena Proszewska, Marcin Mazur, Tomasz Trzciński, Przemysław Spurek
University of Edinburgh, Jagiellonian University in Kraków, Warsaw University of Technology IDEAS NCBR Tooploox, Jagiellonian University in Kraków
Abstract:
Implicit field representations offer an effective way of generating 3D object shapes. They leverage an implicit decoder (IMNET) trained to take a 3D point coordinate concatenated with a shape encoding and to output a value indicating whether the point is outside the shape. This approach enables the efficient rendering of visually plausible objects but also has some significant limitations, resulting in a cumbersome training procedure and empty spaces within the rendered mesh. In this paper, we introduce a new HyperCube architecture based on interval arithmetic that enables direct processing of 3D voxels, trained using a hypernetwork paradigm to enforce model convergence. The code is available at https://github.com/mproszewska/hypercube.



Paperid:2737
Authors:Md Maklachur Rahman, Tracy Hammond
Texas A&M University, College Station, TX, USA, Texas A&M University, College Station, TX, USA
Abstract:
Despite Siamese trackers’ substantial potential, they offer suboptimal tracking performance in low-resolution (LR) contexts. We introduce a Random Noise Salient Feature Fusion Learning Network to address this issue. This method integrates random noise-infused feature maps into a similarity-learning matching model. This integration acts as an effective regularization technique, enhancing the network’s generalization capabilities in LR environments. Additionally, by integrating attention mechanisms, we enhance the discriminative ability of the network, assigning more weight to important features. This directs the network’s focus toward the most salient regions of the feature map, ensuring improved accuracy without a significant increase in parameter overhead, and maintaining a high operating speed. To validate the effectiveness of our method, we performed qualitative and quantitative comparisons with state-of-the-art (SOTA) trackers.



Paperid:2738
Authors:Aryaman Rao, Parth Singh, Dinesh Kumar Vishwakarma, Mukesh Prasad
Delhi Technological University, Delhi Technological University, Delhi Technological University, University of Technology Sydney
Abstract:
Influence Maximization is the task of selecting optimal nodes to maximize the influence spread in social networks. This study proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks. By discretizing meta-heuristic algorithms and infusing them with quantum-inspired enhancements, we address issues like premature convergence and low efficacy. The proposed method, guided by quantum principles, offers a promising solution for Influence Maximization. Experiments on four real-world datasets reveal DQSSA's superior performance as compared to established cutting-edge algorithms.



Paperid:2739
Authors:Célian Ringwald, Fabien Gandon, Catherine Faron, Franck Michel, Hanna Abi Akl
Université Côte d’Azur, Inria, CNRS, I3S, Université Côte d’Azur, Inria, CNRS, I3S Data ScienceTech Institute, Université Côte d’Azur, Inria, CNRS, I3S, Université Côte d’Azur, Inria, CNRS, I3S, Université Côte d’Azur, Inria, CNRS, I3S Data ScienceTech Institute
Abstract:
Seq-to-seq generative models recently gained attention for solving the relation extraction task. By approaching this problem as an end-to-end task, they surpassed encoder-based-only models. Little research has investigated the effects of the output syntaxes on the training process of these models. Moreover, a limited number of approaches were proposed for extracting ready-to-load knowledge graphs following the RDF standard. In this paper, we consider that a set of triples can be linearized in many different ways, and we evaluate the combined effect of the size of the language models and different RDF syntaxes on the task of relation extraction from Wikipedia abstracts.
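To make the notion of "different linearizations of the same triples" concrete, the sketch below serializes one small triple set in two RDF syntaxes using rdflib; the example IRIs are hypothetical and the syntaxes shown (N-Triples and Turtle) are just two of the options such a study could compare as seq-to-seq output targets.

```python
from rdflib import Graph, Literal, URIRef

# The same fact set can be linearized in several RDF syntaxes, giving the
# seq-to-seq model differently shaped target strings for identical content.
g = Graph()
ex = "http://example.org/"
g.add((URIRef(ex + "Ada_Lovelace"), URIRef(ex + "birthPlace"), URIRef(ex + "London")))
g.add((URIRef(ex + "Ada_Lovelace"), URIRef(ex + "occupation"), Literal("mathematician")))

print(g.serialize(format="nt"))      # one triple per line, fully spelled-out IRIs
print(g.serialize(format="turtle"))  # prefixed and grouped by subject: a shorter target string
```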



Paperid:2740
Authors:Syed Sameen Ahmad Rizvi, Aryan Seth, Pratik Narang
Birla Institute of Technology & Science, Pilani, Birla Institute of Technology & Science, Pilani, Birla Institute of Technology & Science, Pilani
Abstract:
Facial Expression Recognition (FER) is an extensively explored research problem in the domain of computer vision and artificial intelligence. FER, a supervised learning problem, requires significant training data representative of multiple socio-cultural demographic attributes. However, most FER datasets consist of images annotated by humans, which propagates individual and demographic biases. This work attempts to mitigate this bias using representation learning based on latent spaces, thereby increasing a deep learning model's fairness and overall accuracy.



Paperid:2741
Authors:Brandon Rozek, Junkyu Lee, Harsha Kokel, Michael Katz, Shirin Sohrabi
Rensselaer Polytechnic Institute, IBM Research AI, IBM Research AI, IBM Research AI, IBM Research AI
Abstract:
Partially observable Markov decision processes (POMDPs) challenge reinforcement learning agents due to incomplete knowledge of the environment. Even assuming monotonicity in uncertainty, it is difficult for an agent to know how and when to stop exploring for a given task. In this abstract, we discuss how to use hierarchical reinforcement learning (HRL) and AI Planning (AIP) to improve exploration when the agent knows possible valuations of unknown predicates and how to discover them. By encoding the uncertainty in an abstract planning model, the agent can derive a high-level plan which is then used to decompose the overall POMDP into a tree of semi-POMDPs for training. We evaluate our agent's performance on the MiniGrid domain and show how guided exploration may improve agent performance.



Paperid:2742
Authors:Jeremy Rutter, Maneesh Reddy Chamakura, Justin Delgado, Gene Louis Kim
University of South Florida, University of South Florida, University of South Florida, University of South Florida
Abstract:
Our work explores bridging the gap between large language models and text-to-image models to create a tool for quickly and easily generating high-quality images from a given concept. In our experiments, we successfully improved image quality with only preliminary use of the available resources for fine-tuning.



Paperid:2743
Authors:Sehyun Ryu, Hosung Joo, Jonggyu Jang, Hyun Jong Yang
POSTECH, POSTECH, POSTECH, POSTECH
Abstract:
Recent research has shown a growing interest in per-instance differential privacy (pDP), highlighting the fact that each data instance within a dataset may incur a distinct level of privacy loss. However, conventional additive noise mechanisms apply identical noise to all query outputs, thereby deteriorating data statistics. In this study, we propose an instance-wise Laplace mechanism, which adds non-identical Laplace noise to the query output for each data instance. A challenge arises from the complex interaction of additive noise: the noise introduced for individual instances affects the pDP of other instances, making the problem resistant to straightforward solutions. To tackle this problem, we introduce an instance-wise Laplace mechanism algorithm via deep reinforcement learning and validate its ability to better preserve data statistics on a real dataset, compared to the original Laplace mechanism.
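The core mechanical difference from the classic Laplace mechanism is easy to show: instead of one common noise scale, each instance's output gets its own. The sketch below illustrates that per-instance noising only; how the scales are chosen (the paper's reinforcement-learning part) is not shown, and the per-instance sensitivity and budget arrays are illustrative inputs.

```python
import numpy as np

def instancewise_laplace(values: np.ndarray,
                         sensitivities: np.ndarray,
                         epsilons: np.ndarray,
                         rng=None) -> np.ndarray:
    """Add separately scaled Laplace noise to each per-instance query output.

    values:        (N,) true per-instance query outputs.
    sensitivities: (N,) sensitivity attributed to each instance.
    epsilons:      (N,) per-instance privacy budgets.
    The classic Laplace mechanism would use one common scale; here instance i
    receives noise with scale sensitivities[i] / epsilons[i].
    """
    rng = rng or np.random.default_rng()
    scales = sensitivities / epsilons
    return values + rng.laplace(loc=0.0, scale=scales, size=values.shape)
```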



Paperid:2744
Authors:Richard Sances, Olivera Kotevska, Paul Laiu
Virginia Polytechnic Institute and State University, Oak Ridge National Laboratory, Oak Ridge National Laboratory
Abstract:
As data privacy issues grow, finding the best privacy preservation algorithm for each situation is increasingly essential. This research focuses on understanding frequency oracle (FO) privacy preservation algorithms, which estimate the frequency of any value in the domain. The aim is to explore how each can be best used and to recommend which one to use with which data type. We experimented with different data scenarios and federated learning settings. Results showed clear guidance on when to use a specific algorithm.



Paperid:2745
Authors:Ryan Schuerkamp, Xinyu Li, Brian Kunzer, Leonard S. Weiss, Hernando Gómez, Francis X. Guyette, Michael R. Pinsky, Artur Dubrawski
Auton Lab, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA College of Engineering and Computing, Miami University, Oxford, OH 45056, USA, Auton Lab, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA, Auton Lab, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA, Auton Lab, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Abstract:
Fluid resuscitation is an initial treatment frequently employed to treat shock, restore lost blood, protect tissues from injury, and prevent organ dysfunction in critically ill patients. However, it is not without risk (e.g., overly aggressive resuscitation may cause organ damage and even death). We leverage machine learning models trained to assess sufficiency of resuscitation in laboratory animals subjected to induced hemorrhage and transfer them to use with human trauma patients. Our key takeaway is that animal experiments and models can inform human healthcare, especially when human data is limited or when collecting relevant human data via potentially harmful protocols is unfeasible.



Paperid:2746
Authors:Devan Shah, Ruoqi Huang, Tingting Chen, Murtuza Jadliwala
Princeton University, California State Polytechnic University, Pomona, California State Polytechnic University, Pomona, University of Texas, San Antonio
Abstract:
Current practice of mobility scooter user authentication using physical keys and traditional password-based one-time security mechanisms cannot meet the needs of many mobility scooter riders, especially senior citizens who have difficulty recalling passwords. Seamless authentication approaches are now needed to provide ongoing protection for mobility scooters against takeovers and unauthorized access. Existing continuous authentication techniques do not work well in a mobility scooter setting due to issues such as user comfort, deployment cost and enrollment time, among others. In that direction, our contributions in this research effort are two-fold: (i) we propose a novel system that incorporates advances in few-shot learning, hierarchical processing, and contextual embedding to establish continuous authentication for mobility scooter riders using only posture data. This security system, trained on data collected from real mobility scooter riders, demonstrates quick enrollment and easy deployability, while successfully serving as an unobtrusive first layer of security. (ii) we provide to the research community the largest publicly available repository of mobility scooter riders' body key-points data to enable further research in this direction.



Paperid:2747
Authors:Takumu Shimizu, Ryota Higa, Katsuhide Fujita, Shinji Nakadai
Tokyo University of Agriculture and Technology National Institute of Advanced Industrial Science and Technology, National Institute of Advanced Industrial Science and Technology NEC Data Science Research Laboratories, Tokyo University of Agriculture and Technology National Institute of Advanced Industrial Science and Technology, Intent Exchange, Inc. NEC Corporation
Abstract:
We propose an automated negotiation approach that enables a reinforcement learning agent to adapt to unexpected situations, such as demand changes, in supply chain management (SCM). Existing studies that consider reinforcement learning and SCM assume a centralized environment where the coordination of chain components is hierarchical rather than through negotiations between agents. This study focused on a negotiation agent that uses the value function of reinforcement learning for SCM as its utility function in automated negotiation. We demonstrated that the proposed approach could avoid inventory shortages under increased demand requests from the terminal customer.



Paperid:2748
Authors:Wenzheng Shu, Yanlong Huang, Wenxin Tai, Zhangtao Cheng, Bei Hui, Goce Trajcevski
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China, Iowa State University
Abstract:
Trip recommendation aims to plan users' travel based on their specified preferences. Traditional heuristic and statistical approaches often fail to capture the intricate nuances of user intentions, leading to subpar performance. Recent deep-learning methods show attractive accuracy but struggle to generate faithful trajectories that match user intentions. In this work, we propose a DDPM-based incremental knowledge injection module to ensure the faithfulness of the generated trajectories. Experiments on two datasets verify the effectiveness of our approach.



Paperid:2749
Authors:Akshit Singh
Indian Institute of Technology, Jodhpur, India
Abstract:
Generative Artificial Intelligence (AI) has garnered significant attention for its remarkable ability to generate text, images, and other forms of content. However, an inherent and increasingly concerning issue within generative AI systems is bias. These AI models often exhibit an Anglocentric bias and tend to overlook the importance of diversity. This can be attributed to their training on extensive datasets sourced from the internet, which inevitably inherit the biases present in those data sources. Employing these datasets leads to AI-generated content that mirrors and perpetuates existing biases, encompassing various aspects such as gender, ethnic and cultural stereotypes. Addressing bias in generative AI is a complex challenge that necessitates substantial efforts. In order to tackle this issue, we propose a methodology for constructing moderately sized datasets with a social inclination. These datasets can be employed to rectify existing imbalances in datasets or to train models to generate socially inclusive material. Additionally, we present preliminary findings derived from training our model on these socially inclined datasets.



Paperid:2750
Authors:Abhishek Sinha, Himanshi Tibrewal, Mansi Gupta, Nikhar Waghela, Shivank Garg
Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India - 247667, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India - 247667, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India - 247667, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India - 247667, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India - 247667
Abstract:
In this evolving era of machine learning security, membership inference attacks have emerged as a potent threat to the confidentiality of sensitive data. In this attack, adversaries aim to determine whether a particular point was used during the training of a target model. This paper proposes a new method to gauge a data point's membership in a model's training set. Instead of correlating loss with membership, as is traditionally done, we leverage the fact that training examples generally exhibit higher confidence values when classified into their actual class. During training, the model is essentially being 'fit' to the training data and may face particular difficulties in generalizing to unseen data. This asymmetry leads to the model achieving higher confidence on the training data as it exploits the specific patterns and noise present in that data. Our proposed approach leverages the confidence values generated by the machine-learning model. These confidence values provide a probabilistic measure of the model's certainty in its predictions and can be used to infer the membership of a given data point. Additionally, we introduce another variant of our method that carries out this attack without knowing the ground truth (true class) of a given data point, thus offering an edge over existing label-dependent attack methods.
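
A minimal sketch of the confidence-thresholding idea described above, assuming access to the target model's softmax outputs; the threshold value is a placeholder, and the label-free variant simply falls back to the maximum confidence over classes.

```python
import numpy as np

def confidence_attack(probs, true_labels=None, threshold=0.9):
    """Predict 'member' when the model's confidence exceeds a threshold.

    probs: (n_samples, n_classes) softmax outputs of the target model.
    true_labels: if given, use confidence in the true class (label-dependent
    variant); otherwise use the maximum confidence (label-free variant).
    """
    probs = np.asarray(probs, dtype=float)
    if true_labels is not None:
        conf = probs[np.arange(len(probs)), true_labels]
    else:
        conf = probs.max(axis=1)
    return conf >= threshold  # True -> inferred training member

# Toy example: the first point is classified very confidently, the second is not.
softmax_outputs = [[0.97, 0.02, 0.01], [0.40, 0.35, 0.25]]
print(confidence_attack(softmax_outputs))                      # label-free variant
print(confidence_attack(softmax_outputs, true_labels=[0, 1]))  # label-dependent
```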



Paperid:2751
Authors:Jan Sobotka, Petr Šimánek
Faculty of Information Technology, Czech Technical University in Prague, Faculty of Information Technology, Czech Technical University in Prague
Abstract:
Modern machine learning heavily relies on optimization, and as deep learning models grow more complex and data-hungry, the search for efficient learning becomes crucial. Learned optimizers disrupt traditional handcrafted methods such as SGD and Adam by learning the optimization strategy itself, potentially speeding up training. However, the learned optimizers' dynamics are still not well understood. To remedy this, our work explores their optimization trajectories from the perspective of network architecture symmetries and the distributions of proposed parameter updates.



Paperid:2752
Authors:William St-Arnaud, Margarida Carvalho, Golnoosh Farnadi
Université de Montréal Mila, Université de Montréal, Université de Montréal McGill University Mila
Abstract:
Generative Flow Networks, known as GFlowNets, have been introduced recently, presenting an exciting possibility for neural networks to model distributions across various data structures. In this paper, we broaden their applicability to encompass scenarios where the data structures are optimal solutions of a combinatorial problem. Concretely, we propose the use of GFlowNets to learn the distribution of optimal solutions for kidney exchange problems (KEPs), a generalized form of matching problems involving cycles.



Paperid:2753
Authors:Linpeng Sun, Victor S. Sheng
Texas Tech University, Texas Tech University
Abstract:
With the help of Vision Transformers (ViTs), medical image segmentation has achieved outstanding performance. In particular, ViTs overcome the limitation of convolutional neural networks (CNNs), which rely on local receptive fields, by using self-attention mechanisms to consider relationships between all image pixels or patches simultaneously. However, they require large datasets for training and do not capture low-level features well. To that end, we propose DDViT, a novel ViT model that unites a CNN to alleviate data hunger for medical image segmentation with two multi-scale feature representations. Significantly, our approach incorporates a ViT with a plug-in domain adapter (DA) and a Double-Level Fusion (DLF) technique, complemented by a mutual knowledge distillation paradigm, facilitating the seamless exchange of knowledge between a universal network and specialized domain-specific network branches. The DLF framework plays a pivotal role in our encoder-decoder architecture, combining the innovation of the TransFuse module with a robust CNN-based encoder. Extensive experimentation across diverse medical image segmentation datasets underscores the remarkable efficacy of DDViT when compared to alternative approaches based on CNNs and Transformer-based models.



Paperid:2754
Authors:Adrian Swindle, Derrick McNealy, Giri Krishnan, Ramyaa Ramyaa
Saint Louis University, University of Southern Mississippi, University of California, San Diego, New Mexico Institute of Mining and Technology
Abstract:
Obfuscation intends to decrease interpretability of code and identification of code behavior. Large Language Models (LLMs) have been proposed for code synthesis and code analysis. This paper attempts to understand how well LLMs can analyze code and identify code behavior. Specifically, this paper systematically evaluates several LLMs' capabilities to detect obfuscated code and identify behavior across a variety of obfuscation techniques with varying levels of complexity. LLMs proved to be better at detecting obfuscations that changed identifiers, even to misleading ones, compared to obfuscations involving code insertions (unused variables, as well as variables that replace constants with expressions that evaluate to those constants). Hardest to detect were obfuscations that layered multiple simple transformations. For these, only 20-40% of the LLMs' responses were correct. Adding misleading documentation was also successful in misleading LLMs. We provide all our code to replicate results at https://github.com/SwindleA/LLMCodeObfuscation. Overall, our results suggest a gap in LLMs' ability to understand code.



Paperid:2755
Authors:Hui Tang, Xun Liang, Sensen Zhang
Renmin University of China, Renmin University of China, Renmin University of China
Abstract:
Detecting anomalies on attributed graphs is a challenging task, since labelling anomalies is highly labour-intensive and requires specialized domain knowledge, making anomalous samples far less available than normal ones. Moreover, graphs contain complex structure information as well as attribute information, leading to anomalies that can be hidden in the structure space, the attribute space, or a mix of both. In this paper, we propose a novel model for graph anomaly detection named ProGAD. Specifically, ProGAD takes advantage of label propagation to infer high-quality pseudo labels by considering the structure and attribute inconsistencies between normal and abnormal samples. Meanwhile, ProGAD introduces the prior knowledge of class distribution to correct and refine pseudo labels with a prototype-aware strategy. Experiments demonstrate that ProGAD achieves strong performance compared with the current state-of-the-art methods.



Paperid:2756
Authors:Cindy Tong, Rosanna Chan
Nanyang Technological University, Singapore The Chinese University of Hong Kong, Hong Kong, The Chinese University of Hong Kong, Hong Kong Centre for Perceptual and Interactive Intelligence, Hong Kong
Abstract:
Gaze estimation is an important research area in computer vision and machine learning. Eye-tracking and gaze-based interactions have made assistive technology (AT) more accessible to people with physical limitations. However, a non-negligible proportion of existing AT users, including those having dyskinetic cerebral palsy (CP) or severe intellectual disabilities (ID), have difficulties in using eye trackers due to their involuntary body movements. In this paper, we propose an adaptation method pertaining to head movement prediction and fixation smoothing to stabilize our target users' gaze points on the screen and improve their user experience (UX) in gaze-based interaction. Our empirical experimentation shows that our method significantly shortens the users' selection time and increases their selection accuracy.
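
A minimal sketch of fixation smoothing via an exponential moving average over raw gaze coordinates; this is a generic stabilization filter shown only for illustration, not the head-movement-prediction model proposed in the paper, and the `alpha` value is a placeholder.

```python
import numpy as np

def smooth_gaze(points, alpha=0.2):
    """Exponentially smooth a stream of (x, y) gaze points to damp jitter
    caused by involuntary movements; smaller alpha means stronger smoothing."""
    points = np.asarray(points, dtype=float)
    smoothed = np.empty_like(points)
    smoothed[0] = points[0]
    for t in range(1, len(points)):
        smoothed[t] = alpha * points[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

raw = [(100, 200), (180, 140), (120, 260), (110, 210)]  # jittery gaze samples
print(smooth_gaze(raw))
```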



Paperid:2757
Authors:Aryan Tyagi, Aryaman Rao, Shubhanshu Rao, Raj Kumar Singh
Delhi Technological University, Delhi Technological University, Delhi Technological University, Delhi Technological University
Abstract:
Chronic Obstructive Pulmonary Disorder (COPD) is a prevalent respiratory disease that significantly impacts the quality of life of affected individuals. This paper presents COPD-FlowNet, a novel deep-learning framework that leverages a custom Generative Adversarial Network (GAN) to generate synthetic Computational Fluid Dynamics (CFD) velocity flow field images specific to the trachea of COPD patients. These synthetic images serve as a valuable resource for data augmentation and model training. Additionally, COPD-FlowNet incorporates a custom Convolutional Neural Network (CNN) architecture to predict the location of the obstruction site.



Paperid:2758
Authors:Guihong Wan, Wei Mao, Yevgeniy R. Semenov, Haim Schweitzer
Massachusetts General Hospital Harvard Medical School Harvard T. H. Chan School of Public Health, University of Texas at Dallas, Massachusetts General Hospital Harvard Medical School, University of Texas at Dallas
Abstract:
The common criteria for evaluating spectral clustering are NCut and RatioCut. The seemingly unrelated column subset selection (CSS) problem aims to compute a column subset that linearly approximates the entire matrix. A common criterion is the approximation error in the Frobenius norm (ApproxErr). We show that any algorithm for CSS can be viewed as a clustering algorithm that minimizes NCut by applying it to a matrix formed from graph edges. Conversely, any clustering algorithm can be seen as identifying a column subset from that matrix. In both cases, ApproxErr and NCut have the same value. Analogous results hold for RatioCut with a slightly different matrix. Therefore, established results for CSS can be mapped to spectral clustering. We use this to obtain new clustering algorithms, including an optimal one that is similar to A*. This is the first nontrivial clustering algorithm with such an optimality guarantee. A variant based on weighted A* runs much faster and provides bounds on the accuracy. Finally, we use the results from spectral clustering to prove the NP-hardness of CSS from sparse matrices.
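
For reference, the standard textbook forms of the criteria mentioned above, for a partition of the vertex set into clusters A_1, ..., A_k and a column subset S of a matrix M; the specific matrices that realize the CSS/NCut correspondence are constructed in the paper itself.

```latex
\[
\mathrm{NCut}(A_1,\dots,A_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i,\bar{A}_i)}{\mathrm{vol}(A_i)},
\qquad
\mathrm{RatioCut}(A_1,\dots,A_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i,\bar{A}_i)}{|A_i|},
\]
\[
\mathrm{ApproxErr}(S) = \min_{X} \bigl\| M - M_{:,S}\,X \bigr\|_F^{2},
\]
```

where cut(A_i, Ā_i) sums the weights of edges leaving A_i, vol(A_i) sums the degrees of its vertices, and M_{:,S} denotes the matrix restricted to the selected columns.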



Paperid:2759
Authors:Deliang Wang
The University of Hong Kong
Abstract:
This paper proposes interpreting methods from explainable artificial intelligence to address the interpretability issues in deep learning-based models for classroom dialogue. Specifically, we developed a BERT-based model to automatically detect student talk moves within classroom dialogues, utilizing the TalkMoves dataset. Subsequently, we proposed three generic interpreting methods, namely saliency, input*gradient, and integrated gradients, to explain the predictions of classroom dialogue models by computing input relevance (i.e., contribution). The experimental results show that the three interpreting methods can effectively explain the classroom dialogue models' predictions, thereby potentially fostering teachers' trust.



Paperid:2760
Authors:Kuang-Da Wang, Wei-Yao Wang, Yu-Tse Chen, Yu-Heng Lin, Wen-Chih Peng
National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University
Abstract:
The growing demand for precise sports analysis has been explored to improve athlete performance in various sports (e.g., basketball, soccer). However, existing methods for different sports face challenges in validating strategies, because simple rule-based opponents in simulated environments lead to performance gaps when strategies are deployed in real-world matches. In this paper, we propose the CoachAI Badminton Environment, a novel reinforcement learning (RL) environment with realistic opponents for badminton, which serves as a compelling example of a turn-based game. It supports researchers in exploring various RL algorithms in the badminton context by integrating state-of-the-art tactical-forecasting models and real badminton game records. The Badminton Benchmarks are proposed with multiple widely adopted RL algorithms to benchmark the performance of simulating matches against real players. To advance novel algorithms and developments in badminton analytics, we make our environment open-source, enabling researchers to simulate more complex badminton sports scenarios based on this foundation. Our code is available at https://github.com/wywyWang/CoachAI-Projects/tree/main/CoachAI%20Badminton%20Environment.



Paperid:2761
Authors:Yutao Wei, Wenzheng Shu, Zhangtao Cheng, Wenxin Tai, Chunjing Xiao, Ting Zhong
University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China Kash Institute of Electronics and Information Industry, Kashgar 844000, China, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China Kash Institute of Electronics and Information Industry, Kashgar 844000, China, Henan University, Kaifeng, Henan 475000, China, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China Kash Institute of Electronics and Information Industry, Kashgar 844000, China
Abstract:
Graph anomaly detection has received remarkable research interest, and various techniques have been employed for enhancing detection performance. However, existing models tend to learn dataset-specific spurious correlations based on statistical associations. A well-trained model might suffer from performance degradation when applied to newly observed nodes with different environments. To handle this situation, we propose the CounterFactual Graph Anomaly Detection model, CFGAD. In this model, we design a gradient-based separator to disentangle node features into class features and environment features. Then, we present a weight-varying diffusion model to combine class features and environment features from different nodes to generate counterfactual samples. These counterfactual samples are adopted to enhance model robustness. Comprehensive experiments demonstrate the effectiveness of our CFGAD.



Paperid:2762
Authors:Keefer P. Wu, Patricia C. Tsang
Northeastern University, Boston, MA, MedStar Health, Washington, DC
Abstract:
Can advanced AI-driven technologies transform the traditionally arduous educational process in medicine? This study takes a deep dive into how the publicly available OpenAI ChatGPT-3.5 performs in answering board-style questions designed for physicians training to become pathologists. Correctly answering 75% of 543 questions in an engaging and fast-paced format was an impressive performance. It underscores the potential, as well as the improvement opportunities, of using interactive AI in future medical training.



Paperid:2763
Authors:Ruoyu Xu, Zhenyu Xu, Gaoxiang Li, Victor S. Sheng
Texas Tech University, Texas Tech University, Texas Tech University, Texas Tech University
Abstract:
Reverse engineering involves analyzing the design, architecture, and functionality of systems, and is crucial for legacy systems. Legacy systems are outdated software systems that are still in use and often lack proper documentation, which makes their maintenance and evolution challenging. To address this, we introduce SC2Req, utilizing the Generative Pre-trained Transformer (GPT) for automated code analysis and requirement generation. This approach aims to convert source code into understandable requirements and bridge the gap between the two. Through experiments on diverse software projects, SC2Req shows the potential to enhance the accuracy and efficiency of the translation process. This approach not only facilitates faster software development and easier maintenance of legacy systems but also lays a strong foundation for future research, promoting better understanding and communication in software development.



Paperid:2764
Authors:Zhenyu Xu, Ruoyu Xu, Victor S. Sheng
Texas Tech University, Texas Tech University, Texas Tech University
Abstract:
In the era of large language models like ChatGPT, maintaining academic integrity in programming education has become challenging due to potential misuse. There is a pressing need for reliable detectors to identify ChatGPT-generated code. While previous studies have tackled model-generated text detection, identifying such code remains uncharted territory. In this paper, we introduce a novel method to discern ChatGPT-generated code. We employ targeted masking perturbation, emphasizing code sections with high perplexity. A fine-tuned CodeBERT is utilized to replace these masked sections, generating subtly perturbed samples. Our scoring system amalgamates overall perplexity, variations in code line perplexity, and burstiness. In this scoring scheme, a higher rank for the original code suggests it is more likely to be ChatGPT-generated. The underlying principle is that code generated by models typically exhibits consistent, low perplexity and reduced burstiness, with its ranking remaining relatively stable even after subtle modifications. In contrast, human-written code, when perturbed, is more likely to produce samples that the model prefers. Our approach significantly outperforms current detectors, especially against OpenAI's text-davinci-003 model, with the average AUC rising from 0.56 (GPTZero baseline) to 0.87.
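
A minimal, hedged sketch of the ranking intuition: compute a composite score from per-line perplexities for the original code and its perturbed variants, and use the original's standing among the variants as the detection signal. The weights and the toy perplexity values are placeholders; in the paper, perplexities come from a language model and perturbations are produced by masking high-perplexity spans and infilling them with a fine-tuned CodeBERT.

```python
import numpy as np

def composite_score(line_perplexities, w=(1.0, 1.0, 1.0)):
    """Combine overall perplexity, per-line variation, and burstiness."""
    ppl = np.asarray(line_perplexities, dtype=float)
    overall = ppl.mean()
    variation = ppl.std()
    burstiness = (ppl.max() - ppl.min()) / (ppl.mean() + 1e-8)
    return w[0] * overall + w[1] * variation + w[2] * burstiness

def original_rank_fraction(original_ppls, perturbed_ppls_list):
    """Fraction of perturbed variants scoring higher than the original.
    A value near 1.0 means the original keeps the lowest, most 'model-like'
    score even after perturbation, which is taken as evidence of
    ChatGPT-generated code; human-written code tends to shift more."""
    s0 = composite_score(original_ppls)
    higher = sum(composite_score(p) > s0 for p in perturbed_ppls_list)
    return higher / max(len(perturbed_ppls_list), 1)

# Toy per-line perplexities (in practice these come from a language model).
original = [5.1, 5.3, 4.9, 5.0]
perturbed = [[6.2, 5.8, 6.5, 6.0], [7.1, 6.4, 6.9, 6.6]]
print(original_rank_fraction(original, perturbed))
```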



Paperid:2765
Authors:Boshen Yan, Guihong Wan, Haim Schweitzer, Zoltan Maliga, Sara Khattab, Kun-Hsing Yu, Peter K. Sorger, Yevgeniy R. Semenov
Department of Dermatology, Massachusetts General Hospital, Harvard Medical School Department of Biomedical Informatics, Harvard Medical School, Department of Dermatology, Massachusetts General Hospital, Harvard Medical School Laboratory of Systems Pharmacology, Department of Systems Biology, Harvard Medical School, Department of Computer Science, University of Texas at Dallas, Laboratory of Systems Pharmacology, Department of Systems Biology, Harvard Medical School, Department of Dermatology, Massachusetts General Hospital, Harvard Medical School, Department of Biomedical Informatics, Harvard Medical School, Laboratory of Systems Pharmacology, Department of Systems Biology, Harvard Medical School, Department of Dermatology, Massachusetts General Hospital, Harvard Medical School Laboratory of Systems Pharmacology, Department of Systems Biology, Harvard Medical School
Abstract:
Graph spectral clustering is a fundamental technique in data analysis, which utilizes eigenpairs of the Laplacian matrix to partition graph vertices into clusters. However, classical spectral clustering algorithms require eigendecomposition of the Laplacian matrix, which has cubic time complexity. In this work, we describe pass-efficient spectral clustering algorithms that leverage recent advances in randomized eigendecomposition and the structure of the graph vertex-edge matrix. Furthermore, we derive formulas for their efficient implementation. The resulting algorithms have a linear time complexity with respect to the number of vertices and edges and pass over the graph a constant number of times, making them suitable for processing large graphs stored on slow memory. Experiments validate the accuracy and efficiency of the algorithms.



Paperid:2766
Authors:Kai Yang, Jiayang Li, Wenxin Tai, Zhenhui Li, Ting Zhong, Guangqiang Yin, Yong Wang
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, Hong Kong University of Science and Technology
Abstract:
Accurate IP geolocation is indispensable for location-aware applications. While recent advances based on router-centric IP graphs are considered cutting-edge, one challenge remains: the prevalence of sparse IP graphs (14.24% with fewer than 10 nodes, 9.73% isolated) limits graph learning. To mitigate this issue, we designate the target host as the central node and aggregate multiple last-hop routers to construct a target-centric IP graph, instead of relying solely on the router with the smallest last-hop latency as in previous works. Experiments on three real-world datasets show that our method significantly improves geolocation accuracy compared to existing baselines.



Paperid:2767
Authors:Wenxue Ye, Shichong Li, Zhangtao Cheng, Xovee Xu, Ting Zhong, Bei Hui, Fan Zhou
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry
Abstract:
Information diffusion prediction is a critical task for many social network applications. However, current methods are mainly limited by the fact that user relationships behind resharing behaviors are complex and entangled. To address this issue, we propose MHGFormer, a novel multi-channel hypergraph transformer framework, to better decouple complex user relations and obtain fine-grained user representations. First, we employ designed triangular motifs to decouple user relations into three hypergraphs at different levels. Second, a position-aware hypergraph transformer is used to refine user relations and obtain high-quality user representations. Extensive experiments conducted on two social datasets demonstrate that MHGFormer outperforms state-of-the-art diffusion models across several settings.



Paperid:2768
Authors:Liu Yu, Fenghui Tian, Ping Kuang, Fan Zhou
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Conventional commonsense knowledge graph completion (CKGC) methods provide inadequate sequences during the fine-tuning and generation stages and rely on full fine-tuning, which fails to align with the autoregressive model's pre-training patterns and has insufficient parameter efficiency. Moreover, decoding through beam or greedy search produces low diversity and high similarity in the generated tail entities. Hence, we resort to prefix-tuning and propose a lightweight, effective pipeline to enhance the quality and diversity of extracted commonsense knowledge. Precisely, we measure head entity similarity to yield top-k tuples and concatenate them before each target tuple for prefix-tuning the source LM, thereby improving the efficiency and speed of pre-trained models; then, we design a penalty-tailored diverse beam search (p-DBS) for decoding tail entities, producing a greater quantity and diversity of generated commonsense tuples; besides, a filter strategy is utilized to filter out invalid commonsense knowledge. Through extensive automatic evaluations, including ChatGPT scoring, our method can extract diverse, novel, and accurate commonsense knowledge (CK).



Paperid:2769
Authors:Liu Yu, Ludie Guo, Ping Kuang, Fan Zhou
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Pre-trained language models (PLMs) have greatly transformed various downstream tasks, yet they frequently display social biases from training data, raising fairness concerns. Recent efforts to debias PLMs come with limitations: they either fine-tune the entire set of parameters in PLMs, which is time-consuming and disregards the expressiveness of PLMs, or ignore the reintroduction of biases from downstream tasks when applying debiased models to them. Hence, we propose a two-stage pipeline to mitigate biases from both internal and downstream contexts while preserving expressiveness in language models. Specifically, for the debiasing procedure, we resort to continuous prefix-tuning, rather than fully fine-tuning the PLM, in which we design a debiasing term for optimization and an alignment term to keep words' relative distances and ensure the model's expressiveness. For downstream tasks, we perform causal intervention across different demographic groups for invariant predictions. Results on three GLUE tasks show our method alleviates biases from internal and downstream contexts, while keeping PLM expressiveness intact.



Paperid:2770
Authors:Wenhuan Zeng, Daniel H. Huson
Tübingen University, University of Tuebingen
Abstract:
DNA methylation is an epigenetic mechanism for regulating gene expression, and it plays an important role in many biological processes. While methylation sites can be identified using laboratory techniques, much work is being done on developing computational approaches using machine learning. Here, we present a deep-learning algorithm for determining the 5-methylcytosine status of a DNA sequence. We propose an ensemble framework that treats the self-attention score as an explicit feature that is added to the encoder layer generated by fine-tuned language models. We evaluate the performance of the model under different data distribution scenarios.



Paperid:2771
Authors:Jienan Zhang, Jie Liu, Zhangtao Cheng, Xovee Xu, Fang Liu, Ting Zhong, Kunpeng Zhang
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, Civil Aviation Flight University of China, University of Electronic Science and Technology of China Kashi Institute of Electronics and Information Industry, The University of Maryland
Abstract:
Social media popularity prediction of multimodal user-generated content (UGC) is a crucial task for many real-world applications. However, existing efforts are often limited by missing inter-instance correlations and UGC temporal patterns. To address these issues, we propose a novel time-aware hypergraph Transformer framework, THGFormer. It fully represents inter-instance and intra-instance relations by hypergraphs, captures the temporal dependencies with a time encoder, and enhances UGC's representations via neighborhood knowledge aggregation. Extensive experiments conducted on two real-world datasets demonstrate that THGFormer outperforms state-of-the-art popularity prediction models across several settings.



Paperid:2772
Authors:Sensen Zhang, Xun Liang, Simin Niu, Xuan Zhang, Chen Feng, Yuefeng Ma
School of Information, Renmin University of China, School of Information, Renmin University of China, School of Information, Renmin University of China, Harvest Fund Management Co., Ltd Guanghua School of Management, Peking University School of Information, Renmin University of China, School of Information, Renmin University of China, School of Computer, Qufu Normal University
Abstract:
Researchers have applied knowledge graph embedding (KGE) techniques with advanced neural network techniques, such as capsule networks, for predicting drug-drug interactions (DDIs) and achieved remarkable results. However, most ignore molecular structure and position features between drug pairs, and they cannot model the biomedical field's significant relational mapping properties (RMPs: 1-N, N-1, and N-N relations). To solve these problems, we propose CDHse, which consists of two crucial modules: 1) Entity embedding module: we obtain position features with PubMedBERT and a Convolutional Neural Network (CNN), obtain molecular structure features with a Graph Neural Network (GNN), obtain entity embedding features of drug pairs, and then incorporate these features into one synthetic feature. 2) Knowledge graph embedding module: the synthetic feature is transformed by Householder projections and then embedded in the complex vector space for training. In this paper, we have selected several advanced models for the DDIs task and performed experiments on three standard BioKGs to validate the effectiveness of CDHse.



Paperid:2773
Authors:Ye Zhang, Yanqi Gao, Yupeng Zhou, Jianan Wang, Minghao Yin
Northeast Normal University, Northeast Normal University, Northeast Normal University, Northeast Normal University, Northeast Normal University
Abstract:
With the abundance of learning resources available on massive open online course (MOOC) platforms, the issue of interactive data sparsity has emerged as a significant challenge. This paper introduces MRMLREC, an efficient MOOC video recommendation framework which consists of two main stages, multi-relational representation and multi-level recommendation, aiming to solve the problem of data sparsity. In the multi-relational representation stage, MRMLREC adopts a tripartite approach, constructing relational graphs based on temporal sequences, course-video relations, and knowledge concept-video relations. These graphs are processed by a Graph Convolution Network (GCN) and two variant Graph Attention Networks (GATs) to derive representations. A variant of the Long Short-Term Memory network (LSTM) then integrates these multi-dimensional data to enhance the overall representation. The multi-level recommendation stage introduces three prediction tasks at varying levels (courses, knowledge concepts, and videos) to mitigate data sparsity and improve the interpretability of video recommendations. Beam search (BS) is employed to identify the top-β items at each level, refining the subsequent level's search space and enhancing recommendation efficiency. Additionally, an optional layer offers both personalization and diversification modes, ensuring variety in recommended videos and maintaining learner engagement. Comprehensive experiments demonstrate the effectiveness of MRMLREC on two real-world instances from XuetangX.



Paperid:2774
Authors:Chenxu Zhao, Wei Qian, Yucheng Shi, Mengdi Huai, Ninghao Liu
Iowa State University, Iowa State University, University of Georgia, Iowa State University, University of Georgia
Abstract:
Interpreting deep neural networks by examining neurons offers distinct advantages for exploring their inner workings. Previous research has indicated that specific neurons within deep vision networks possess semantic meaning and play pivotal roles in model performance. Nonetheless, the current methods for generating neuron semantics heavily rely on human intervention, which hampers their scalability and applicability. To address this limitation, this paper proposes a novel post-hoc framework for generating semantic explanations of neurons with large foundation models, without requiring human intervention or prior knowledge. Experiments are conducted with both qualitative and quantitative analysis to verify the effectiveness of our proposed approach.



Paperid:2775
Authors:Brian Zhou, Jason Geder, Kamal Viswanath, Alisha Sharma, Julian Lee
Thomas Jefferson High School for Science and Technology Naval Research Laboratory, Naval Research Laboratory, Naval Research Laboratory, Naval Research Laboratory University of Maryland, Yale University
Abstract:
Flapping-fin unmanned underwater vehicle (UUV) propulsion systems enable high maneuverability for tasks ranging from station-keeping to surveillance but are often constrained by their limited computational power and battery capacity. Previous research has demonstrated that time-series neural network models can accurately predict the thrust and power of certain fin kinematics based on the specified gait coupled with the fin configuration, but cannot fit an inverse neural network that takes a thrust request and tunes the kinematics by weighting thrust generation, smooth movement transitions, and power attributes. We study various combinations of the three weights and fin materials to create different ‘modes’ of movement for a multi-objective UUV, based on controller intent, using an inverse neural network. Finally, we implement and validate an enhanced power-aware inverse model by benchmarking on the Raspberry Pi Model 4B system and testing through generated simulated movements.



Paperid:2776
Authors:Yujian Zhu, Hao Ding, Zongzhang Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China, National Key Laboratory for Novel Software Technology, Nanjing University, China School of Artificial Intelligence, Nanjing University, China
Abstract:
Ad hoc teamwork is a crucial challenge that aims to design an agent capable of effective collaboration with teammates employing diverse strategies without prior coordination. However, current Population-Based Training (PBT) approaches train the ad hoc agent through interaction with diverse teammates from scratch, which suffer from low efficiency. We introduce Multi-Expert Distillation (MED), a novel approach that directly distills diverse strategies through modeling across-episodic sequences. Experiments show that our algorithm achieves more efficient and stable training and has the ability to improve its behavior using historical contexts. Our code is available at https://github.com/LAMDA-RL/MED.



Paperid:2777
Authors:Oluwanifemi Adebayo Moses Adekanye
Bowen University
Abstract:
This paper outlines a proposal exploring the potential use of Large Language Models (LLMs), particularly GPT-4, in crafting realistic synthetic environments for self-driving scenarios. The envisioned approach involves dynamic scene generation within game engines, leveraging LLMs to introduce challenging elements for autonomous vehicles. The proposed evaluation process outlines assessments such as realistic testing, safety metrics, and user interaction, aiming to set the stage for potential improvements in self-driving system performance. The paper aims to contribute to the AI field by discussing how LLMs could be utilized to create valuable testing grounds for autonomous vehicles, potentially fostering the development of more robust self-driving technology. The envisioned impact is the eventual enhancement of road safety and the possible acceleration of the adoption of autonomous vehicles, paving the way for a future with safer and more efficient transportation.



Paperid:2778
Authors:Varun Ananth
Paul G. Allen School of Computer Science, University of Washington
Abstract:
Considering that the human brain is the most powerful, generalizable, and energy-efficient computer we know of, it makes the most sense to look to neuroscience for ideas regarding deep learning model improvements. I propose one such idea: augmenting a traditional Advantage-Actor-Critic (A2C) model with additional learning signals akin to those in the brain. Pursuing this direction of research should result in a new reinforcement learning (RL) control paradigm that can learn from fewer examples, train with greater stability, and possibly consume less energy.



Paperid:2779
Authors:Amy Au
University of British Columbia
Abstract:
This research explores the discourse surrounding red teaming and aims to identify any themes in the online discussion of potential environmental harms stemming from Large Language Models (LLMs). Focusing on the AI Red Teaming event at DEFCON 31, this study employs reflexive thematic analysis on diverse social networking site sources to extract insights into public discussion of LLM red teaming and its environmental implications. The findings intend to inform future research, highlighting the need for responsible AI development that addresses environmental concerns.



Paperid:2780
Authors:Adam Baji
University of Maryland, Baltimore County
Abstract:
This study leverages Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to enhance diagnostics and predictions in healthcare. By training on extensive healthcare datasets, this project aims to improve early disease detection and health risk assessments. Evaluation emphasizes accuracy, reliability, and ethical considerations, including bias mitigation. This research promises to bridge AI advancements and clinical applications, offering significant improvements in diagnostic capabilities and healthcare accessibility.



Paperid:2781
Authors:Hanlin Cai
National University of Ireland Maynooth, Maynooth, Co. Kildare, Ireland Maynooth International Engineering College, Fuzhou University, Fujian, China
Abstract:
As the most popular low-power communication protocol, Bluetooth Low Energy (BLE) has garnered significant cybersecurity research attention. Due to BLE's inherent security limitations and firmware vulnerabilities, spoofing attacks can easily compromise BLE devices and tamper with private data. In this paper, we propose BLEGuard, a hybrid detection mechanism combining cyber-physical features with learning-based techniques. We established a physical network testbed to conduct attack simulations and capture advertising packets. Four different network features were utilized to implement detection and classification algorithms. Preliminary results have verified the feasibility of our proposed methods.



Paperid:2782
Authors:Minghai Chen
University of British Columbia
Abstract:
Event cameras have unique advantages in high temporal resolution and dynamic range and have shown potential in several computer vision tasks. However, due to the novelty of this hardware, there is a lack of large benchmark DVS event-stream datasets, including datasets for object recognition. In this work, we propose an encoder-decoder method to augment event-stream datasets from images and optical flow with arbitrary temporal resolution for the object recognition task. We believe the proposed method can generalize well in augmenting event-stream vision data for object recognition and will help advance the development of the event vision paradigm.



Paperid:2783
Authors:Chahana Dahal
Westminster University
Abstract:
This proposal introduces an innovative AI-powered learning system designed to address educational disparities worldwide. Focused on developing countries, the system seamlessly translates educational content between English and native languages, breaking down language barriers. Leveraging advanced natural language processing and machine learning techniques, including transformer models like BERT and GPT-3, the system ensures inclusivity, effectiveness, and engagement. Built on prior research demonstrating AI's efficacy in language translation and personalized learning, the proposed system draws inspiration from successful projects like Duolingo Language Incubator. By providing inclusive and accessible learning experiences, it empowers individuals to overcome language barriers, fostering global participation. The potential impact is significant, with the system poised to accelerate learning, enhance literacy rates, and create a more skilled workforce in developing countries. This research reflects a commitment to revolutionize education through technology, aiming for lasting and transformative contributions to global society. Through AI-driven education, a brighter, more inclusive future is envisioned.



Paperid:2784
Authors:Joseph Fatoye
Bowen University
Abstract:
In the pursuit of creating more effective and adaptable robots, the flourishing field of cognitive robotics has arisen to infuse machines with human-like cognitive functions. This paper delves into the significance of cognitive robotics and charts a course for empowering robots with advanced cognitive capabilities. Drawing inspiration from current research in cognitive architectures, the paper underscores the importance of refined perception, language processing, complex decision-making, emotional intelligence, and cognitive synergy. By integrating these cognitive functions into robotic systems, the goal is to equip robots to operate intelligently in dynamic environments, collaborate seamlessly with humans, and adeptly handle diverse tasks. The proposed enhancements mark crucial strides towards the development of more versatile and capable intelligent robots.



Paperid:2785
Authors:Isaiah Gallardo
Auburn University
Abstract:
Constructing road networks manually is a time-consuming and labor-intensive process. This paper proposes a new method to iteratively construct road networks using reinforcement learning from a combined tensor-based representation of satellite image and GPS trajectory data.



Paperid:2786
Authors:Cassandra Goldberg
Bowdoin College
Abstract:
This paper proposes a novel approach for Synthetic Aperture Radar (SAR) image segmentation by incorporating known statistical properties of SAR into deep learning models. We generate synthetic data using the Generalized Gamma distribution, modify the UNet architecture to encompass statistical moments, and employ stochastic distance losses for improved segmentation performance. Evaluation against traditional methods will reveal the potential of this approach to advance SAR image analysis, with broader applications in environmental monitoring and general image segmentation tasks.



Paperid:2787
Authors:Hannah Guan
Harvard College
Abstract:
Aging is a complex stochastic process that affects healthy functioning through various pathways. In contrast to the more commonly used cross-sectional methods, our research focuses on longitudinal modeling of aging, a less explored but crucial area. We have developed a Stochastic Differential Equation (SDE) model, at the forefront of aging research, designed to accurately forecast the health trajectories and survival rates of individuals. This model adeptly delineates the connections between different health indicators and provides clear, interpretable results. Our approach utilizes the SDE framework to encapsulate the inherent uncertainty in the aging process. Moreover, it incorporates a Recurrent Neural Network (RNN) to integrate past health data into future health projections. We plan to train and test our model using a comprehensive dataset tailored for aging studies. This model is not only computationally cost-effective but also highly relevant in assessing health risks in older populations, particularly for those at high risk. It can serve as an essential tool in anticipating and preparing for challenges like infectious disease outbreaks. Overall, our research aims to improve health equity and global health security significantly, offering substantial benefits to public health and deepening our understanding of the aging process.



Paperid:2788
Authors:Javon Hickmon
University of Washington, Seattle, WA
Abstract:
Artificial intelligence has made significant progress in image classification, an essential task for machine perception to achieve human-level image understanding. Despite recent advances in vision-language fields, multimodal image classification is still challenging, particularly for the following two reasons. First, models with low capacity often suffer from underfitting and thus underperform on fine-grained image classification. Second, it is important to ensure high-quality data with rich cross-modal representations of each class, which is often difficult to generate. Here, we utilize ensemble learning to reduce the impact of these issues on pre-trained models. We aim to create a meta-model that combines the predictions of multiple open-vocabulary multimodal models trained on different data to create more robust and accurate predictions. By utilizing ensemble learning and multimodal machine learning, we will achieve higher prediction accuracies without any additional training or fine-tuning, meaning that this method is completely zero-shot.
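
A minimal sketch of the zero-shot ensembling idea, with simple probability averaging standing in for whatever meta-model combination rule is ultimately used; the `prob_fns` callables are hypothetical stand-ins for open-vocabulary models (e.g., different CLIP variants), not a specific implementation from the paper.

```python
import numpy as np

def ensemble_zero_shot(prob_fns, image, class_names):
    """Average the per-class probabilities of several zero-shot multimodal
    classifiers; no additional training or fine-tuning is involved."""
    probs = np.stack([fn(image, class_names) for fn in prob_fns])  # (M, C)
    return probs.mean(axis=0)

def predict(prob_fns, image, class_names):
    avg = ensemble_zero_shot(prob_fns, image, class_names)
    return class_names[int(avg.argmax())], avg

# Toy stand-ins: each "model" returns a probability vector over class_names.
model_a = lambda img, names: np.array([0.7, 0.2, 0.1])
model_b = lambda img, names: np.array([0.5, 0.4, 0.1])
print(predict([model_a, model_b], image=None, class_names=["cat", "dog", "bird"]))
```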



Paperid:2789
Authors:Fiona Luo
University of Pennsylvania, Philadelphia, PA
Abstract:
In this work, we use Vision-Language Models (VLMs) as a binary success detector given a robot observation and task description, formulated as a Visual Question Answering (VQA) problem. We fine-tune the open-source MiniGPT-4 VLM to detect success on robot trajectories from the Berkeley Bridge and Berkeley AUTOLab UR5 datasets. We find that while a handful of test distribution trajectories can train an accurate detector, transferring learning between different environments is challenging due to distribution shift. In addition, while our VLM is robust to language variations, it is less robust to visual variations. In the future, more powerful VLMs such as Gemini and GPT-4 have the potential to be more accurate and robust success detectors, and success detectors can provide a sparse binary reward to improve existing policies.



Paperid:2790
Authors:Erica Okeh
Howard University
Abstract:
The integration of Artificial Intelligence (AI) into Augmented Reality (AR) for medical applications is propelled by the aim to address evident healthcare disparities. Certain communities have encountered disparities in medical diagnoses, exemplified by Black individuals exhibiting a 2.4 times higher likelihood of schizophrenia diagnosis compared to their white counterparts (Faber et al., 2023). These disparities often arise from structured interview assessments overlooking cultural nuances, resulting in increased misdiagnosis rates. This study leverages AI and AR to develop unbiased diagnostic tools and enhance empathy in healthcare professionals' training. Uniquely prioritizing the reduction of biased language and the fostering of empathy through AI-driven Natural Language Processing (NLP) and AI-driven virtual patients, the research aims to enhance diagnostic accuracy while promoting cultural sensitivity among healthcare professionals. Aligned with broader goals of achieving equitable healthcare and reducing disparities, the evaluation involves pre- and post-training assessments to measure language improvements and empathy enhancements. Successful implementation could lead to a more equitable healthcare landscape, fostering trust in AI-driven systems and ensuring fairer medical care for diverse communities.



Paperid:2791
Authors:Nilton Rojas
National University of Engineering
Abstract:
This research investigates the generalization capabilities of neural networks in deep learning when applied to real-world scenarios where data often contains imperfections, focusing on their adaptability to both noisy and non-noisy scenarios for image retrieval tasks. Our study explores approaches to preserve all available data, regardless of quality, for diverse tasks. The evaluation of results varies per task, due to the ultimate goal of developing a technique to extract relevant information while disregarding noise in the final network design for each specific task. The aim is to enhance accessibility and efficiency of AI across diverse tasks, particularly for individuals or countries with limited resources, lacking access to high-quality data. The dedication is directed towards fostering inclusivity and unlocking the potential of AI for wide-spread societal benefit.



Paperid:2792
Authors:Kevin Shen
The University of British Columbia
Abstract:
World Models consist of generative networks that can predict future states of the single environment on which they were trained. This research proposes a Multiworld Model, a foundational model for continual reinforcement learning built from World Models and trained on many different environments, enabling it to generalize state sequence predictions even to unseen settings.



Paperid:2793
Authors:Tanisha Shende
Oberlin College
Abstract:
Visual art facilitates expression, communication, and connection, yet it remains inaccessible to those who are visually impaired and those who lack the resources to understand the techniques and history of art. In this work, I propose the development of a generative AI model that generates a description and interpretation of a given artwork. Such research can make art more accessible, support art education, and improve the ability of AI to understand and translate between creative media. Development will begin with a formative study to assess the needs and preferences of blind and low vision people and art experts. Following the formative study, the basic approach is to train the model on a database of artworks and their accompanying descriptions, predict sentiments from extracted visual data, and generate a paragraph closely resembling training textual data and incorporating sentiment analysis. The model will then be evaluated quantitatively through metrics like METEOR and qualitatively through Turing tests in an iterative process.



Paperid:2794
Authors:Yitong Tang
University of British Columbia, 2329 West Mall Trusted and Efficient AI (TEA) Lab
Abstract:
This study introduces FedAW, a novel federated learning algorithm that uses a weighted aggregation mechanism sensitive to the quality of client datasets, leading to better model performance and faster convergence on diverse datasets, validated using Colored MNIST.
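
A minimal sketch of quality-weighted federated averaging, assuming each client reports a scalar quality score (e.g., derived from its validation performance); FedAW's actual weighting rule is not specified in the abstract, so this only illustrates the general mechanism it builds on.

```python
import numpy as np

def weighted_aggregate(client_params, quality_scores):
    """Aggregate client parameter vectors with weights proportional to a
    per-client dataset quality score (standard FedAvg would instead weight
    by each client's dataset size)."""
    w = np.asarray(quality_scores, dtype=float)
    w = w / w.sum()
    params = np.stack([np.asarray(p, dtype=float) for p in client_params])
    return (w[:, None] * params).sum(axis=0)

clients = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]   # flattened model parameters
quality = [0.9, 0.7, 0.1]                         # e.g., held-out accuracy per client
print(weighted_aggregate(clients, quality))
```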



Paperid:2795
Authors:Ada Tur
McGill University
Abstract:
Recent advancements in deep learning have the potential to transform the process of writing and creating music. Models that can capture and analyze higher-level representations of music and audio can serve to change the field of digital signal processing. In this statement, I propose a set of Music+AI methods that assist with the writing of melodies, the modelling and transferring of timbres, the application of a wide variety of audio effects, including research into experimental audio effects, and the production of audio samples using style transfer. Writing and producing music is a tedious task that is notably difficult to become proficient in, as many tools for creating music both cost significant sums of money and require long-term commitments to study. An all-encompassing framework for music processing would make the process far more accessible and simple, allowing human art to advance alongside technology.



Paperid:2796
Authors:Zhengguang Wang
University of Virginia, Charlottesville, Virginia
Abstract:
This work undertakes studies to evaluate Interpretability Methods for Time Series Deep Learning. Sensitivity analysis assesses how input changes affect the output, constituting a key component of interpretation. Among post-hoc interpretation methods such as back-propagation, perturbation, and approximation, my work will investigate perturbation-based sensitivity analysis methods on modern Transformer models to benchmark their performance. Specifically, my work intends to answer three research questions: 1) Do different sensitivity analysis methods yield comparable outputs and attribute importance rankings? 2) Using the same sensitivity analysis method, do different Deep Learning models impact the output of the sensitivity analysis? 3) How well do the results from sensitivity analysis methods align with the ground truth?
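As a concrete illustration of perturbation-based sensitivity analysis (independent of the specific Transformer benchmarks studied here), the sketch below occludes each timestep of a toy univariate series and records how much the model output changes; the baseline value and toy model are assumptions.

```python
import numpy as np

def perturbation_sensitivity(model, x, baseline=0.0):
    """Occlude each timestep of a univariate series and measure the change
    in the model's scalar output (simple occlusion attribution)."""
    base_out = model(x)
    scores = np.zeros(len(x))
    for t in range(len(x)):
        x_pert = x.copy()
        x_pert[t] = baseline               # replace one timestep with a baseline value
        scores[t] = abs(model(x_pert) - base_out)
    return scores

# Toy model whose output depends mostly on the last timestep.
model = lambda series: 0.1 * series[:-1].sum() + 2.0 * series[-1]
x = np.array([1.0, 1.0, 1.0, 1.0, 3.0])
print(perturbation_sensitivity(model, x))   # the last position gets the largest score
```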



Paperid:2797
Authors:Munkhtulga Battogtokh, Cosmin Davidescu, Michael Luck, Rita Borgo
King's College London, ContactEngine, King's College London, King's College London
Abstract:
Fine-grained text classification requires models to distinguish between many fine-grained classes that are hard to tell apart. However, despite the increased risk of models relying on confounding features and predictions being especially difficult to interpret in this context, existing work on the interpretability of fine-grained text classification is severely limited. Therefore, we introduce our visual analysis system, SemLa, which incorporates novel visualization techniques that are tailored to this challenge. Our evaluation based on case studies and expert feedback shows that SemLa can be a powerful tool for identifying model weaknesses, making decisions about data annotation, and understanding the root cause of errors.



Paperid:2798
Authors:Tathagata Chakraborti, Jungkoo Kang, Francesco Fuggitti, Michael Katz, Shirin Sohrabi
IBM Research, IBM Research, IBM Research, IBM Research, IBM Research
Abstract:
We present Lemming – a visualization tool for the interactive selection of plans for a given problem, allowing the user to efficiently whittle down the set of plans and select their plan(s) of choice. We demonstrate four different user experiences for this process, three of them based on the principle of using disjunctive action landmarks as guidance to cut down the set of choice points for the user, and one on the use of linear temporal logic (LTL) to impart additional constraints into the plan set using natural language (NL) instruction.



Paperid:2799
Authors:Rohan Chandra, Zayne Sprague, Joydeep Biswas
UT Austin, UT Austin, UT Austin
Abstract:
We present Social Gym 2.0, a simulator for multi-agent navigation research. Our simulator enables navigation for multiple autonomous agents, replicating real-world dynamics in complex indoor environments, including doorways, hallways, intersections, and roundabouts. Unlike current simulators that concentrate on single robots in open spaces, Social Gym 2.0 employs multi-agent reinforcement learning (MARL) to develop optimal navigation policies for multiple robots with diverse, dynamic constraints in complex environments. Social Gym 2.0 also departs from the accepted software design standards by employing a configuration-over-convention paradigm providing the capability to benchmark different MARL algorithms, as well as customize observation and reward functions. Users can additionally create their own environments and evaluate various algorithms, based on both deep reinforcement learning as well as classical navigation, using a broad range of social navigation metrics.



Paperid:2800
Authors:Simone Conia, Daniel Lee, Min Li, Umar Farooq Minhas, Yunyao Li
Sapienza University of Rome, Italy, University of Calgary, Canada, Apple, Apple, Adobe
Abstract:
Translating entity names, especially when a literal translation is not correct, poses a significant challenge. Although Machine Translation (MT) systems have achieved impressive results, they still struggle to translate cultural nuances and language-specific context. In this work, we show that the integration of multilingual knowledge graphs into MT systems can address this problem and bring two significant benefits: i) improving the translation of utterances that contain entities by leveraging their human-curated aliases from a multilingual knowledge graph, and, ii) increasing the interpretability of the translation process by providing the user with information from the knowledge graph.



Paperid:2801
Authors:Mingzhe Du, Anh Tuan Luu, Bin Ji, See-Kiong Ng
Nanyang Technological University National University of Singapore, Nanyang Technological University, National University of Singapore, National University of Singapore
Abstract:
The immense parameter space of Large Language Models (LLMs) endows them with superior knowledge retention capabilities, allowing them to excel in a variety of natural language processing tasks. However, it also makes it difficult to consistently tune LLMs to incorporate the most recent knowledge, which may further lead them to produce inaccurate and fabricated content. To alleviate this issue, we propose DynaMind, a knowledge metabolism framework for LLMs. This framework proactively sustains the credibility of knowledge through an auxiliary external memory component and directly delivers pertinent knowledge for LLM inference, thereby suppressing hallucinations caused by obsolete internal knowledge during the inference process. Benchmark experiments demonstrate DynaMind's effectiveness in overcoming this challenge. The code and demo of DynaMind are available at: https://github.com/Elfsong/DynaMind.
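To make the "deliver pertinent knowledge for inference" step concrete, here is a toy retrieval-then-prompt sketch. It uses a simple token-overlap retriever over a hand-written memory; it is not DynaMind's actual pipeline, which a real system would implement with dense embeddings and an LLM call.

```python
def retrieve(memory, query, k=2):
    """Score memory entries by token overlap with the query and return the
    top-k; a real system would use dense embeddings instead."""
    def overlap(fact):
        q, f = set(query.lower().split()), set(fact.lower().split())
        return len(q & f) / len(q | f)
    return sorted(memory, key=overlap, reverse=True)[:k]

memory = [
    "The framework keeps recent facts in an auxiliary external memory.",
    "Paris is the capital of France.",
    "LLMs can hallucinate when their internal knowledge is obsolete.",
]
query = "Why do LLMs hallucinate with obsolete knowledge?"
context = retrieve(memory, query)
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
print(prompt)   # the retrieved facts would be prepended before LLM inference
```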



Paperid:2802
Authors:Clyde Fare, George K. Holt, Lamogha Chiazor, Michail Smyrnakis, Robert Tracey, Lan Hoang
IBM Research Europe, STFC Hartree, IBM Research Europe, STFC Hartree, IBM Research Europe, IBM Research Europe
Abstract:
AI-driven materials discovery is evolving rapidly with new approaches and pipelines for experimentation and design. However, the pipelines are often designed in isolation. We introduce a modular reinforcement learning framework for inter-operable experimentation and design of tailored, novel molecular species. The framework unifies reinforcement learning (RL) pipelines and allows the mixing and matching of choices for the underlying chemical action space, molecular representation, desired molecular properties, and RL algorithm. Our demo showcases the framework's capabilities applied to benchmark problems like quantitative estimate of drug-likeness and PLogP, as well as the design of novel small molecule solvents for carbon capture.



Paperid:2803
Authors:Shubh Goyal, Medha Hira, Shubham Mishra, Sukriti Goyal, Arnav Goel, Niharika Dadu, Kirushikesh DB, Sameep Mehta, Nishtha Madaan
IIT Jodhpur, IIITD, IIT Jodhpur, IIT Jodhpur, IIITD, IIT Jodhpur, IBM Research India, IBM, India Research Lab, IBM Research
Abstract:
Although the rise of Large Language Models (LLMs) in enterprise settings brings new opportunities and capabilities, it also brings challenges, such as the risk of generating inappropriate, biased, or misleading content that violates regulations and raises legal concerns. To alleviate this, we present "LLMGuard", a tool that monitors user interactions with an LLM application and flags content against specific behaviours or conversation topics. To do this robustly, LLMGuard employs an ensemble of detectors.
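A minimal sketch of the ensemble-of-detectors idea: each detector scores the text, and the interaction is flagged if any score crosses a threshold. The detector names, keyword rules, and threshold below are invented placeholders, not LLMGuard's actual components.

```python
import re

# Hypothetical keyword/regex detectors; a real system would use trained classifiers.
DETECTORS = {
    "pii": lambda text: 1.0 if re.search(r"\b\d{3}-\d{2}-\d{4}\b", text) else 0.0,
    "toxicity": lambda text: 1.0 if "idiot" in text.lower() else 0.0,
    "off_topic": lambda text: 0.7 if "lottery" in text.lower() else 0.0,
}

def guard(text, threshold=0.5):
    """Run every detector; flag the text if any score reaches the threshold."""
    flags = {}
    for name, detector in DETECTORS.items():
        score = detector(text)
        if score >= threshold:
            flags[name] = score
    return flags  # empty dict means the text passes all detectors

print(guard("My SSN is 123-45-6789 and I won the lottery"))
```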



Paperid:2804
Authors:Jiatong Han, Warut Suksompong
National University of Singapore, National University of Singapore
Abstract:
Fair division, the study of how to fairly allocate resources among agents, has received substantial interest in the areas of artificial intelligence and multiagent systems. While there is an extensive theoretical literature on fair division by now, the developed algorithms are still mostly confined to research papers and inaccessible to the public. We attempt to bridge this gap by developing Fast & Fair, an open-source web application that hosts a number of fair allocation algorithms with user-friendly interfaces and explainable outcomes. In contrast to existing implementations, Fast & Fair is a collaborative platform that is open to community contributions and thereby facilitates the deployment of additional algorithms.
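One classic procedure that a platform like this could host is round-robin allocation, which for additive valuations guarantees envy-freeness up to one good (EF1). The sketch below is a generic textbook implementation, not taken from the Fast & Fair codebase.

```python
def round_robin(valuations):
    """valuations[i][g]: agent i's value for good g. Agents take turns picking
    their favourite remaining good; the result is EF1 for additive values."""
    n, m = len(valuations), len(valuations[0])
    remaining = set(range(m))
    bundles = [[] for _ in range(n)]
    turn = 0
    while remaining:
        agent = turn % n
        best = max(remaining, key=lambda g: valuations[agent][g])
        bundles[agent].append(best)
        remaining.remove(best)
        turn += 1
    return bundles

# Two agents, four goods.
print(round_robin([[5, 1, 3, 2],
                   [4, 4, 1, 1]]))   # e.g. [[0, 2], [1, 3]]
```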



Paperid:2805
Authors:Jun Hu, Phil Miller, Michael Lomnitz, Saurabh Farkya, Emre Yilmaz, Aswin Raghavan, David Zhang, Michael Piacentino
SRI International, SRI International, SRI International, SRI International, SRI International, SRI International, SRI International, SRI International
Abstract:
A robotic workshop assistant has been a longstanding grand challenge for robotics, speech, computer vision, and artificial intelligence (AI) research. We revisit the goal of visual identification of tools from human queries in the current era of Large Vision-and-Language models (like GPT-4). We find that current off-the-shelf models (that are trained on internet images) are unable to overcome the domain shift and unable to identify small, obscure tools in cluttered environments. Furthermore, these models are unable to match tools to their intended purpose or affordances. We present a novel system for online domain adaptation that can be run directly on a small on-board processor. The system uses Hyperdimensional Computing (HD), a fast and efficient neuromorphic method. We adapted CLIP to work with explicit ("I need the hammer") and implicit purpose-driven queries ("Drive these nails"), and even with depth images as input. This demo allows the user to try out various real tools and interact via free-form audio.



Paperid:2806
Authors:Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Yuexian Zou, Zhou Zhao, Shinji Watanabe
Zhejiang University, Zhejiang University, Peking University, Carnegie Mellon University, Carnegie Mellon University, Zhejiang University, Renmin University of China, Zhejiang University, Zhejiang University, Zhejiang University, Zhejiang University, Peking University, Zhejiang University, Carnegie Mellon University
Abstract:
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multimodal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs on human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving 16 AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Code can be found at https://github.com/AIGC-Audio/AudioGPT



Paperid:2807
Authors:Saurav Joshi, Filip Ilievski, Jay Pujara
USC Information Sciences Institute, Vrije Universiteit USC Information Sciences Institute, USC Information Sciences Institute
Abstract:
According to WWF, 1.1 billion people lack access to water, and 2.7 billion experience water scarcity at least one month a year. By 2025, two-thirds of the world's population may be facing water shortages. This highlights the urgency of managing water usage efficiently, especially in water-intensive sectors like food. This paper proposes a recommendation engine, powered by knowledge graphs, aiming to facilitate sustainable and healthy food consumption. The engine recommends ingredient substitutes in user recipes that improve nutritional value and reduce environmental impact, particularly water footprint. The system architecture includes source identification, information extraction, schema alignment, knowledge graph construction, and user interface development. The research offers a promising tool for promoting healthier eating habits and contributing to water conservation efforts.



Paperid:2808
Authors:Hsing-Yuan Ma, Hen-Hsen Huang, Chao-Lin Liu
Department of Computer Science, National Chengchi University, Institute of Information Science, Academia Sinica, Department of Computer Science, National Chengchi University
Abstract:
Chinese historical documents, with their unique layouts and reading patterns, pose significant challenges for traditional Optical Character Recognition (OCR) systems. This paper introduces a tailored OCR system designed to address these complexities, particularly emphasizing the crucial aspect of Reading Order Detection (ROD). Our system operates through a threefold process: text detection using the Differential Binarization++ model, text recognition with the SVTR Net, and a novel ROD approach harnessing raw image features. This innovative method for ROD, inspired by human perception, utilizes visual cues present in raw images to deduce the inherent sequence of ancient texts. Preliminary results show promising reductions in page error rates. By preserving both content and context, our system contributes meaningfully to the accurate and contextual digitization of Chinese historical manuscripts.



Paperid:2809
Authors:Mingyu Derek Ma, Alexander K. Taylor, Nuan Wen, Yanchen Liu, Po-Nien Kung, Wenna Qin, Shicheng Wen, Azure Zhou, Diyi Yang, Xuezhe Ma, Nanyun Peng, Wei Wang
UCLA, UCLA, University of Southern California, Stanford University Harvard University, UCLA, Stanford University, University of Southern California, Stanford University, Stanford University, University of Southern California, UCLA, UCLA
Abstract:
We present MIDDAG, an intuitive, interactive system that visualizes the information propagation paths on social media triggered by COVID-19-related news articles, accompanied by comprehensive insights including user/community susceptibility level, as well as events and popular opinions raised by the crowd while propagating the information. Besides discovering information flow patterns among users, we construct communities among users and develop the propagation forecasting capability, enabling tracing and understanding of how information is disseminated at a higher level. A demo video and more are available at https://info-pathways.github.io.



Paperid:2810
Authors:Lokesh Mishra, Cesar Berrospi, Kasper Dinkla, Diego Antognini, Francesco Fusco, Benedikt Bothur, Maksym Lysak, Nikolaos Livathinos, Ahmed Nassar, Panagiotis Vagenas, Lucas Morin, Christoph Auer, Michele Dolfi, Peter Staar
IBM Research, Rüschlikon, Switzerland, IBM Research, Rüschlikon, Switzerland, IBM Research, Rüschlikon, Switzerland, IBM Research, Rüschlikon, Switzerland, IBM Research, Rüschlikon, Switzerland, IBM Technology, Zürich, Switzerland, IBM Research, Rüschlikon, Switzerland, IBM Research, Rüschlikon, Switzerland, IBM Research, Rüschlikon, Switzerland, IBM Research, Rüschlikon, Switzerland, IBM Research, Rüschlikon, Switzerland ETH Zürich, Zürich, Switzerland, IBM Research, Rüschlikon, Switzerland, IBM Research, Rüschlikon, Switzerland, IBM Research, Rüschlikon, Switzerland
Abstract:
We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines: document conversion to a machine-readable format (via computer vision), retrieval of relevant data (via natural language processing), and formulation of an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.



Paperid:2811
Authors:Terufumi Morishita, Yuta Koreeda, Atsuki Yamaguchi, Gaku Morio, Osamu Imaichi, Yasuhiro Sogawa
Hitachi, Hitachi, Hitachi, Hitachi, Hitachi, Hitachi
Abstract:
We propose a source code search system named CHICOT (Code search with HIgh level COnText) to assist developers in reusing existing code. While previous studies have examined code search on the basis of code-level, fine-grained specifications such as functionality, logic, or implementation, CHICOT addresses a unique mission: code search with high-level contextual information, such as the purpose or domain of a developer's project. It achieves this feature by first extracting the context information from codebases and then considering this context during the search. It provides a VSCode plugin for daily coding assistance, and the built-in crawler ensures up-to-date code suggestions. The case study attests to the utility of CHICOT in real-world scenarios.



Paperid:2812
Authors:Bharath Muppasani, Vignesh Narayanan, Biplav Srivastava, Michael N. Huhns
University of South Carolina, University of South Carolina, University of South Carolina, University of South Carolina
Abstract:
In the digital age, understanding the dynamics of information spread and opinion formation within networks is paramount. This research introduces an innovative framework that combines the principles of opinion dynamics with the strategic capabilities of Automated Planning. We have developed, to the best of our knowledge, the first-ever numeric PDDL tailored for opinion dynamics. Our tool empowers users to visualize intricate networks, simulate the evolution of opinions, and strategically influence that evolution to achieve specific outcomes. By harnessing Automated Planning techniques, our framework offers a nuanced approach to devise sequences of actions tailored to transition a network from its current opinion landscape to a desired state. This holistic approach provides insights into the intricate interplay of individual nodes within a network and paves the way for targeted interventions. Furthermore, the tool facilitates human-AI collaboration, enabling users to not only understand information spread but also devise practical strategies to mitigate potential harmful outcomes arising from it. Demo Video link - https://tinyurl.com/3k7bp99h



Paperid:2813
Authors:Kaushik Roy, Vedant Khandelwal, Valerie Vera, Harshul Surana, Heather Heckman, Amit Sheth
AI Institute, University of South Carolina, AI Institute, University of South Carolina, University Libraries, University of South Carolina, AI Institute, University of South Carolina, University Libraries, University of South Carolina, AI Institute, University of South Carolina
Abstract:
This paper addresses the time-intensive nature of systematic reviews (SRs) and proposes a solution leveraging advancements in Generative AI (e.g., ChatGPT) and external knowledge augmentation (e.g., Retrieval-Augmented Generation). The proposed system, GEAR-Up, automates query development and translation in SRs, enhancing efficiency by enriching user queries with context from language models and knowledge graphs. Qualitative evaluations conducted in collaboration with librarians demonstrate improved reproducibility and search-strategy quality. Access the demo at https://youtu.be/zMdP56GJ9mU.



Paperid:2814
Authors:Trinita Roy, Asheesh Kumar, Daksh Raghuvanshi, Siddhant Jain, Goutham Vignesh, Kartik Shinde, Rohan Tondulkar
SciSpace, SciSpace, SciSpace, SciSpace, SciSpace, SciSpace, SciSpace
Abstract:
We introduce SciSpace Copilot, an AI research assistant that helps users read and understand research papers faster by providing a plethora of features. Answering questions from a document has recently become popular using the Retrieval Augmented Generation (RAG) approach. Our tool uses an advanced question-answering pipeline to get accurate answers and also provides exact citations for them. We provide many more valuable features on scientific text, including generating explanations, generating summaries, adding notes and highlights, and finding related papers from our corpus of 200 million papers. Our tool supports 100+ languages, making research more accessible across language barriers. Thousands of users use SciSpace Copilot on a daily basis, uploading their articles to understand research faster and better. Our tool can be accessed at this link: https://typeset.io.



Paperid:2815
Authors:Soumyendu Sarkar, Ashwin Ramesh Babu, Sajad Mousavi, Vineet Gundecha, Avisek Naug, Sahand Ghorbanpour
Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise
Abstract:
We present a generic Reinforcement Learning (RL) framework optimized for crafting adversarial attacks on different model types spanning ECG signal analysis (1D), image classification (2D), and video classification (3D). The framework focuses on identifying sensitive regions and inducing misclassifications with minimal distortions and various distortion types. The novel RL method outperforms state-of-the-art methods for all three applications, proving its efficiency. Our RL approach produces superior localization masks, enhancing interpretability for image classification and ECG analysis models. For applications such as ECG analysis, our platform highlights critical ECG segments for clinicians while ensuring resilience against prevalent distortions. This comprehensive tool aims to bolster both resilience, with adversarial training, and transparency across varied applications and data types.



Paperid:2816
Authors:Soumyendu Sarkar, Avisek Naug, Antonio Guillen, Ricardo Luna, Vineet Gundecha, Ashwin Ramesh Babu, Sajad Mousavi
Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise, Hewlett Packard Enterprise
Abstract:
The rapid growth of machine learning (ML) has led to an increased demand for computational power, resulting in larger data centers (DCs) and higher energy consumption. To address this issue and reduce carbon emissions, intelligent design and control of DC components such as IT servers, cabinets, HVAC cooling, flexible load shifting, and battery energy storage are essential. However, the complexity of designing and controlling them in tandem presents a significant challenge. While some individual components like CFD-based design and Reinforcement Learning (RL) based HVAC control have been researched, there's a gap in the holistic design and optimization covering all elements simultaneously. To tackle this, we've developed DCRL-Green, a multi-agent RL environment that empowers the ML community to design data centers and research, develop, and refine RL controllers for carbon footprint reduction in DCs. It is a flexible, modular, scalable, and configurable platform that can handle large High Performance Computing (HPC) clusters. Furthermore, in its default setup, DCRL-Green provides a benchmark for evaluating single as well as multi-agent RL algorithms. It easily allows users to subclass the default implementations and design their own control approaches, encouraging community development for sustainable data centers. Open Source Link: https://github.com/HewlettPackard/dc-rl



Paperid:2817
Authors:Mukul Singh, Gust Verbruggen, José Cambronero, Vu Le, Sumit Gulwani
Microsoft, Microsoft, Microsoft, Microsoft, Microsoft
Abstract:
Tools that help with email folder management are limited, as users have to manually write rules to assign emails to folders. We present EMFORE, an iterative learning system that automatically learns and updates such rules from observations. EMFORE is fast enough to suggest and update rules in real time and suppresses low-confidence suggestions to reduce the number of false positives. EMFORE can use different rule grammars, and thus be adapted to different clients, without changing the user experience. Previous methods do not learn rules, require complete retraining or multiple new examples after making a mistake, and do not distinguish between the inbox and other folders. EMFORE learns rules incrementally and can make the neutral decision of leaving emails in the inbox, making it an ideal candidate for integration in email clients.
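To illustrate the incremental, confidence-gated behaviour described above, here is a toy stand-in (not EMFORE's rule grammar or learner): it keeps per-sender counts, suggests a folder only once hypothetical support and confidence thresholds are met, and otherwise stays neutral by leaving mail in the inbox.

```python
from collections import defaultdict

class SenderFolderRules:
    """Toy incremental rule learner: update counts one observation at a time,
    suggest a folder only when the sender-based rule is confident, and stay
    neutral (inbox) otherwise."""

    def __init__(self, min_support=3, min_confidence=0.8):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.min_support = min_support
        self.min_confidence = min_confidence

    def observe(self, sender, folder):
        self.counts[sender][folder] += 1      # update after each user filing action

    def suggest(self, sender):
        folders = self.counts[sender]
        total = sum(folders.values())
        if total < self.min_support:
            return "inbox"                    # neutral: not enough evidence yet
        folder, hits = max(folders.items(), key=lambda kv: kv[1])
        return folder if hits / total >= self.min_confidence else "inbox"

rules = SenderFolderRules()
for _ in range(4):
    rules.observe("newsletter@shop.com", "Promotions")
print(rules.suggest("newsletter@shop.com"))   # -> Promotions
print(rules.suggest("boss@work.com"))         # -> inbox (neutral)
```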



Paperid:2818
Authors:Inge Vejsbjerg, Elizabeth M. Daly, Rahul Nair, Svetoslav Nizhnichenkov
IBM Research, IBM Research, IBM Research, IBM Research
Abstract:
Bias mitigation algorithms differ in their definition of bias and in how they go about achieving that objective. They impact different cohorts differently, and allowing end users and data scientists to understand the impact of these differences in order to make informed choices is a relatively unexplored domain. This demonstration presents an interactive bias mitigation pipeline that allows users to understand the cohorts impacted by their algorithm choice and provide feedback, yielding a bias-mitigated pipeline that best aligns with their goals.



Paperid:2819
Authors:Jiaying Wang, Shuailing Hao, Jing Shan, Xiaoxu Song
Shenyang University of Technology, Shenyang University of Technology, Shenyang University of Technology, Shenyang University of Technology
Abstract:
Visual Language is a multi-task online system for e-commerce that generates accurate product descriptions for sellers and provides a convenient product retrieval service for customers. To achieve this goal, the system adopts image description technology and multi-modal retrieval technology. By utilizing cross-modal generation techniques, we help sellers upload products rapidly and customers retrieve them rapidly, improving the experience of both sellers and customers.



Paperid:2820
Authors:Kuang-Da Wang, Yu-Tse Chen, Yu-Heng Lin, Wei-Yao Wang, Wen-Chih Peng
National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University, National Yang Ming Chiao Tung University
Abstract:
We present the CoachAI Badminton Environment, a reinforcement learning (RL) environment tailored for AI-driven sports analytics. In contrast to traditional environments using rule-based opponents or simplistic physics-based randomness, our environment integrates authentic opponent AIs and realistic randomness derived from real-world match data to bridge the performance gap encountered in real-game deployments. This novel feature enables RL agents to seamlessly adapt to genuine scenarios. The CoachAI Badminton Environment empowers researchers to validate strategies in intricate real-world settings, offering: i) Realistic opponent simulation for RL training; ii) Visualizations for evaluation; and iii) Performance benchmarks for assessing agent capabilities. By bridging the RL environment with actual badminton games, our environment is able to advance the discovery of winning strategies for players. Our code is available at https://github.com/wywyWang/CoachAI-Projects/tree/main/Strategic%20Environment.



Paperid:2821
Authors:Umer Waqas, Yunwan Jeon, Donghun Lee
AItheNutrigene, AItheNutrigene, AItheNutrigene
Abstract:
A significant upsurge in the fashion e-commerce industry in recent years has brought considerable attention to image-based virtual fitting. This image-based technology allows users to try on clothes virtually without physically touching them. However, current techniques have notable limitations regarding real-world scenarios, noisy results, partial clothing categories, and computational cost, thus limiting real-world applications. To address these critical limitations, we propose a hybrid interactive network that allows actual users to interact with the system to try on clothes virtually. The network is composed of state-of-the-art keypoint extraction, appearance flow alteration, and warping modules. The proposed network facilitates real-time application with high-quality, noise-free results, a variety of clothing categories, and efficient computational cost.



Paperid:2822
Authors:Lianlong Wu, Seewon Choi, Daniel Raggi, Aaron Stockdill, Grecia Garcia Garcia, Fiorenzo Colarusso, Peter C.H. Cheng, Mateja Jamnik
University of Cambridge, University of Cambridge, University of Cambridge, University of Sussex, University of Sussex, University of Sussex, University of Sussex, University of Cambridge
Abstract:
In this paper we introduce MaRE, a tool designed to generate representations in multiple modalities for a given mathematical problem while ensuring the correctness and interpretability of the transformations between different representations. The theoretical foundation for this tool is Representational Systems Theory (RST), a mathematical framework for studying the structure and transformations of representations. In MaRE’s web front-end user interface, a set of probability equations in Bayesian notation can be rigorously transformed into Area Diagrams, Contingency Tables, and Probability Trees with just one click, utilising a back-end engine based on RST. A table of the cognitive costs that a representation places on a particular user profile, based on the cognitive Representational Interpretive Structure Theory (RIST), is produced at the same time. MaRE is general and domain independent, applicable to other representations encoded in RST. It may enhance mathematical education and research, facilitating multi-modal knowledge representation and discovery.



Paperid:2823
Authors:Tiancheng Zhang, Shaoyuan Huang, Cheng Zhang, Xiaofei Wang, Wenyu Wang
Tianjin University, Tianjin University, Tianjin University of Finance and Economics, Tianjin University, Paiou Cloud Computing (Shanghai) Co., Ltd, Shanghai, China
Abstract:
Responding to the escalating interest in long-term forecasting within the industry, we introduce EasyTS, a comprehensive toolkit engineered to streamline data collection, analysis, and model creation procedures. EasyTS acts as a unified solution, driving progress in long-term time series forecasting. The platform provides effortless access to various time series datasets, including a newly open-sourced multi-scenario dataset in the electricity domain. Integrated visualization and analysis tools help unveil inherent data features and relationships. EasyTS facilitates a user-friendly model validation approach with versatile evaluation criteria. This toolkit allows researchers to compare their models proficiently against renowned benchmarks. With our ongoing commitment to expanding our dataset collection and enhancing toolkit functionalities, we aspire to contribute significantly to the time series forecasting domain. Code is available at this repository: https://github.com/EdgeBigBang/EasyTS.git.



Paperid:2824
Authors:Zeyuan Zhang, Tanmay Laud, Zihang He, Xiaojie Chen, Xinshuang Liu, Zhouhang Xie, Julian McAuley, Zhankui He
University of California, San Diego, University of California, San Diego, University of California, San Diego, University of California, San Diego, University of California, San Diego, University of California, San Diego, University of California, San Diego, University of California, San Diego
Abstract:
We present a new Python toolkit called RecWizard for Conversational Recommender Systems (CRS). RecWizard offers support for model development and an interactive user interface, drawing from the best practices of the Huggingface ecosystem. CRSs built with RecWizard are modular, portable, interactive, and Large Language Model (LLM)-friendly, streamlining the learning process and reducing the additional effort required for CRS research. For more comprehensive information about RecWizard, please check our GitHub: https://github.com/McAuley-Lab/RecWizard.



Paperid:2825
Authors:Runcong Zhao, Wenjia Zhang, Jiazheng Li, Lixing Zhu, Yanran Li, Yulan He, Lin Gui
King's College London, King's College London University of Warwick, King's College London, King's College London, Independent Researcher, King's College London University of Warwick The Alan Turing Institute, King's College London
Abstract:
In this demo, we present NarrativePlay, an innovative system enabling users to role-play a fictional character and interact with dynamically generated narrative environments. Unlike existing predefined sandbox approaches, NarrativePlay centres around the main storyline events extracted from the narrative, allowing users to experience the story from the perspective of a character they choose. To design versatile AI agents for diverse scenarios, we employ a framework built on Large Language Models (LLMs) to extract detailed character traits from text. We also incorporate automatically generated visual displays of narrative settings, character portraits, and character speech, greatly enhancing the overall user experience.



Paperid:2826
Authors:Micheal Abaho, Yousef H. Alfaifi
University of Liverpool, United Kingdom, Faculty of Computers and Information Technology, University of Tabuk, Tabuk, Saudi Arabia
Abstract:
Injecting textual information into knowledge graph (KG) entity representations has been a worthwhile expedition in terms of improving performance in KG-oriented tasks within the NLP community. External knowledge adopted to enhance KG embeddings ranges from semantically rich lexical dependency-parsed features, to sets of relevant keywords, to entire text descriptions supplied from an external corpus such as Wikipedia. Despite the gains this innovation (text-enhanced KG embeddings) has made, the proposal in this work suggests that it can be improved even further. Instead of using a single text description (which would not sufficiently represent an entity because of the inherent lexical ambiguity of text), we propose a multi-task framework that jointly selects a set of text descriptions relevant to KG entities and aligns or augments KG embeddings with those descriptions. Different from prior work that plugs in formal entity descriptions declared in knowledge bases, this framework leverages a retriever model to selectively identify richer or highly relevant text descriptions to use in augmenting entities. Furthermore, the framework treats the number of descriptions to use in the augmentation process as a parameter, which allows the flexibility of enumerating several candidate numbers before identifying an appropriate one. Experiment results for Link Prediction demonstrate increases of 5.5% and 3.5% in Mean Reciprocal Rank (MRR) and Hits@10 scores, respectively, in comparison to text-enhanced knowledge graph augmentation methods using traditional CNNs.



Paperid:2827
Authors:Saqib Ameen, Levi H. S. Lelis
Department of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Canada, Department of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Canada
Abstract:
Cost-guided bottom-up search (BUS) algorithms use a cost function to guide the search to solve program synthesis tasks. In this paper, we show that current state-of-the-art cost-guided BUS algorithms suffer from a common problem: they can lose useful information given by the model and fail to perform the search in a best-first order according to a cost function. We introduce a novel best-first bottom-up search algorithm, which we call Bee Search, that does not suffer information loss and is able to perform cost-guided bottom-up synthesis in a best-first manner. Importantly, Bee Search performs best-first search with respect to the generation of programs, i.e., it does not even create in memory programs that are more expensive than the solution program. It attains best-first ordering with respect to generation by performing a search in an abstract space of program costs. We also introduce a new cost function that better uses the information provided by an existing cost model. Empirical results on string manipulation and bit-vector tasks show that Bee Search can outperform existing cost-guided BUS approaches when employing more complex domain-specific languages (DSLs); Bee Search and previous approaches perform equally well with simpler DSLs. Furthermore, our new cost function with Bee Search outperforms previous cost functions on string manipulation tasks.
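To make "best-first order according to a cost function" concrete, here is a generic cost-ordered bottom-up enumerator for a toy arithmetic DSL: programs are popped from a priority queue in non-decreasing cost order and combined with previously emitted programs. This is only an illustration of cost-guided bottom-up enumeration, not the Bee Search algorithm itself (which searches an abstract space of program costs); the terminal and operator costs are made up.

```python
import heapq
import itertools

TERMINALS = {"x": 1.0, "1": 1.0}      # expression -> cost (assumed)
OPS = {"+": 1.0, "*": 2.0}            # operator -> cost (assumed)

def best_first_enumerate(limit=10):
    """Yield expressions in non-decreasing cost order by always popping the
    cheapest program and combining it with previously emitted ones."""
    counter = itertools.count()                  # tie-breaker for the heap
    heap = [(cost, next(counter), expr) for expr, cost in TERMINALS.items()]
    heapq.heapify(heap)
    bank, emitted = [], []                       # programs already emitted, with costs
    while heap and len(emitted) < limit:
        cost, _, expr = heapq.heappop(heap)
        emitted.append((cost, expr))
        for other_cost, other in bank:           # combine with cheaper-or-equal programs
            for op, op_cost in OPS.items():
                new_cost = cost + other_cost + op_cost
                heapq.heappush(heap, (new_cost, next(counter), f"({expr} {op} {other})"))
                heapq.heappush(heap, (new_cost, next(counter), f"({other} {op} {expr})"))
        bank.append((cost, expr))
    return emitted

for cost, expr in best_first_enumerate():
    print(cost, expr)
```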



Paperid:2828
Authors:Pasquale Antonante, Heath Nilsen, Luca Carlone
Massachusetts Institute of Technology, Massachusetts Institute of Technology, Massachusetts Institute of Technology
Abstract:
This paper investigates runtime monitoring of perception systems. Perception is a critical component of high-integrity applications of robotics and autonomous systems, such as self-driving cars. In these applications, failure of perception systems may put human life at risk, and a broad adoption of these technologies requires the development of methodologies to guarantee and monitor safe operation. Despite the paramount importance of perception, currently there is no formal approach for system-level perception monitoring. In this paper, we formalize the problem of runtime fault detection and identification in perception systems and present a framework to model diagnostic information using a diagnostic graph. We then provide a set of deterministic, probabilistic, and learning-based algorithms that use diagnostic graphs to perform fault detection and identification. Moreover, we investigate fundamental limits and provide deterministic and probabilistic guarantees on the fault detection and identification results. We conclude the paper with an extensive experimental evaluation, which recreates several realistic failure modes in the LGSVL open-source autonomous driving simulator, and applies the proposed system monitors to a state-of-the-art autonomous driving software stack (Baidu's Apollo Auto). The results show that the proposed system monitors outperform baselines, have the potential of preventing accidents in realistic autonomous driving scenarios, and incur a negligible computational overhead.



Paperid:2829
Authors:Alexander Braylan, Madalyn Marabella, Omar Alonso, Matthew Lease
Dept. of Computer Science, University of Texas at Austin, Dept. of Computer Science, University of Texas at Austin, Amazon, School of Information, University of Texas at Austin
Abstract:
Human annotations are vital to supervised learning, yet annotators often disagree on the correct label, especially as annotation tasks increase in complexity. A common strategy to improve label quality is to ask multiple annotators to label the same item and then aggregate their labels. To date, many aggregation models have been proposed for simple categorical or numerical annotation tasks, but far less work has considered more complex annotation tasks, such as those involving open-ended, multivariate, or structured responses. Similarly, while a variety of bespoke models have been proposed for specific tasks, our work is the first we are aware of to introduce aggregation methods that generalize across many, diverse complex tasks, including sequence labeling, translation, syntactic parsing, ranking, bounding boxes, and keypoints. This generality is achieved by applying readily available task-specific distance functions, then devising a task-agnostic method to model these distances between labels, rather than the labels themselves. This article presents a unified treatment of our prior work on complex annotation modeling and extends that work with investigation of three new research questions. First, how do complex annotation task and dataset properties impact aggregation accuracy? Second, how should a task owner navigate the many modeling choices in order to maximize aggregation accuracy? Finally, what tests and diagnoses can verify that aggregation models are specified correctly for the given data? To understand how various factors impact accuracy and to inform model selection, we conduct large-scale simulation studies and broad experiments on real, complex datasets. Regarding testing, we introduce the concept of unit tests for aggregation models and present a suite of such tests to ensure that a given model is not mis-specified and exhibits expected behavior. Beyond investigating these research questions above, we discuss the foundational concept and nature of annotation complexity, present a new aggregation model as a conceptual bridge between traditional models and our own, and contribute a new general semi-supervised learning method for complex label aggregation that outperforms prior work.
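The core idea of modelling distances between labels rather than the labels themselves can be illustrated with a very simple baseline: given any task-specific distance function, return the annotation with the smallest average distance to the others. This sketch is a toy stand-in, not the article's probabilistic aggregation models.

```python
def aggregate_by_distance(annotations, distance):
    """Return the annotation closest on average to all others, for any
    task-specific distance function (sequences, boxes, parses, ...)."""
    def avg_dist(a):
        others = [b for b in annotations if b is not a]
        return sum(distance(a, b) for b in others) / len(others)
    return min(annotations, key=avg_dist)

# Toy example: bounding boxes with an L1 distance between corner coordinates.
boxes = [(10, 10, 50, 50), (12, 11, 49, 52), (40, 40, 90, 90)]
l1 = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
print(aggregate_by_distance(boxes, l1))   # the outlier box is not selected
```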



Paperid:2830
Authors:Tzu-Yi Chiu, Jerome Le Ny, Jean-Pierre David
Electrical Engineering Department, Ecole Polytechnique Montreal, Electrical Engineering Department, Ecole Polytechnique Montreal, Electrical Engineering Department, Ecole Polytechnique Montreal
Abstract:
For many automated perception and decision tasks, state-of-the-art performance may be obtained by algorithms that are too complex for their behavior to be completely understandable or predictable by human users, e.g., because they employ large machine learning models. To integrate these algorithms into safety-critical decision and control systems, it is particularly important to develop methods that can promote trust into their decisions and help explore their failure modes. In this article, we combine the anchors methodology with Monte Carlo Tree Search to provide local model-agnostic explanations for the behaviors of a given black-box model making decisions by processing time-varying input signals. Our approach searches for descriptive explanations for these decisions in the form of properties of the input signals, expressed in Signal Temporal Logic, which are highly likely to reproduce the observed behavior. To illustrate the methodology, we apply it in simulations to the analysis of a hybrid (continuous-discrete) control system and a collision avoidance system for unmanned aircraft (ACAS Xu) implemented by a neural network.



Paperid:2831
Authors:Giuseppe De Giacomo, Dror Fried, Fabio Patrizi, Shufang Zhu
University of Oxford, The Open University of Israel, Sapienza University of Rome, University of Oxford
Abstract:
Devising a strategy to make a system mimic behaviors from another system is a problem that naturally arises in many areas of Computer Science. In this work, we interpret this problem in the context of intelligent agents, from the perspective of LTLf, a formalism commonly used in AI for expressing finite-trace properties. Our model consists of two separated dynamic domains, D_A and D_B, and an LTLf specification that formalizes the notion of mimicking by mapping properties on behaviors (traces) of D_A into properties on behaviors of D_B. The goal is to synthesize a strategy that step-by-step maps every behavior of D_A into a behavior of D_B so that the specification is met. We consider several forms of mapping specifications, ranging from simple ones to full LTLf, and for each, we study synthesis algorithms and computational properties.



Paperid:2832
Authors:Eoin Delaney, Arjun Pakrashi, Derek Greene, Mark T. Keane
School of Computer Science, University College Dublin, Belfield, Dublin, Ireland Insight Centre for Data Analytics, Belfield, Dublin, Ireland VistaMilk SFI Research Centre, Belfield, Dublin, Ireland, School of Computer Science, University College Dublin, Belfield, Dublin, Ireland VistaMilk SFI Research Centre, Belfield, Dublin, Ireland, School of Computer Science, University College Dublin, Belfield, Dublin, Ireland Insight Centre for Data Analytics, Belfield, Dublin, Ireland VistaMilk SFI Research Centre, Belfield, Dublin, Ireland, School of Computer Science, University College Dublin, Belfield, Dublin, Ireland Insight Centre for Data Analytics, Belfield, Dublin, Ireland VistaMilk SFI Research Centre, Belfield, Dublin, Ireland
Abstract:
Counterfactual explanations have emerged as a popular solution for the eXplainable AI (XAI) problem of elucidating the predictions of black-box deep-learning systems because people easily understand them, they apply across different problem domains and seem to be legally compliant. Although over 100 counterfactual methods exist in the XAI literature, each claiming to generate plausible explanations akin to those preferred by people, few of these methods have actually been tested on users (∼7%). Even fewer studies adopt a user-centered perspective; for instance, asking people for their counterfactual explanations to determine their perspective on a “good explanation”. This gap in the literature is addressed here using a novel methodology that (i) gathers human-generated counterfactual explanations for misclassified images, in two user studies and, then, (ii) compares these human-generated explanations to computationally-generated explanations for the same misclassifications. Results indicate that humans do not “minimally edit” images when generating counterfactual explanations. Instead, they make larger, “meaningful” edits that better approximate prototypes in the counterfactual class. An analysis based on “explanation goals” is proposed to account for this divergence between human and machine explanations. The implications of these proposals for future work are discussed.



Paperid:2833
Authors:Lewis Hammond, James Fox, Tom Everitt, Ryan Carey, Alessandro Abate, Michael Wooldridge
University of Oxford, United Kingdom, University of Oxford, United Kingdom, DeepMind, United Kingdom, University of Oxford, United Kingdom, University of Oxford, United Kingdom, University of Oxford, United Kingdom
Abstract:
Causal reasoning and game-theoretic reasoning are fundamental topics in artificial intelligence, among many other disciplines: this paper is concerned with their intersection. Despite their importance, a formal framework that supports both these forms of reasoning has, until now, been lacking. We offer a solution in the form of (structural) causal games, which can be seen as extending Pearl's causal hierarchy to the game-theoretic domain, or as extending Koller and Milch's multi-agent influence diagrams to the causal domain. We then consider three key questions: i) How can the (causal) dependencies in games – either between variables, or between strategies – be modelled in a uniform, principled manner? ii) How may causal queries be computed in causal games, and what assumptions does this require? iii) How do causal games compare to existing formalisms? To address question i), we introduce mechanised games, which encode dependencies between agents' decision rules and the distributions governing the game. In response to question ii), we present definitions of predictions, interventions, and counterfactuals, and discuss the assumptions required for each. Regarding question iii), we describe correspondences between causal games and other formalisms, and explain how causal games can be used to answer queries that other causal or game-theoretic models do not support. Finally, we highlight possible applications of causal games, aided by an extensive open-source Python library.



Paperid:2834
Authors:Matthew J. Holland, Kazuki Tanabe
Osaka University, Ibaraki, Osaka 567-0047 Japan, Osaka University, Ibaraki, Osaka 567-0047 Japan
Abstract:
Virtually all machine learning tasks are characterized using some form of loss function, and "good performance" is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about tradeoffs. In this work, we survey and introduce a wide variety of non-traditional criteria used to design and evaluate machine learning algorithms, place the classical paradigm within the proper historical context, and propose a view of learning problems which emphasizes the question of "what makes for a desirable loss distribution?" in place of tacit use of the expected loss.
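As one worked example of a non-traditional criterion of the kind surveyed here, the sketch below contrasts the average loss with the conditional value-at-risk (CVaR), the mean of the worst tail of the loss distribution; the synthetic losses are illustrative only.

```python
import numpy as np

def cvar(losses, alpha=0.9):
    """Mean of the worst (1 - alpha) fraction of losses."""
    losses = np.sort(np.asarray(losses))
    tail = losses[int(np.ceil(alpha * len(losses))):]
    return tail.mean()

rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(1.0, 0.1, 990),   # typical cases
                         rng.normal(8.0, 0.5, 10)])   # rare heavy-loss cases
print("mean loss:", losses.mean().round(3))              # barely moved by the tail
print("CVaR(0.99):", cvar(losses, alpha=0.99).round(3))  # dominated by the tail
```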



Paperid:2835
Authors:Kai-Chieh Hsu, Allen Z. Ren, Duy P. Nguyen, Anirudha Majumdar, Jaime F. Fisac
Department of Electrical and Computer Engineering, Princeton University, United States, Department of Mechanical and Aerospace Engineering, Princeton University, United States, Department of Electrical and Computer Engineering, Princeton University, United States, Department of Mechanical and Aerospace Engineering, Princeton University, United States, Department of Electrical and Computer Engineering, Princeton University, United States
Abstract:
Safety is a critical component of autonomous systems and remains a challenge for learning-based policies to be utilized in the real world. In particular, policies learned using reinforcement learning often fail to generalize to novel environments due to unsafe behavior. In this paper, we propose Sim-to-Lab-to-Real to bridge the reality gap with a probabilistically guaranteed safety-aware policy distribution. To improve safety, we apply a dual policy setup where a performance policy is trained using the cumulative task reward and a backup (safety) policy is trained by solving the Safety Bellman Equation based on Hamilton-Jacobi (HJ) reachability analysis. In Sim-to-Lab transfer, we apply a supervisory control scheme to shield unsafe actions during exploration; in Lab-to-Real transfer, we leverage the Probably Approximately Correct (PAC)-Bayes framework to provide lower bounds on the expected performance and safety of policies in unseen environments. Additionally, inheriting from the HJ reachability analysis, the bound accounts for the expectation over the worst-case safety in each environment. We empirically study the proposed framework for ego-vision navigation in two types of indoor environments with varying degrees of photorealism. We also demonstrate strong generalization performance through hardware experiments in real indoor spaces with a quadrupedal robot. See https://sites.google.com/princeton.edu/sim-to-lab-to-real for supplementary material.
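A minimal sketch of the supervisory ("shielding") scheme described for Sim-to-Lab transfer, with placeholder policies and a hypothetical safety critic: the backup policy overrides the performance policy whenever the critic predicts the proposed action leaves the safe set. The sign convention and toy dynamics here are assumptions, not the paper's setup.

```python
def shielded_action(state, perf_policy, backup_policy, safety_value, threshold=0.0):
    """Use the performance policy unless the safety critic predicts the
    resulting state-action pair falls outside the safe set."""
    action = perf_policy(state)
    if safety_value(state, action) < threshold:   # assumed convention: >= 0 means safe
        action = backup_policy(state)             # fall back to the safety policy
    return action

# Toy 1-D example: stay within |position| <= 1.
perf_policy   = lambda s: 1.0                     # always push right
backup_policy = lambda s: -1.0 if s > 0 else 1.0  # push back toward the origin
safety_value  = lambda s, a: 1.0 - abs(s + 0.1 * a)

print(shielded_action(0.2, perf_policy, backup_policy, safety_value))   # 1.0 (safe)
print(shielded_action(0.95, perf_policy, backup_policy, safety_value))  # -1.0 (shielded)
```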



Paperid:2836
Authors:Md Shahriar Iqbal, Jianhai Su, Lars Kotthoff, Pooyan Jamshidi
University of South Carolina, Columbia, SC, USA, University of South Carolina, Columbia, SC, USA, University of Wyoming, Laramie, WY, USA, University of South Carolina, Columbia, SC, USA
Abstract:
The design of machine learning systems often requires trading off different objectives, for example, prediction error and energy consumption for deep neural networks (DNNs). Typically, no single design performs well in all objectives; therefore, finding Pareto-optimal designs is of interest. The search for Pareto-optimal designs involves evaluating designs in an iterative process, and the measurements are used to evaluate an acquisition function that guides the search process. However, measuring different objectives incurs different costs. For example, the cost of measuring the prediction error of DNNs is orders of magnitude higher than that of measuring the energy consumption of a pre-trained DNN as it requires re-training the DNN. Current state-of-the-art methods do not consider this difference in objective evaluation cost, potentially incurring expensive evaluations of objective functions in the optimization process. In this paper, we develop a novel decoupled and cost-aware multi-objective optimization algorithm, which we call Flexible Multi-Objective Bayesian Optimization (FlexiBO) to address this issue. For evaluating each design, FlexiBO selects the objective with higher relative gain by weighting the improvement of the hypervolume of the Pareto region with the measurement cost of each objective. This strategy, therefore, balances the expense of collecting new information with the knowledge gained through objective evaluations, preventing FlexiBO from performing expensive measurements for little to no gain. We evaluate FlexiBO on seven state-of-the-art DNNs for image recognition, natural language processing (NLP), and speech-to-text translation. Our results indicate that, given the same total experimental budget, FlexiBO discovers designs with 4.8% to 12.4% lower hypervolume error than the best method in state-of-the-art multi-objective optimization.
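The decoupled, cost-aware selection rule described here can be illustrated in a few lines: among the objectives, evaluate the one whose estimated hypervolume improvement per unit measurement cost is largest. The gain estimates and costs below are made-up placeholders, not FlexiBO's acquisition computation.

```python
def pick_objective(expected_hv_gain, measurement_cost):
    """Choose the objective with the largest expected hypervolume
    improvement per unit of evaluation cost."""
    ratios = {name: expected_hv_gain[name] / measurement_cost[name]
              for name in expected_hv_gain}
    return max(ratios, key=ratios.get), ratios

# Hypothetical numbers: prediction error is informative but expensive to measure
# (it requires re-training), so the cheaper energy objective may win this round.
gain = {"prediction_error": 0.05, "energy": 0.02}
cost = {"prediction_error": 50.0, "energy": 2.0}     # e.g. GPU-hours per measurement
print(pick_objective(gain, cost))                    # -> ('energy', {...})
```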



Paperid:2837
Authors:Zachary Kenton, Ramana Kumar, Sebastian Farquhar, Jonathan Richens, Matt MacDermott, Tom Everitt
DeepMind, United Kingdom of Great Britain and Northern Ireland, DeepMind, United Kingdom of Great Britain and Northern Ireland, DeepMind, United Kingdom of Great Britain and Northern Ireland, DeepMind, United Kingdom of Great Britain and Northern Ireland, Imperial College London, United Kingdom of Great Britain and Northern Ireland, DeepMind, United Kingdom of Great Britain and Northern Ireland
Abstract:
Causal models of agents have been used to analyse the safety aspects of machine learning systems. But identifying agents is non-trivial – often the causal model is just assumed by the modeller without much justification – and modelling failures can lead to mistakes in the safety analysis. This paper proposes the first formal causal definition of agents – roughly that agents are systems that would adapt their policy if their actions influenced the world in a different way. From this we derive the first causal discovery algorithm for discovering the presence of agents from empirical data, given a set of variables and under certain assumptions. We also provide algorithms for translating between causal models and game-theoretic influence diagrams. We demonstrate our approach by resolving some previous confusions caused by incorrect causal modelling of agents.



Paperid:2838
Authors:W. Bradley Knox, Alessandro Allievi, Holger Banzhaf, Felix Schmitt, Peter Stone
Robert Bosch LLC , United States of America The University of Texas at Austin, United States of America, Robert Bosch LLC , United States of America The University of Texas at Austin, United States of America, Robert Bosch GmbH, Germany, Bosch Center for Artificial Intelligence, Germany, The University of Texas at Austin, United States of America Sony AI, United States of America
Abstract:
This article considers the problem of diagnosing certain common errors in reward design. Its insights are also applicable to the design of cost functions and performance metrics more generally. To diagnose common errors, we develop 8 simple sanity checks for identifying flaws in reward functions. We survey research that is published in top-tier venues and focuses on reinforcement learning (RL) for autonomous driving (AD). Specifically, we closely examine the reported reward function in each publication and present these reward functions in a complete and standardized format in the appendix. Wherever we have sufficient information, we apply the 8 sanity checks to each surveyed reward function, revealing near-universal flaws in reward design for AD that might also exist pervasively across reward design for other tasks. Lastly, we explore promising directions that may aid the design of reward functions for AD in subsequent research, following a process of inquiry that can be adapted to other domains.



Paperid:2839
Authors:Vid Kocijan, Ernest Davis, Thomas Lukasiewicz, Gary Marcus, Leora Morgenstern
Kumo.ai, 357 Castro Street, Suite 200 Mountain View, CA 94041, United States, New York University, Department of Computer Science, 251 Mercer St, NY 10012, United States, Institute of Logic and Computation, Vienna University of Technology, Austria Department of Computer Science, University of Oxford, UK, New York University, New York, NY 10012, United States, Palo Alto Research Center, part of SRI International, 3333 Coyote Hill Rd, Palo Alto, CA 94304, United States
Abstract:
The Winograd Schema Challenge—a set of twin sentences involving pronoun reference disambiguation that seem to require the use of commonsense knowledge—was proposed by Hector Levesque in 2011. By 2019, a number of AI systems, based on large pre-trained transformer-based language models and fine-tuned on these kinds of problems, achieved better than 90% accuracy. In this paper, we review the history of the Winograd Schema Challenge and discuss the lasting contributions of the flurry of research that has taken place on the WSC in the last decade. We discuss the significance of various datasets developed for WSC, and the research community's deeper understanding of the role of surrogate tasks in assessing the intelligence of an AI system.



Paperid:2840
Authors:Jian Li, Yong Liu, Weiping Wang
Institute of Information Engineering, Chinese Academy of Sciences, China, Gaoling School of Artificial Intelligence, Renmin University of China, China, Institute of Information Engineering, Chinese Academy of Sciences, China
Abstract:
Kernel methods are powerful tools to capture nonlinear patterns behind given data but often lead to poor performance on complicated tasks compared to convolutional neural networks. The reason is that kernel methods are still shallow and fully connected models, failing to reveal hierarchical features and local interdependencies. In this paper, to acquire hierarchical and local knowledge, we incorporate kernel methods with deep architectures and convolutional operators in a spectral kernel learning framework. Based on the inverse Fourier transform and Rademacher complexity theory, we provide the generalization error bounds for the proposed model and prove that under suitable initialization, deeper networks lead to tighter error bounds. Inspired by the theoretical findings, we finally complete the convolutional spectral kernel network (CSKN) with two additional regularizers and an initialization strategy. Extensive ablation results validate the effectiveness of the nonstationary spectral kernel, multiple layers, additional regularizers, and the convolutional filters, which coincide with our theoretical findings. We further devise a VGG-type 8-layer CSKN, and it outperforms existing kernel-based networks and popular CNN models on medium-sized image classification tasks.
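As an illustration of the spectral-kernel idea invoked above (the inverse-Fourier view of shift-invariant kernels), the following minimal Python sketch implements a feature map with learnable frequencies; the class name and hyperparameters are assumptions for illustration, and the paper's nonstationary convolutional construction is more involved than this.

    # Minimal sketch: a spectral feature map with learnable frequencies, based on
    # Bochner's theorem (cos(xW + b) features approximate a shift-invariant kernel).
    # This illustrates the general idea only, not the CSKN architecture itself.
    import math
    import torch
    import torch.nn as nn

    class SpectralKernelLayer(nn.Module):
        def __init__(self, d_in, n_freq=256):
            super().__init__()
            self.W = nn.Parameter(torch.randn(d_in, n_freq))          # learnable frequencies
            self.b = nn.Parameter(2 * math.pi * torch.rand(n_freq))   # random phases

        def forward(self, x):
            # x: (batch, d_in) -> (batch, n_freq) spectral features
            return torch.cos(x @ self.W + self.b) * math.sqrt(2.0 / self.W.shape[1])

Deeper variants stack such feature maps, which is the regime in which the abstract's tighter error bounds are stated to apply.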



Paperid:2841
Authors:Xuhong Li, Haoyi Xiong, Xingjian Li, Xiao Zhang, Ji Liu, Haiyan Jiang, Zeyu Chen, Dejing Dou
Baidu Inc., Beijing, China, Baidu Inc., Beijing, China, Baidu Inc., Beijing, China, Tsinghua University, Beijing, China, Baidu Inc., Beijing, China, Baidu Inc., Beijing, China, Baidu Inc., Beijing, China, Baidu Inc., Beijing, China
Abstract:
To explain the prediction result of a Deep Neural Network (DNN) model on a given sample, LIME [1] and its derivatives have been proposed to approximate the local behavior of the DNN model around the data point via linear surrogates. Though these algorithms interpret the DNN by finding the key features used for classification, the random interpolations used by LIME perturb the explanation result and cause instability and inconsistency between repetitions of LIME computations. To tackle this issue, we propose G-LIME, which extends the vanilla LIME through high-dimensional Bayesian linear regression using sparse and informative global priors. Specifically, with a dataset representing the population of samples (e.g., the training set), G-LIME first pursues the global explanation of the DNN model using the whole dataset. Then, for a new data point, G-LIME incorporates a modified ElasticNet-like estimator to refine the local explanation result by balancing the distance to the global explanation and the sparsity/feature selection in the explanation. Finally, G-LIME uses Least Angle Regression (LARS) and retrieves the solution path of a modified ElasticNet under varying regularization strength, to screen and rank the importance of features [2] as the explanation result. Through extensive experiments on real-world tasks, we show that the proposed method yields more stable, consistent, and accurate results compared to LIME.
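To make the global-prior-plus-local-refinement recipe concrete, here is a hedged Python sketch using scikit-learn; the shrinkage scheme, hyperparameters, and function names are illustrative assumptions, not the authors' exact Bayesian estimator.

    # Illustrative sketch (not the authors' exact estimator): a sparse global surrogate
    # is fit on the whole dataset, a local explanation around a new sample is refined
    # with an ElasticNet-style penalty that shrinks toward the global coefficients,
    # and a LARS solution path ranks feature importance.
    from sklearn.linear_model import ElasticNet, lars_path

    def global_explanation(X, y_model):
        # y_model = f(X): black-box predictions on the reference dataset
        return ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X, y_model).coef_

    def local_explanation(X_local, y_local, beta_global, shrink=0.5):
        # Regress the residual w.r.t. the global explanation so local coefficients
        # stay close to beta_global unless the local data disagrees.
        residual = y_local - X_local @ beta_global
        local = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X_local, residual).coef_
        return beta_global + shrink * local

    def rank_features(X_local, y_local):
        # The order in which features enter the LARS path gives an importance ranking.
        _, active, _ = lars_path(X_local, y_local, method="lasso")
        return active  # feature indices, most important first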



Paperid:2842
Authors:Vincent Liu, James R. Wright, Martha White
University of Alberta and Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada, University of Alberta and Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada, University of Alberta and Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada
Abstract:
Offline reinforcement learning—learning a policy from a batch of data—is known to be hard for general MDPs. These results motivate the need to look at specific classes of MDPs where offline reinforcement learning might be feasible. In this work, we explore a restricted class of MDPs to obtain guarantees for offline reinforcement learning. The key property, which we call Action Impact Regularity (AIR), is that actions primarily impact a part of the state (an endogenous component) and have limited impact on the remaining part of the state (an exogenous component). AIR is a strong assumption, but it nonetheless holds in a number of real-world domains including financial markets. We discuss algorithms that exploit the AIR property, and provide a theoretical analysis for an algorithm based on Fitted-Q Iteration. Finally, we demonstrate that the algorithm outperforms existing offline reinforcement learning algorithms across different data collection policies in simulated and real-world environments where the regularity holds.
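For readers unfamiliar with the baseline the analysis builds on, the following is a minimal sketch of generic Fitted-Q Iteration on a fixed batch; it is not the paper's AIR-specific algorithm, and the regressor choice and data layout are assumptions. Under AIR, the exogenous part of the state could additionally be modeled from the data alone, since actions barely affect it.

    # Generic Fitted-Q Iteration on an offline batch (illustrative only).
    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor

    def fitted_q_iteration(batch, n_actions, gamma=0.99, iters=50):
        # batch: list of (state, action, reward, next_state) tuples from a fixed dataset
        S = np.array([b[0] for b in batch])
        A = np.array([b[1] for b in batch])
        R = np.array([b[2] for b in batch])
        S2 = np.array([b[3] for b in batch])
        X = np.hstack([S, A.reshape(-1, 1)])
        q = None
        for _ in range(iters):
            if q is None:
                target = R
            else:
                # Bootstrap with the greedy value of the next state.
                q_next = np.column_stack(
                    [q.predict(np.hstack([S2, np.full((len(S2), 1), a)]))
                     for a in range(n_actions)])
                target = R + gamma * q_next.max(axis=1)
            q = ExtraTreesRegressor(n_estimators=50).fit(X, target)
        return q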



Paperid:2843
Authors:Martin Michalowski, Robert Moskovitch, Nitesh V. Chawla
University of Minnesota, Minneapolis MN USA, Ben-Gurion University of the Negev, Beersheba Israel, University of Notre Dame, Notre Dame IN USA
Abstract:
The human race is facing one of the most meaningful public health emergencies in the modern era, caused by the COVID-19 pandemic. This pandemic introduced various challenges, from lock-downs with significant economic costs to fundamentally altering the way of life for many people around the world. The battle to understand and control the virus is still at its early stages, yet meaningful insights have already been made. The uncertainty of why some patients are infected and experience severe symptoms, while others are infected but asymptomatic, and others are not infected at all, makes managing this pandemic very challenging. Furthermore, the development of treatments and vaccines relies on knowledge generated from an ever-evolving and expanding information space. Given the availability of digital data in the modern era, artificial intelligence (AI) is a meaningful tool for addressing the various challenges introduced by this unexpected pandemic. Some of the challenges include: outbreak prediction, risk modeling including infection and symptom development, testing strategy optimization, drug development, treatment repurposing, vaccine development, and others.



Paperid:2844
Authors:João G. Ribeiro, Gonçalo Rodrigues, Alberto Sardinha, Francisco S. Melo
INESC-ID, IST Taguspark, Av. Prof. Dr. Cavaco Silva, Porto Salvo, 2744-016, Portugal, Google, Google Building 110, Brandschenkestrasse 110, Zürich, 8002, Switzerland, INESC-ID, IST Taguspark, Av. Prof. Dr. Cavaco Silva, Porto Salvo, 2744-016, Portugal Department of Informatics, Pontifical Catholic University of Rio de Janeiro, Brazil, INESC-ID, IST Taguspark, Av. Prof. Dr. Cavaco Silva, Porto Salvo, 2744-016, Portugal
Abstract:
This paper investigates the use of model-based reinforcement learning in the context of ad hoc teamwork. We introduce a novel approach, named TEAMSTER, where we propose learning both the environment's model and the model of the teammates' behavior separately. Compared to the state-of-the-art PLASTIC algorithms, our results in four different domains from the multi-agent systems literature show that TEAMSTER is more flexible than PLASTIC-Model, by learning the environment's model instead of assuming a perfect hand-coded model, and more robust/efficient than PLASTIC-Policy, by being able to continuously adapt to newly encountered teams, without implicitly learning a new environment model from scratch.



Paperid:2845
Authors:Patrick Rodler
University of Klagenfurt, Universitätsstr. 65-67, Klagenfurt, Austria
Abstract:
Model-based diagnosis aims at identifying the real cause of a system's malfunction based on a formal system model and observations of the system behavior. To discriminate between multiple fault hypotheses (diagnoses), sequential diagnosis approaches iteratively pose queries to an oracle to acquire additional knowledge about the diagnosed system. Depending on the system type, queries can capture, e.g., system tests, probes, measurements, or expert questions. As the determination of optimal queries is NP-hard, state-of-the-art sequential diagnosis methods rely on a myopic one-step-lookahead analysis which has proven to constitute a particularly favorable trade-off between computational efficiency and diagnostic effectivity. Yet, this solves only a part of the problem, as various sources of complexity, such as the reliance on costly reasoning services and large numbers of or not explicitly given query candidates, remain. To deal with such issues, existing approaches often make assumptions about the (i) type of diagnosed system, (ii) formalism to describe the system, (iii) inference engine, (iv) type of query to be of interest, (v) query quality criterion to be adopted, or (vi) diagnosis computation algorithm to be employed. Moreover, they (vii) often cannot deal with large or implicit query spaces or with expressive logics, or (viii) require inputs that cannot always be provided. As a remedy, we propose a novel one-step lookahead query computation technique for sequential diagnosis that overcomes the said issues of existing methods. Our approach (1) is based on a solid theory, (2) involves a systematic search for optimal queries, (3) can operate on implicit and huge query spaces, (4) allows for a two-stage optimization of queries (wrt. their number and cost), (5) is designed to reduce expensive logical inferences to a minimum, and (6) is generally applicable. The latter means that it can deal with any type of diagnosis problem as per Reiter's theory, is applicable with any monotonic knowledge representation language, can interact with a multitude of diagnosis engines and logical reasoners, and allows for a quality optimization of queries based on any of the common criteria in the literature. We extensively study the performance of the novel technique using a benchmark of real-world diagnosis problems. Our findings are that our approach enables the computation of optimal queries with hardly any delay, independently of the size and complexity of the considered benchmark problem. Moreover, it proves to be highly scalable, and it outperforms the state-of-the-art method in the domain of our benchmarks by orders of magnitude in terms of computation time while always returning a qualitatively as good or better query.



Paperid:2846
Authors:Baturay Saglam, Furkan Mutlu, Dogan Cicek, Suleyman Kozat
Department of Electrical Engineering, Yale University, New Haven, CT, USA, Department of Electrical and Electronics Engineering, Bilkent University, Ankara, Turkey, Department of Electrical and Electronics Engineering, Bilkent University, Ankara, Turkey, Department of Electrical and Electronics Engineering, Bilkent University, Ankara, Turkey
Abstract:
A widely studied deep reinforcement learning (RL) technique known as Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error. Although it has been shown that PER is one of the most crucial components for the overall performance of deep RL methods in discrete action domains, many empirical studies indicate that it considerably underperforms in off-policy actor-critic algorithms. We theoretically show that actor networks cannot be effectively trained with transitions that have large TD errors. As a result, the approximate policy gradient computed under the Q-network diverges from the actual gradient computed under the optimal Q-function. Motivated by this, we introduce a novel experience replay sampling framework for actor-critic methods, which also addresses issues with stability and recent findings behind the poor empirical performance of PER. The introduced algorithm suggests a new branch of improvements to PER and schedules effective and efficient training for both actor and critic networks. An extensive set of experiments verifies our theoretical findings, showing that our method outperforms competing approaches and achieves state-of-the-art results over the standard off-policy actor-critic algorithms.
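For context, the sampling mechanism whose effect on actor training the paper analyses is standard PER: transitions are drawn with probability proportional to |TD error|^alpha and de-biased with importance weights. The sketch below shows that baseline mechanism only, not the remedy proposed in the paper.

    # Standard prioritized replay sampling (the analysed baseline, not the proposed fix).
    import numpy as np

    class PrioritizedReplay:
        def __init__(self, capacity, alpha=0.6):
            self.capacity, self.alpha = capacity, alpha
            self.data, self.prios = [], []

        def add(self, transition, td_error):
            self.data.append(transition)
            self.prios.append((abs(td_error) + 1e-6) ** self.alpha)
            if len(self.data) > self.capacity:
                self.data.pop(0)
                self.prios.pop(0)

        def sample(self, batch_size, beta=0.4):
            p = np.array(self.prios) / sum(self.prios)
            idx = np.random.choice(len(self.data), batch_size, p=p)
            weights = (len(self.data) * p[idx]) ** (-beta)
            weights /= weights.max()  # normalised importance-sampling weights
            return [self.data[i] for i in idx], idx, weights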



Paperid:2847
Authors:Arnab Sarker, Peter Fisher, Joseph Gaudio, Anuradha Annaswamy
Massachusetts Institute of Technology, United States of America, Massachusetts Institute of Technology, United States of America, Massachusetts Institute of Technology, United States of America, Massachusetts Institute of Technology, United States of America
Abstract:
Analysis and synthesis of safety-critical autonomous systems are carried out using models which are often dynamic. Two central features of these dynamic systems are parameters and unmodeled dynamics. Much of feedback control design is parametric in nature and as such, accurate and fast estimation of the parameters in the modeled part of the dynamic system is a crucial property for designing risk-aware autonomous systems. This paper addresses the use of a spectral lines-based approach for estimating parameters of the dynamic model of an autonomous system. Existing literature has treated all unmodeled components of the dynamic system as sub-Gaussian noise and proposed parameter estimation using Gaussian noise-based exogenous signals. In contrast, we allow the unmodeled part to have deterministic unmodeled dynamics, which are almost always present in physical systems, in addition to sub-Gaussian noise. In addition, we propose a deterministic construction of the exogenous signal in order to carry out parameter estimation. We introduce a new tool kit which employs the theory of spectral lines, retains the stochastic setting, and leads to non-asymptotic bounds on the parameter estimation error. Unlike the existing stochastic approach, these bounds are tunable through an optimal choice of the spectrum of the exogenous signal leading to accurate parameter estimation. We also show that this estimation is robust to unmodeled dynamics, a property that is not assured by the existing approach. Finally, we show that under ideal conditions with no deterministic unmodeled dynamics, the proposed approach can ensure a Õ(√t) Regret, matching existing literature. Experiments are provided to support all theoretical derivations, which show that the spectral lines-based approach outperforms the Gaussian noise-based method when unmodeled dynamics are present, in terms of both parameter estimation error and Regret obtained using the parameter estimates with a Linear Quadratic Regulator in feedback.



Paperid:2848
Authors:Wout Schellaert, Fernando Martínez-Plumed, Karina Vold, John Burden, Pablo A. M. Casares, Bao Sheng Loe, Roi Reichart, Sean Ó hÉigeartaigh, Anna Korhonen, José Hernández-Orallo
VRAIN, Universitat Politècnica de València, Spain, VRAIN, Universitat Politècnica de València, Spain, Institute for the History and Philosophy of Science and Technology, University of Toronto, Canada, Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK, Universidad Complutense de Madrid, Spain, Psychometrics Centre, University of Cambridge, UK, Technion - Israel Institute of Technology, Israel, Centre for the Study of Existential Risk, University of Cambridge, UK, Language Technology Laboratory (LTL), University of Cambridge, UK, VRAIN, Universitat Politècnica de València, Spain
Abstract:
Even with obvious deficiencies, large prompt-commanded multimodal models are proving to be flexible cognitive tools representing an unprecedented generality. But the directness, diversity, and degree of user interaction create a distinctive “human-centred generality” (HCG), rather than a fully autonomous one. HCG implies that —for a specific user— a system is only as general as it is effective for the user’s relevant tasks and their prevalent ways of prompting. A human-centred evaluation of general-purpose AI systems therefore needs to reflect the personal nature of interaction, tasks and cognition. We argue that the best way to understand these systems is as highly-coupled cognitive extenders, and to analyse the bidirectional cognitive adaptations between them and humans. In this paper, we give a formulation of HCG, as well as a high-level overview of the elements and trade-offs involved in the prompting process. We end the paper by outlining some essential research questions and suggestions for improving evaluation practices, which we envision as characteristic for the evaluation of general artificial intelligence in the future.



Paperid:2849
Authors:Richard S. Sutton, Marlos C. Machado, G. Zacharias Holland, David Szepesvari, Finbarr Timbers, Brian Tanner, Adam White
DeepMind, Edmonton, Alberta, Canada University of Alberta, Edmonton, Alberta, Canada Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada Canada CIFAR AI Chair, Canada, DeepMind, Edmonton, Alberta, Canada University of Alberta, Edmonton, Alberta, Canada Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada Canada CIFAR AI Chair, Canada, DeepMind, Edmonton, Alberta, Canada, DeepMind, Edmonton, Alberta, Canada, DeepMind, Edmonton, Alberta, Canada, DeepMind, Edmonton, Alberta, Canada, DeepMind, Edmonton, Alberta, Canada University of Alberta, Edmonton, Alberta, Canada Alberta Machine Intelligence Institute (Amii), Edmonton, Alberta, Canada Canada CIFAR AI Chair, Canada
Abstract:
To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward-respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.
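The subtask construction described above (original reward plus a termination bonus tied to a state feature) can be sketched in a few lines; the exact bonus form and discounting used in the paper may differ, and `feature` is a user-supplied state feature introduced only for illustration.

    # Hedged sketch of a reward-respecting subtask return: the option accumulates the
    # original discounted rewards, and on termination additionally receives a bonus
    # proportional to a chosen feature of the stopping state.
    def subtask_return(rewards, gamma, stop_state, feature, bonus_weight=1.0):
        g = sum((gamma ** t) * r for t, r in enumerate(rewards))   # original reward respected
        return g + (gamma ** len(rewards)) * bonus_weight * feature(stop_state)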



Paperid:2850
Authors:Seid Miad Zandavi
School of Computer Science, The University of Sydney, Sydney, Australia School of Biotechnology and Biomolecular Sciences, The University of New South Wales (UNSW), Sydney, Australia Department of Pediatrics, Harvard Medical School, Boston, MA, USA
Abstract:
A new method is proposed to increase the accuracy of state-of-the-art single image super-resolution (SISR) using a novel training procedure. The proposed method, named post-trained convolutional neural network (CNN), applies the stochastic dual simplex algorithm (SDSA) to the last reconstruction layer. The method utilizes contextual information to update the last reconstruction layer of the CNN: the extracted contextual information is projected onto the last reconstruction layer through weights and a bias optimized by SDSA. The post-trained CNN is applied to the very deep super-resolution (VDSR) method to show its performance. The quantitative and visual results demonstrate that the proposed post-trained VDSR (PTVDSR) exhibits excellent and competitive performance when compared with VDSR and other super-resolution methods.



Paperid:2851
Authors:Chao Chen, Jie Liu, Chang Zhou, Jie Tang, Gangshan Wu
Nanjing University, Nanjing University, Nanjing University, Nanjing University, Nanjing University
Abstract:
Lane detection aims to determine the precise location and shape of lanes on the road. Despite efforts made by current methods, it remains a challenging task due to the complexity of real-world scenarios. Existing approaches, whether proposal-based or keypoint-based, suffer from depicting lanes effectively and efficiently. Proposal-based methods detect lanes by distinguishing and regressing a collection of proposals in a streamlined top-down way, yet lack sufficient flexibility in lane representation. Keypoint-based methods, on the other hand, construct lanes flexibly from local descriptors, which typically entail complicated post-processing. In this paper, we present a “Sketch-and-Refine” paradigm that utilizes the merits of both keypoint-based and proposal-based methods. The motivation is that local directions of lanes are semantically simple and clear. At the “Sketch” stage, local directions of keypoints can be easily estimated by fast convolutional layers. Then we can build a set of lane proposals accordingly with moderate accuracy. At the “Refine” stage, we further optimize these proposals via a novel Lane Segment Association Module (LSAM), which allows adaptive lane segment adjustment. Last but not least, we propose multi-level feature integration to enrich lane feature representations more efficiently. Based on the proposed “Sketch-and-Refine” paradigm, we propose a fast yet effective lane detector dubbed “SRLane”. Experiments show that our SRLane can run at a fast speed (i.e., 278 FPS) while yielding an F1 score of 78.9%. The source code is available at: https://github.com/passerer/SRLane.



Paperid:2852
Authors:Hongxu Chen, Quan Zhang, Jian-Huang Lai, Xiaohua Xie
School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, China, School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, China, School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, China Pazhou Lab (HuangPu), Guangdong 510000, China Guangdong Key Laboratory of Information Security Technology, Guangzhou 510006, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China, School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, China Pazhou Lab (HuangPu), Guangdong 510000, China Guangdong Key Laboratory of Information Security Technology, Guangzhou 510006, China Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
Abstract:
Group re-identification (G-ReID) aims to correctly associate groups with the same members captured by different cameras. However, supervised approaches for this task often suffer from the high cost of cross-camera sample labeling. Unsupervised methods based on clustering can avoid sample labeling, but the problem of member variations often makes clustering unstable, leading to incorrect pseudo-labels. To address these challenges, we propose an adaptive clustering-driven progressive learning approach (ACPL), which consists of a group adaptive clustering (GAC) module and a global dynamic prototype update (GDPU) module. Specifically, GAC designs the quasi-distance between groups, thus fully capitalizing on both individual-level and holistic information within groups. In the case of great uncertainty in intra-group members, GAC effectively minimizes the impact of non-discriminative features and reduces the noise in the model's pseudo-labels. Additionally, our GDPU devises a dynamic weight to update the prototypes and effectively mine the hard samples with complex member variations, which improves the model's robustness. Extensive experiments conducted on four popular G-ReID datasets demonstrate that our method not only achieves state-of-the-art performance on unsupervised G-ReID but also performs comparably to several fully supervised approaches.



Paperid:2853
Authors:Zhenkai Lin, Yanli Ji, Yang Yang
University of Electronic Science and Technology of China, University of Electronic Science and Technology of China, University of Electronic Science and Technology of China
Abstract:
Sound mixture separation is still challenging due to heavy sound overlapping and disturbance from noise. Unsupervised separation would significantly increase the difficulty. As sound overlapping always hinders accurate sound separation, we propose an Independency Adversarial Learning based Cross-Modal Sound Separation (IAL-CMS) approach, where IAL employs adversarial learning to minimize the correlation of separated sound elements, exploring high sound independence; CMS performs cross-modal sound separation, incorporating audio-visual consistent feature learning and interactive cross-attention learning to emphasize the semantic consistency among cross-modal features. Both audio-visual consistency and audio consistency are kept to guarantee accurate separation. The consistency and sound independence ensure the decomposition of overlapping mixtures into unrelated and distinguishable sound elements. The proposed approach is evaluated on MUSIC, VGGSound, and AudioSet. Extensive experiments certify that our approach outperforms existing approaches in both supervised and unsupervised scenarios.



Paperid:2854
Authors:Chenbo Zhang, Yinglu Zhang, Lu Zhang, Jiajia Zhao, Jihong Guan, Shuigeng Zhou
Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, China, Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, China, Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, China, Science and Technology on Complex System Control and Intelligent Agent Cooperation Laboratory, Beijing Electro-Mechanical Engineering Institute, China, Department of Computer Science & Technology, Tongji University, China, Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, China
Abstract:
In recent years, Few-shot Object Detection (FSOD) has become an increasingly important research topic in computer vision. However, existing FSOD methods require strong annotations including category labels and bounding boxes, and their performance is heavily dependent on the quality of box annotations. Yet acquiring strong annotations is both expensive and time-consuming. This inspires the study of weakly supervised FSOD (WS-FSOD in short), which realizes FSOD with only image-level annotations, i.e., category labels. In this paper, we propose a new and effective weakly supervised FSOD method named WFS-DETR. By a well-designed pretraining process, WFS-DETR first acquires general object localization and integrity judgment capabilities on large-scale pretraining data. Then, it introduces object integrity into multiple-instance learning to solve the common local optimum problem by comprehensively exploiting both semantic and visual information. Finally, with simple fine-tuning, it transfers the knowledge learned from the base classes to the novel classes, which enables accurate detection of novel objects. Benefiting from this “pretraining-refinement” mechanism, WFS-DETR achieves good generalization on different datasets. Extensive experiments also show that the proposed method clearly outperforms existing counterparts on the WS-FSOD task.



Paperid:2855
Authors:Riwei Lai, Rui Chen, Qilong Han, Chi Zhang, Li Chen
College of Computer Science and Technology, Harbin Engineering University Department of Computer Science, Hong Kong Baptist University, College of Computer Science and Technology, Harbin Engineering University, College of Computer Science and Technology, Harbin Engineering University, College of Computer Science and Technology, Harbin Engineering University, Department of Computer Science, Hong Kong Baptist University
Abstract:
Negative sampling is essential for implicit collaborative filtering to provide proper negative training signals so as to achieve desirable performance. We experimentally unveil a common limitation of all existing negative sampling methods: they can only select negative samples of a fixed hardness level, leading to the false positive problem (FPP) and false negative problem (FNP). We then propose a new paradigm called adaptive hardness negative sampling (AHNS) and discuss its three key criteria. By adaptively selecting negative samples with appropriate hardnesses during the training process, AHNS can well mitigate the impacts of FPP and FNP. Next, we present a concrete instantiation of AHNS called AHNS_{p<0}, and theoretically demonstrate that AHNS_{p<0} can fit the three criteria of AHNS well and achieve a larger lower bound of normalized discounted cumulative gain. Besides, we note that existing negative sampling methods can be regarded as more relaxed cases of AHNS. Finally, we conduct comprehensive experiments, and the results show that AHNS_{p<0} can consistently and substantially outperform several state-of-the-art competitors on multiple datasets.
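To illustrate the adaptive-hardness idea, the sketch below picks, from a pool of candidates, the negative whose predicted score is closest to a target hardness that depends on the positive item's score. The specific target function (a simple scaling by beta) is an illustrative assumption, not the paper's AHNS_{p<0} instantiation.

    # Hedged sketch of adaptive-hardness negative sampling.
    import numpy as np

    def adaptive_hardness_negative(user_emb, pos_emb, cand_embs, beta=0.8):
        s_pos = float(user_emb @ pos_emb)          # score of the observed positive
        s_neg = cand_embs @ user_emb               # scores of candidate negatives
        target = beta * s_pos                      # assumed mapping: harder positives -> harder negatives
        return int(np.argmin(np.abs(s_neg - target)))   # index of the chosen negative

    # Usage: idx = adaptive_hardness_negative(u, i_pos, np.stack(candidate_item_embs))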



Paperid:2856
Authors:Wael Hassanieh, Abdallah Chehade
University of Michigan-Dearborn, University of Michigan-Dearborn
Abstract:
In light of the advances in big data, high-dimensional datasets are often encountered. Incorporating them into data-driven models can enhance performance; however, this comes at the cost of high computation and the risk of overfitting, particularly due to abundant redundant features. Identifying an informative subset of the features helps in reducing the dimensionality and enhancing model interpretability. In this paper, we propose a novel framework for unsupervised feature selection, called Selective Deep Auto-Encoder (SDAE). It aims to reduce the number of features used in unlabeled datasets without compromising the quality of information obtained. It achieves this by selecting a sufficient subset of the original features capable of representing the entire feature space and reconstructing it. Architecturally, it leverages highly nonlinear latent representations in deep autoencoders and intrinsically learns, in an unsupervised fashion, a relevant and globally representative subset of features through a customized Selective Layer. Extensive experimental results on three high-dimensional public datasets show promising feature selection performance by SDAE in comparison to other existing state-of-the-art unsupervised feature selection methods.
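The abstract does not specify how the Selective Layer is built, so the following sketch substitutes a common alternative mechanism: a learned per-feature gate with an L1 sparsity penalty in front of an autoencoder. It is a stand-in for the general idea of reconstructing all features from a learned subset, not the SDAE layer itself.

    # Illustrative gated autoencoder for unsupervised feature selection (assumed design).
    import torch
    import torch.nn as nn

    class SelectiveAutoEncoder(nn.Module):
        def __init__(self, d_in, d_hidden=64):
            super().__init__()
            self.gate = nn.Parameter(torch.ones(d_in))   # one gate per input feature
            self.enc = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
            self.dec = nn.Linear(d_hidden, d_in)

        def forward(self, x):
            g = torch.sigmoid(self.gate)                 # soft feature selection
            return self.dec(self.enc(x * g)), g

    def loss_fn(x, x_hat, g, lam=1e-3):
        # Reconstruct all features from the gated subset; L1 pushes unused gates toward zero.
        return ((x - x_hat) ** 2).mean() + lam * g.abs().sum()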



Paperid:2857
Authors:Fang Kong, Shuai Li
Shanghai Jiao Tong University, Shanghai Jiao Tong University
Abstract:
Two-sided matching markets have been widely studied in the literature due to their rich applications. Since participants are usually uncertain about their preferences, online algorithms have recently been adopted to learn them through iterative interactions. An existing work initiates the study of this problem in a many-to-one setting with responsiveness. However, its results are far from optimal and lack guarantees of incentive compatibility. We first extend an existing algorithm for the one-to-one setting to this more general setting and show that it achieves a near-optimal bound for player-optimal regret. Nevertheless, due to the substantial requirement for collaboration, a single player's deviation could lead to a huge increase in its own cumulative rewards and a linear regret for others. In this paper, we aim to enhance the regret bound in many-to-one markets while ensuring incentive compatibility. We first propose the adaptively explore-then-deferred-acceptance (AETDA) algorithm for the responsiveness setting and derive an upper bound on player-optimal stable regret while demonstrating its guarantee of incentive compatibility. This result is a significant improvement over existing works, and to the best of our knowledge, it constitutes the first player-optimal guarantee in matching markets that offers such robust assurances. We also consider broader substitutable preferences, one of the most general conditions that ensure the existence of a stable matching and cover responsiveness. We devise an online DA (ODA) algorithm and establish an upper bound on the player-pessimal stable regret for this setting.
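The deferred-acceptance step named in AETDA is the classical many-to-one (capacity-constrained) Gale-Shapley procedure; a plain sketch is below. The bandit exploration that estimates the preferences and the regret analysis are the paper's contribution and are not sketched here; the input structures are hypothetical.

    # Player-proposing deferred acceptance for a many-to-one market (classical subroutine only).
    def deferred_acceptance(player_prefs, arm_prefs, capacities):
        # player_prefs[p]: list of arms in decreasing preference
        # arm_prefs[a]: dict mapping player -> rank (lower is better)
        nxt = {p: 0 for p in player_prefs}            # next arm each player will propose to
        matched = {a: [] for a in arm_prefs}          # tentative assignments
        free = list(player_prefs)
        while free:
            p = free.pop()
            if nxt[p] >= len(player_prefs[p]):
                continue                              # p has exhausted its preference list
            a = player_prefs[p][nxt[p]]
            nxt[p] += 1
            matched[a].append(p)
            matched[a].sort(key=lambda q: arm_prefs[a][q])
            if len(matched[a]) > capacities[a]:
                free.append(matched[a].pop())         # reject the least preferred proposer
        return matched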



Paperid:2858
Authors:Kumar Shubham, Pranav Sastry, Prathosh AP
Indian Institute of Science, Bangalore, India, Indian Institute of Science, Bangalore, India, Indian Institute of Science, Bangalore, India
Abstract:
Programmatic Weak Supervision (PWS) and generative models serve as crucial tools that enable researchers to maximize the utility of existing datasets without resorting to laborious data gathering and manual annotation processes. PWS uses various weak supervision techniques to estimate the underlying class labels of data, while generative models primarily concentrate on sampling from the underlying distribution of the given dataset. Although these methods have the potential to complement each other, they have mostly been studied independently. Recently, WSGAN proposed a mechanism to fuse these two models. Their approach utilizes the discrete latent factors of InfoGAN for the training of the label models and leverages the class-dependent information of the label models to generate images of specific classes. However, the disentangled latent factors learned by the InfoGAN may not necessarily be class-specific and hence could potentially affect the label model's accuracy. Moreover, the prediction of the label model is often noisy in nature and can have a detrimental impact on the quality of images generated by the GAN. In our work, we address these challenges by (i) implementing a noise-aware classifier using the pseudo labels generated by the label model, and (ii) utilizing the prediction of the noise-aware classifier for training the label model as well as for the generation of class-conditioned images. Additionally, we investigate the effect of training the classifier with a subset of the dataset selected within a defined uncertainty budget on the pseudo labels. We accomplish this by formalizing the subset selection problem as submodular maximization with a knapsack constraint on the entropy of the pseudo labels. We conduct experiments on multiple datasets and demonstrate the efficacy of our methods on several tasks vis-a-vis the current state-of-the-art methods. Our implementation is available at https://github.com/kyrs/subpws-gan.
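The subset-selection step (submodular maximization under a knapsack constraint on pseudo-label entropy) is typically solved with a cost-benefit greedy rule; the sketch below shows that generic scheme, with the submodular utility `gain` left as a placeholder assumption rather than the paper's exact objective.

    # Generic cost-benefit greedy for submodular maximization under a knapsack constraint.
    import numpy as np

    def entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return float(-(p * np.log(p)).sum())

    def greedy_knapsack(pseudo_probs, budget, gain):
        # pseudo_probs[i]: label-model distribution for sample i
        # budget: cap on the total entropy of the selected pseudo labels
        # gain(chosen, i): assumed submodular marginal utility of adding sample i
        costs = [entropy(p) for p in pseudo_probs]
        chosen, spent = [], 0.0
        remaining = set(range(len(pseudo_probs)))
        while True:
            affordable = [i for i in remaining if spent + costs[i] <= budget]
            if not affordable:
                break
            best = max(affordable, key=lambda i: gain(chosen, i) / max(costs[i], 1e-12))
            if gain(chosen, best) <= 0:
                break
            chosen.append(best)
            spent += costs[best]
            remaining.remove(best)
        return chosen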



Paperid:2859
Authors:Mengdi Wang, Anna Bodonhelyi, Efe Bozkir, Enkelejda Kasneci
Technical University of Munich, Munich, Bavaria, Germany, Technical University of Munich, Munich, Bavaria, Germany, Technical University of Munich, Munich, Bavaria, Germany, Technical University of Munich, Munich, Bavaria, Germany
Abstract:
Federated learning is a distributed collaborative machine learning paradigm that has gained strong momentum in recent years. In federated learning, a central server periodically coordinates models with clients and aggregates the models trained locally by clients without necessitating access to local data. Despite its potential, the implementation of federated learning continues to encounter several challenges, predominantly the slow convergence that is largely due to data heterogeneity. The slow convergence becomes particularly problematic in cross-device federated learning scenarios where clients may be strongly limited by computing power and storage space, and hence counteracting methods that induce additional computation or memory cost on the client side, such as auxiliary objective terms and larger training iterations, can be impractical. In this paper, we propose a novel federated aggregation strategy, TurboSVM-FL, that poses no additional computation burden on the client side and can significantly accelerate convergence for federated classification tasks, especially when clients are "lazy" and train their models for only a few epochs before the next global aggregation. TurboSVM-FL extensively utilizes support vector machines to conduct selective aggregation and max-margin spread-out regularization on class embeddings. We evaluate TurboSVM-FL on multiple datasets including FEMNIST, CelebA, and Shakespeare using user-independent validation with non-iid data distribution. Our results show that TurboSVM-FL can significantly outperform existing popular algorithms in convergence rate and reduce communication rounds while delivering better test metrics including accuracy, F1 score, and MCC.
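One way to read "SVM-guided selective aggregation" on the server side is sketched below: fit a linear SVM on the class embeddings uploaded by clients and average only the models of clients whose embeddings become support vectors. This is an illustrative reading under stated assumptions (one row per uploaded class embedding, a client_ids index, model parameters as flat arrays), not the exact TurboSVM-FL procedure.

    # Hedged sketch of server-side, SVM-guided selective aggregation (assumed design).
    import numpy as np
    from sklearn.svm import SVC

    def svm_selective_aggregate(client_models, class_embeddings, class_labels, client_ids):
        # class_embeddings: (n_rows, d) class embeddings uploaded by clients
        # class_labels: class of each row; client_ids[i]: which client produced row i
        svm = SVC(kernel="linear").fit(class_embeddings, class_labels)
        selected = sorted({client_ids[i] for i in svm.support_})   # clients holding support vectors
        stacked = np.stack([client_models[c] for c in selected])
        return stacked.mean(axis=0), selected                      # averaged model and the clients used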



Paperid:2860
Authors:Muyao Wang, Wenchao Chen, Bo Chen
Xidian university, Xidian University, Xidian University
Abstract:
The forecasting of Multivariate Time Series (MTS) has long been an important but challenging task. Due to the non-stationarity across long-distance time steps, previous studies primarily adopt stationarization methods to attenuate the non-stationary problem of the original series for better predictability. However, existing methods always adopt the stationarized series, ignoring the inherent non-stationarity, and have difficulty in modeling MTS with complex distributions due to the lack of stochasticity. To tackle these problems, we first develop a powerful hierarchical probabilistic generative module to capture the non-stationary and stochastic characteristics within MTS, and then combine it with a transformer to form a well-defined variational generative dynamic model named Hierarchical Time series Variational Transformer (HTV-Trans), which recovers the intrinsic non-stationary information into temporal dependencies. Being a powerful probabilistic model, HTV-Trans is utilized to learn expressive representations of MTS and is applied to forecasting tasks. Extensive experiments on diverse datasets show the efficiency of HTV-Trans on MTS forecasting tasks.